I can't see the random forest for the trees
Craig Suckling
UK Government CDO | Driving change in Government, Industry, and Society with Data and AI | Ex-Amazon | DataIQ Top 100 in Data 2020, 2021, 2023 | NED
Part 2 in A Data Strategy for AI
In the first part of this blog series we covered how data for AI has become the new “data is the new oil”. But with an exponentially growing wealth of data ready to drive the next generation of value, it is easy to get lost in data bloat, driving costs sky high whilst seeing little in return.
Data needs to be harnessed, curated, and pointed intentionally at the right business problems, in the right format, and at the right time. There is nothing new in that statement. With AI, all the usual data challenges still apply, preventing value from being delivered fast and scaled broadly. Legacy systems and siloed data block access to the data needed to train and fine tune models. Central governance teams can't keep pace with business demands for new AI use cases whilst also being experts in a variety of multi-modal data. Business appetite for AI continues to increase, but trust in data quality remains low and lineage is fuzzy. The list goes on, and these challenges compound each other, frustrating efforts to shift from AI PoCs to production scale.
Getting your organisation through the AI PoC chasm requires adopting three principles: 1/ keeping a high bar on the data you choose to invest in, 2/ storing data in the right tool for the job, with interoperability, and 3/ enabling uniform and autonomous access to data across the organisation.
The good news! AI is increasingly available to fix the data “back office” problem and speed up our ability to deliver customer facing “front of house” AI use cases. Put another way, AI is folding back on itself to help solve the very challenges that prevent AI from scaling.
1/ Keep a high data bar.
The manufacturing industry has been one of the earlier adopters of AI. Advanced manufacturing organisations use AI to reduce product defects, speed up production, and maximise yield. Achieving these goals at scale without intelligent automation is a challenge: human quality managers can only scale so far against the demands of complex parallel manufacturing processes, detailed inspection requirements, and multivariate problem solving, all whilst adhering to health and safety policies. AI has proven highly effective in augmenting the role of quality managers across manufacturing. McKinsey research shows some manufacturing companies have been able to realise 20-40% reductions in yield loss with AI agents working alongside quality managers.
In the same way that AI has improved how physical products are manufactured, AI can assist in the data prep process to optimise the yield of high quality data products for AI use cases. (Side note: there is precedent for this AI-for-AI scenario. In the chip manufacturing industry, NVIDIA uses AI to optimise the production of the GPU chips built to train AI.)
We now have the opportunity to automate the manual heavy lifting in data prep. AI models can be trained to detect and strip out sensitive data, identify anomalies, infer records of source, determine schemas, eliminate duplication, and crawl over data to detect bias. There is an explosion of new services and tools available to take the grunt work out of data prep and keep the data bar high. By automating these labour-intensive tasks, organisations can accelerate data preparation, reduce errors, and free people up to focus on higher-value work, making AI initiatives more efficient and effective.
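To make this concrete, below is a minimal sketch of automating a slice of that grunt work, assuming Python with pandas and scikit-learn. The PII patterns, column names, and sample data are illustrative assumptions rather than any particular product or pipeline, and real services would use trained models for each step.

```python
# Illustrative sketch: automate parts of data prep (redact obvious PII, drop
# duplicates, flag numeric anomalies for review). Patterns, column names, and
# sample data are assumptions for the example, not a specific product.
import re

import pandas as pd
from sklearn.ensemble import IsolationForest

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def redact_pii(text: str) -> str:
    """Strip out obviously sensitive values before data flows downstream."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                               # eliminate duplication
    df["notes"] = df["notes"].astype(str).map(redact_pii)   # strip sensitive data
    numeric = df.select_dtypes("number")
    if not numeric.empty:
        # Flag outliers for human review rather than silently dropping them.
        df["anomaly"] = IsolationForest(random_state=0).fit_predict(numeric) == -1
    return df

sample = pd.DataFrame({
    "notes": ["Call me on +44 7700 900123", "alice@example.com logged a defect", "all good"],
    "value": [1.0, 1.1, 250.0],
})
print(prepare(sample))
```

The point is not the specific checks but the pattern: codify the quality bar once, then let automation apply it to every dataset that feeds an AI use case.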
2/ Right tool for the job
AI models need data for a host of different reasons. Data is not just needed during training cycles for building new models. It is also required for fine tuning foundation models to increase relevance within a specific functional domain, and during model inference to provide contextual grounding and reduce hallucinations with techniques like Retrieval Augmented Generation (RAG) and Memory Augmented Generation. These different usage categories all require different modes of data storage.
Training a new model requires data in many forms, from large volumes of text to high quality image, audio, and video, stored in standard formats that training pipelines can process efficiently. Cloud based data lakes suit training data well, as they can handle the terabytes or petabytes of raw data that go into training AI models with the requisite scalability, durability, and integration capabilities. In comparison, data for fine tuning generative AI models is typically restricted to domain specific data representative of the task or domain you want to tune the model for, such as medical documents, legal contracts, or customer support conversations. Fine tuning data usually needs to be labelled and structured to guide the model's learning towards a desired output, making relational or NoSQL document databases the better choice. Retrieval Augmented Generation (RAG) relies on knowledge bases, which are structured collections of information (e.g. documents, web pages, databases). RAG stores need to support efficient information retrieval through techniques such as indexing, searching, and ranking, and they also need to capture the relationship between user input, retrieved knowledge, and LLM output. For this reason vector databases (which use embeddings to preserve semantic relationships) and graph databases (which represent relationships as a structured graph) are well suited to RAG.
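As an illustration of why embeddings suit RAG stores, here is a toy retrieval sketch in Python: documents and a user query are embedded, candidates are ranked by similarity, and the top passages would then ground the model's response. The hash-based embed function is a stand-in for a real embedding model, the knowledge base contents are invented for the example, and a production system would use a vector database rather than an in-memory list.

```python
# Toy sketch of the RAG retrieval step: embed documents once, embed the query,
# rank by similarity, and pass the top passages to the model as grounding.
# embed() is a hash-based stand-in for a real embedding model.
import hashlib

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic toy embedding; swap in a real embedding model in practice."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[int(hashlib.md5(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

knowledge_base = [                      # contents invented for the example
    "Refunds are processed within five working days.",
    "Support is available 24/7 via live chat.",
    "Orders over 50 pounds ship free.",
]
index = [(doc, embed(doc)) for doc in knowledge_base]   # offline indexing step

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: float(q @ item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The retrieved passages would be inserted into the prompt to ground the answer.
print(retrieve("How long do refunds take to process?"))
```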
Add this to the plethora of other BI, analytics, and traditional machine learning use cases that support data insight and intelligence work across an organisation, and it quickly becomes evident that variety really matters. Providing AI developer teams with a diverse choice of data storage applications is imperative to matching the right tool to the job at hand.
3/ Uniform, autonomous access (with security)
AI is being experimented with and adopted broadly across organisations. With so much activity and interest it's difficult to centralise work, and centralisation often creates bottlenecks that slow innovation. Encouraging decentralisation and autonomy in delivering AI use cases is beneficial: it increases capacity for innovation across many teams and embeds work into the business with a focus on specific business priorities. This can only work, however, with a level of uniformity in data across the organisation (see more on the balance of autonomy and uniformity in this previous post).
Businesses need to standardise how data is catalogued, tagged with metadata, discovered, and accessed by teams, so that there is consistency in how data is interpreted and used in AI use cases. This is especially important where AI augments work that spans departments. For example, delivering a seamless customer experience across sales, accounts, and support requires all teams working on AI use cases to use a common definition of the customer and their purchase and support history. Optimising a supply chain across demand forecasting, inventory management, and logistics planning teams requires consistency in data products for suppliers, SKUs, and customer orders.
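One lightweight way to picture this standardisation is a common data product descriptor that every team fills in the same way before publishing to the catalogue. The sketch below is illustrative only; the field names and the "customer_360" example are assumptions, not the schema of any particular catalogue tool.

```python
# Illustrative data product descriptor: every team registers datasets with the
# same metadata shape so cataloguing, tagging, discovery, and access requests
# work uniformly. Field names are assumptions, not a specific catalogue schema.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str                               # e.g. "customer_360"
    domain: str                             # owning business domain
    owner: str                              # accountable team
    schema: dict[str, str]                  # field name -> type
    tags: list[str] = field(default_factory=list)
    quality_checks: list[str] = field(default_factory=list)
    access_policy: str = "request-via-catalogue"

# A shared "customer" definition that sales, accounts, and support all build on.
customer = DataProduct(
    name="customer_360",
    domain="customer",
    owner="crm-platform-team",
    schema={
        "customer_id": "string",
        "purchase_history": "array<order>",
        "support_tickets": "array<ticket>",
    },
    tags=["pii", "golden-record"],
    quality_checks=["customer_id is unique", "purchase_history refreshed daily"],
)
print(customer.name, customer.tags)
```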
Increasing uniformity in data products across functions lets organisations improve the quality and reliability of their AI models. Combined with providing choice in data tools to match the AI task at hand, and maintaining a high quality bar on data, this allows AI work to be conducted with increasing autonomy, delivering greater business value faster.
In the next post in this series we will continue to move up the stack, away from data sourcing and management and into the activation of value. Next up is Agentic AI: how it is transforming how we think about conducting data work, and how it augments the way we use data to deliver business value.
(Views in this article are my own.)