Data Integration for AI at Scale – Identification 4 of 8
AI builds on computing and data. Computing was discussed last time; now it is time to explore data capabilities.
Rather than Data Management, the title Data Integration has been chosen to highlight the consumption perspective. The emphasis is on AI data needs rather than on other data consumers such as BI or direct data asset monetization – not that they wouldn’t matter.
In reality, however, all data management best practices apply. AI as the primary customer does not change that. On the contrary, in order to get data successfully integrated to power those hundreds of AI use cases, many things need to click. As the exploration of AI data needs reveals, the bar is set high on everything about data and related capabilities like data governance.
Because of the apparent symmetry between computing and data, the structure of this article follows that of computing: Needs, Solutions, Constraints and Strategy. The content is new but the contexts are largely the same.
The structure of the article follows that of computing: Needs, Solutions, Constraints and Strategy
AI data needs
Exploring AI data needs starts from identifying contexts. With computing, the primary contexts were AI model training and inference. Because of the symmetry between computing and data, there are no reasons to choose otherwise. In addition, computing environments of products, services and processes form the underlying contexts.
However, the symmetry between computing and data ends there. Due to the evolutionary steps of AI technology and the great diversity of AI use cases, the picture of AI data needs is more nuanced than that of AI computing needs. In other words, the full picture of AI data needs emerges from the synthesis of AI model training and inference as contexts, combined with the variance created by AI technology evolution steps and use case specific needs.
The picture of AI data needs is more nuanced compared to AI computing needs
AI at Scale has both a qualitative and a quantitative definition. The former is about embedding AI in all aspects of value creation. The latter entails tens or even hundreds of concurrent AI use cases. When combined with AI technology evolutionary steps, the total landscape of AI data needs starts to take form.
At the end of the day, understanding AI data needs is about strategic clarity. The complexities and uncertainties associated with the Age of AI make achieving clarity important. In reality, serving all AI data needs may take a long time due to capability build-up lead times. But there are no excuses for not understanding the needs themselves.
Understanding AI data needs adds to strategic clarity
Article What matters now: AI at Scale captures the essence with this: “Data, AI and software engineering practices are foundational in achieving AI at Scale. Rather than based on improvisation or artisan-like tailoring and tinkering, these operations need to be industrial-grade from design to development and from testing to deployment.”
That’s it in one word: Industrial-grade. AI Engineering and Software Engineering will be explored in future articles; let’s start with data capabilities.
AI at Scale builds on industrial-grade data, AI and software capabilities.
AI model training
Creation and refinement of high-performance AI models require a lot from data. Here’s an outline of the key factors:
Inference
Inference is about applying trained models to new, unseen data to create predictions and insights – or, in the case of generative AI, new data. While model training requires large, often historical datasets, inference has more specific and varied data needs depending on the use case. Here’s an outline of key factors:
Technology evolution perspective
As AI evolves from Traditional Analytics (TA) to Machine Learning (ML), and further to Deep Learning (DL) and Generative AI (GAI), the data needs become increasingly complex and demanding. Each evolutionary step brings its own data requirements, shaping the data solutions necessary to support AI at Scale.
Traditional Analytics
Traditional Analytics, often classified as descriptive or diagnostic analytics, focuses on interpreting past data to derive insights, typically through business intelligence tools or statistical methods. Data needs for analytics are less complex than for ML, DL or GAI.
Training data needs in TA include things like:
Correspondingly, inference data needs in TA:
Machine Learning
ML involves learning patterns from data using statistical models such as regression, decision trees or support vector machines (SVMs). It is widely used in structured data environments like finance and retail.
Training data needs in ML:
Correspondingly, inference data needs are as follows:
Deep Learning
DL involves neural networks that automatically learn complex patterns from vast amounts of data, significantly reducing the need for human-designed features. DL is the foundation for tasks like image recognition, natural language processing, and complex time-series analysis.
Training data needs in DL:
DL inference data needs:
Generative AI
Generative AI goes beyond recognizing patterns in data to creating new data: text, audio, images and video. This adds new dimensions to data needs.
Generative AI adds new dimensions to data needs
GAI training data needs:
Correspondingly, GAI inference data needs are as follows:
In conclusion, advances in AI technology demand a lot more not only from computing but also from data. Powering state-of-the-art multimodal Generative AI use cases is a non-trivial data integration task.
Multimodal Generative AI use cases require a lot from data integration
Use cases perspective
Taking a use case perspective on AI data needs is a powerful way to complement the perspectives discussed above. Looking at AI data needs through the lens of specific use cases highlights how different applications have distinct data requirements, even though they may share underlying AI techniques.
Rather than relying on the narrow focus of context-specific use cases, the granularity deployed here is the generic use case discussed in an earlier article on AI technology evolution. That gives a much better overview of the relationship between AI use cases and their typical data needs – though detailed data needs are, in the end, always context-specific.
Natural Language Processing (NLP)
Context-specific use case examples: Text classification, sentiment analysis, machine translation, chatbot.
Data needs:
Image and Video Processing
Context-specific use case examples: Object detection, image classification, facial recognition, video surveillance analysis.
Data needs:
Speech Recognition
Context-specific use case examples: Voice-to-text, virtual assistants, real-time transcription, voice commands.
Data needs:
Automation
Context-specific use case examples: Workflow automation, intelligent assistants, autonomous vehicles.
Data needs:
Multimodal User Interface
Context-specific use case examples: UI that combines text, voice and visual inputs, e.g. AI assistants with visual components.
Data needs:
Recommendation Systems
Context-specific use case examples: E-commerce product recommendations, content streaming platforms, personalized ads.
Data needs:
Anomaly Detection
Context-specific use case examples: Fraud detection, network intrusion detection, manufacturing defect detection.
Data needs:
Robotics
Context-specific use case examples: Industrial robots, autonomous drones, robotic surgery.
Data needs:
Digital Twins
Context-specific use case examples: Virtual models of physical assets for real-time monitoring, predictive maintenance, simulations.
Data needs:
Predictive Analytics
Context-specific use case examples: Sales forecasting, customer churn prediction, supply chain optimization.
Data needs:
Operational Process Optimization
Context-specific use case examples: Manufacturing line optimization, resource allocation in logistics, scheduling.
Data needs:
Resource Usage Optimization
Context-specific use case examples: Energy consumption optimization, water usage management, cloud resource optimization.
Data needs:
AI data solutions
Put simply, AI data solutions are about serving AI data needs across all contexts from training to inference, and across all technology evolutionary steps and AI use cases. As we will shortly discover, the ways data solutions answer the call are diverse and may appear complex if not addressed with rigor.
AI data solutions are about serving AI data needs outlined above
Some solutions are foundational without realistic alternatives; others come with alternatives and options. Some operate more in the background, others on the front line with direct involvement in operationalizing AI training and inference.
In the context of AI at Scale, data solutions’ scalability is naturally in focus. It turns out that there are fundamental differences in that respect – to be taken into consideration when AI at Scale is set as a strategic objective.
To keep this article reasonably compact, solutions are merely introduced. That is, they are Identified. It is for later articles to explore their Acquisition (how to obtain the digital capability in question), Configuration (how to organize and structure the capability), and Management (how to maintain the capability for optimum long-term results).
This article is about Identification. It is for later articles to explore the aspects of Acquisition, Configuration and Management of data solutions alongside other digital capabilities needed for AI at Scale.
Data Governance
Data Governance is about those underlying and often invisible structures, processes and practices without which the data integration frontline would soon collapse. AI at Scale relies on rock-solid data governance practices and there’s little room for compromise.
AI at Scale relies on rock-solid data governance
As in the case of corporate governance, significant failures in data governance may fester into a catastrophe. Less dramatically, data governance shortcomings lead to unreliable data for AI models, compliance risks, and operational inefficiencies.
Data Governance is an enabler for a multitude of things including data quality, consistency, security, privacy and regulatory compliance. However, through ownership, accountability and initiative, data governance also connects to data relevance and business value.
It turns out that there’s a superstructure above data governance: the company operating model. It goes without saying that these structural elements need to be carefully aligned. The Operating Model and related structural elements will be discussed in later articles. For now, they are assumed rather than directly observed.
Data governance is to enable data quality, consistency, security, privacy and compliance but also ownership, accountability and initiative.
Data governance is best seen as a framework encompassing policies, standards and practises that ensure effective and efficient use of data within an organization.
The key data governance design parameter is this: it should facilitate innovation and progress rather than impede them. Poorly designed or implemented data governance may deliver perfect regulatory compliance but with excessive harm to the business. Therefore, the data governance framework needs to be agile, with streamlined operations and a maximum amount of automation. All practices and standards must work to improve efficiency rather than add control for control’s sake.
Implementing data governance in the absence of business acumen may lead to serious shortcomings. Therefore, data governance implementation in the context of AI at Scale connects to strategic management.
Data governance is to facilitate rather than impede digital innovation
Overall, a data governance framework consists of the following key elements:
An ideal data governance implementation scales through decentralization and deploys a high degree of automation. For federated data governance with automated policy enforcement, see Data Mesh below.
High-performance, scalable data governance utilizes decentralization and automation
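To make “automated policy enforcement” concrete, below is a minimal policy-as-code sketch in Python of the kind of check a domain’s deployment pipeline could run automatically. The metadata fields and policy rules are hypothetical, purely for illustration; real implementations would typically rely on dedicated policy engines and platform tooling rather than hand-rolled checks.

```python
# Minimal sketch of automated policy enforcement over data product metadata.
# Fields and rules are hypothetical and for illustration only.
from dataclasses import dataclass, field

@dataclass
class DataProductMetadata:
    name: str
    owner_domain: str
    pii_fields: list = field(default_factory=list)
    retention_days: int = 0
    classification: str = "internal"

def enforce_policies(meta: DataProductMetadata) -> list:
    """Return a list of policy violations; an empty list means compliant."""
    violations = []
    if not meta.owner_domain:
        violations.append("Every data product must have an owning business domain.")
    if meta.pii_fields and meta.classification != "restricted":
        violations.append("Products containing PII must be classified as 'restricted'.")
    if meta.pii_fields and meta.retention_days > 365:
        violations.append("PII retention must not exceed 365 days.")
    return violations

# Usage: run the same check automatically in every domain's deployment pipeline.
product = DataProductMetadata(
    name="customer_orders",
    owner_domain="sales",
    pii_fields=["email"],
    retention_days=30,
)
print(enforce_policies(product))  # flags the missing 'restricted' classification
```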
Batch data integration
Batch data integration is the process of consolidating data from different sources to provide a unified and coherent view of the data. Integration is basically about breaking down data silos by bringing data together from multiple disparate sources. The sources can be anything from legacy databases to IoT sensors and from operational ERP modules to commercial third-party data.
Batch data integration deals with large datasets. As discussed above, AI model training is largely based on batch data rather than real-time data. Therefore, batch data integration is a critical enabler for virtually all AI model training. Depending somewhat on the technology evolutionary step and use case, AI model training may require enormous volumes of highly diverse data, including unstructured data like images, videos and text. Batch data integration solutions must handle this scale and diversity efficiently.
Batch data integration is a critical enabler for AI model training
In addition to a unified view, other key batch data integration objectives include accuracy, consistency, timeliness and efficiency. Or to put it another way, making sure that AI training gets high-quality data in large enough quantities when it needs it.
Batch data integration can be depicted as a data pipeline with raw data extracted from multiple data sources, transformed to suit model training needs, and then loaded into a data repository to be accessed later – the so-called Extract-Transform-Load (ETL) pipeline.
There are several solutions available for each of those stages – for example, Apache Spark for data transformation and various data warehouse and data lake solutions to store data once it has been processed and transformed. A data warehouse is used for traditional structured data, whereas a data lake can store unstructured data as well. The combination of a data warehouse and a data lake is sometimes called a lakehouse.
ELT is another pipeline variant where data transformation happens after the data has been stored in the repository. Of the two, ETL is traditionally used when transformation logic is complex and the volume of data is manageable, whereas ELT is better suited for large datasets.
Interestingly, more complex transformations can even utilize AI models themselves. In such a scenario, an AI model would act both as a data consumer and as a data source.
Complex data transformation may utilize an AI model by itself
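As a concrete illustration of the ETL pattern, here is a toy pipeline in plain Python: extract from a CSV source, transform (a hypothetical currency normalization), and load into a local SQLite “repository”. Production pipelines would of course use tools such as Apache Spark and a data warehouse or lake, but the stages are the same.

```python
# A toy Extract-Transform-Load pipeline. Source data, rates and schema are made up.
import csv
import io
import sqlite3

# Extract: read raw records from a source (an in-memory CSV for brevity).
raw = io.StringIO("order_id,amount,currency\n1,100.0,EUR\n2,250.5,USD\n")
records = list(csv.DictReader(raw))

# Transform: normalize all amounts to EUR (hypothetical fixed USD rate).
USD_TO_EUR = 0.9
for r in records:
    amount = float(r["amount"])
    r["amount_eur"] = amount * USD_TO_EUR if r["currency"] == "USD" else amount

# Load: write the transformed records into the repository for later training use.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount_eur REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(int(r["order_id"]), r["amount_eur"]) for r in records],
)
print(conn.execute("SELECT * FROM orders").fetchall())
```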
Real-time data integration
Real-time data integration and batch data integration differ significantly in their architecture, methods, objectives and often also in contexts. The core objective of real-time data integration is to enable low-latency data ingestion, transformation and processing to meet real-time AI inference needs. In comparison, AI model training relies mostly, although not exclusively, on historical batch data.
The need to minimize latency is what defines real-time data integration. That is, real-time data integration solutions are optimized for low latency rather than for large amounts of historical data going through elaborate transformations. While batch data integration can take the time needed for complex data transformations, real-time integration targets simpler and quicker action.
The need to minimize latency defines real-time data integration
The use case perspective discussed above recognized the need for real-time data across the board: virtually all listed use cases require real-time data – if not for training, then at the latest for inference.
However, also in the context of inference, there is a significant amount of variance. AI use cases designed for immediate responses require real-time data. Correspondingly, use cases such as predictive analytics can be based on historical data and batch data integration.
An overview of key differences between batch and real-time data integration looks like this:
Traditional ETL/ELT pipelines are generally not used for real-time processing, primarily because the transform stage takes too much time. Instead, stream processing techniques and tools like event streaming platforms are employed to handle continuous data streams from, for example, IoT sensors in edge devices or online customer interactions. Low-latency data APIs can be deployed to access data needed for real-time inference.
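The sketch below illustrates the stream processing idea in plain Python: events are transformed one by one over a rolling window, rather than being accumulated for a heavy batch transform. The sensor readings and alert threshold are simulated; in practice an event streaming platform such as Kafka would deliver the events.

```python
# Minimal sketch of low-latency stream processing with a rolling window.
# Simulated events stand in for an event streaming platform delivering IoT data.
from collections import deque
import random
import time

def sensor_stream(n=20):
    """Simulated IoT sensor readings arriving as a continuous stream."""
    for _ in range(n):
        yield {"ts": time.time(), "temperature": 20 + random.random() * 5}

window = deque(maxlen=10)  # rolling window of the most recent readings

for event in sensor_stream():
    window.append(event["temperature"])
    rolling_avg = sum(window) / len(window)
    # A real pipeline would push this feature to the inference service here.
    if rolling_avg > 24.0:
        print(f"alert: rolling average {rolling_avg:.1f} C")
```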
DataOps
DevOps has firmly established itself as a software development practice that brings development and operations together to improve speed, agility and quality. DataOps extends many DevOps practices and methods to data. Further, MLOps builds on both of them in order to operationalize and scale up AI model development and deployment.
Although DevOps, DataOps and MLOps are conceptually separate, they are closely interlinked in practice. Therefore, the full “XOps” discussion will have to wait until the article on the Operating Model, for a unified and consistent treatment of the trio. In the meantime, a short introduction to DataOps will have to suffice.
The core DataOps objective is to orchestrate data workflows and automate pipelines, ensuring industrial-grade scale, quality, agility and speed across data operations – enabling innovative data solutions and AI use cases. AI at Scale cannot be based on artisan-like ad-hoc tailoring and tinkering.
DataOps brings industrial-grade scale, quality, agility and speed to data operations – enabling AI at Scale.
Just like with DevOps, Continuous Delivery is the central element of DataOps. Continuous Delivery builds on a set of interlinked activities including automation, computing environment configuration, version control of everything, operations monitoring, and roll-back. Continuous Delivery enables the repeated experimentation and learning that are at the heart of digital innovation.
Continuous Delivery with DataOps is a key enabler of digital innovation
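As a minimal illustration of DataOps-style automation, the sketch below shows a simple data quality check of the kind a Continuous Delivery pipeline could run on every change. The column names and thresholds are hypothetical; dedicated tools typically provide such checks out of the box.

```python
# Sketch of an automated data quality gate for a DataOps pipeline.
# Column names and thresholds are hypothetical.
def check_training_batch(rows: list) -> list:
    """Validate a batch of training records; return human-readable failures."""
    failures = []
    if not rows:
        failures.append("batch is empty")
        return failures
    null_labels = sum(1 for r in rows if r.get("label") is None)
    if null_labels / len(rows) > 0.01:
        failures.append("more than 1% of records are missing labels")
    if any(r.get("amount", 0) < 0 for r in rows):
        failures.append("negative amounts found")
    return failures

# In CI, a non-empty result would fail the pipeline and block deployment.
batch = [{"label": 1, "amount": 10.0}, {"label": None, "amount": 5.0}]
assert check_training_batch(batch), "expected the faulty batch to be flagged"
```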
Data Mesh
Data Mesh signifies a paradigm shift in data management and integration. By allocating data ownership to business domains, Data Mesh facilitates data-driven initiative and innovation like nothing else before. With decentralization at its core, Data Mesh has scalability built in. For in-depth background, see Data Mesh book review and beyond.
Data Mesh supports AI at Scale natively – not only as a data management solution but also indirectly through the decentralized operating model it implies. The full discussion on the operating model needs to wait for a later article. For now, it is sufficient to note that the decentralized operating model for AI at Scale and Data Mesh are not only fully aligned – they are also highly synergistic, sharing many key concepts from ownership to shared semantic understanding.
Data Mesh and decentralized operating model for AI at Scale are highly synergistic
Achieving AI at Scale without Data Mesh is possible but neither efficient nor effective in the long term. See the discussion on Data Fabric and Data Mesh below.
Data Mesh builds on four cornerstones:
Summarized, Data Mesh is a combination of 1) a Decentralized Operating Model consisting of a domain-driven organizational structure, distributed governance and policy enforcement, platform engineering, and practices for data product lifecycle management, 2) Architectural design for the domain structure, data products and the data mesh platform, and 3) Platform technology, including modern software development methodology enabling data product development, sharing and portability.
Data Mesh is a combination of decentralized operating model, architectural design, and platform technology.
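To illustrate what “data as a product” can look like in code, here is a hypothetical, highly simplified data product descriptor. The field names are illustrative only; the point is that ownership, output ports, service levels and policies travel with the product itself.

```python
# Illustrative sketch of a domain-owned data product descriptor.
# Field names and values are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataProduct:
    name: str
    domain: str                    # owning business domain
    output_ports: List[str]        # e.g. file path or API endpoint identifiers
    sla_freshness_minutes: int     # promise made to consumers
    policies: List[str] = field(default_factory=list)  # enforced computationally

orders = DataProduct(
    name="orders_enriched",
    domain="sales",
    output_ports=["parquet://lake/sales/orders_enriched", "api://sales/orders"],
    sla_freshness_minutes=60,
    policies=["gdpr:mask-pii", "retention:365d"],
)
print(orders.domain, orders.output_ports[0])
```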
Data cataloging and metadata
Data cataloging is an essential enabler of data organization, governance and discovery. It serves multiple purposes, including:
Metadata, i.e. data about data, plays a central role in data cataloging and in data management and integration in general. Metadata falls into several categories:
In traditional contexts predating Data Mesh, a data catalog is typically associated with data warehouses and data lakes storing structured and unstructured data in centralized repositories. Organizationally, data catalog maintenance is then performed by centralized IT and data teams.
Data Mesh fundamentally changes the way data cataloging is done by shifting from a centralized to a decentralized operating model. With distributed data ownership, responsibility for data asset maintenance and data cataloging moves to business domains. Metadata management remains critical but is now done in a federated manner. Data products carry their own metadata and must comply with overarching governance policies that are set centrally but applied locally – through Federated Computational Governance, as discussed above.
While Data Mesh decentralizes data ownership, data discovery across all company data assets remains essential. Therefore, a centralized data catalog would still exist, but its role shifts to enabling a federated discovery mechanism in which data products registered by each business domain are discoverable enterprise-wide.
Data Mesh decentralizes data ownership but continues to rely on a centralized catalog for data products to be discoverable enterprise-wide.
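A minimal sketch of such a federated discovery mechanism might look like this: each domain registers its own data products in a shared catalog, and any team can search them enterprise-wide. Real catalogs add lineage, access control and much more; the structure here is purely illustrative.

```python
# Minimal sketch of a central catalog used purely for federated discovery.
# Product names, tags and endpoints are hypothetical.
catalog = {}  # product name -> metadata registered by the owning domain

def register(domain: str, name: str, tags: list, endpoint: str):
    """Each domain registers and maintains its own data products."""
    catalog[name] = {"domain": domain, "tags": tags, "endpoint": endpoint}

def discover(tag: str) -> list:
    """Enterprise-wide search across products registered by all domains."""
    return [name for name, meta in catalog.items() if tag in meta["tags"]]

register("sales", "orders_enriched", ["orders", "revenue"], "api://sales/orders")
register("logistics", "shipments", ["orders", "delivery"], "api://logistics/shipments")
print(discover("orders"))  # -> ['orders_enriched', 'shipments']
```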
Data Fabric vs. Data Mesh
Data Fabric builds on traditional data cataloging by applying automation and AI-driven insight to make data discovery and access even easier. In effect, Data Fabric adds a semantic layer over the existing data architecture consisting of data warehouses, lakes and catalogs. In this way, it aims to address the cardinal sin of centralized data storage solutions: semantic understanding lost forever – with data lakes turning into data swamps nobody wants to touch with a ten-foot pole.
Data Fabric addresses the cardinal sin of centralized data storage solutions: semantic understanding forever lost
While representing a significant improvement, from the AI at Scale perspective Data Fabric remains only a partial solution: high intermediate potential, but lacking the characteristics of an ultimate solution.
Why is this?
Because Data Fabric addresses only a relatively small subset of the overall AI at Scale challenge on digital capabilities. To achieve true AI at Scale, digital capabilities must be perceived holistically rather than from a data management perspective only. Because of that limited perspective, Data Fabric remains a point solution – valuable and effective but not sufficient.
Because of its limited data management perspective, in the context of AI at Scale, Data Fabric remains a point solution.
Conversely, while Data Mesh is not a complete solution either, it aligns with a complete AI at Scale solution in ways that Data Fabric does not. Rather than starting from data management, Data Mesh – by assigning data ownership to business domains – in effect starts from the operating model. And that makes all the difference.
Data Mesh aligns with a complete AI at Scale solution
In terms of adding semantic understanding, the goal and outcome of Data Fabric and Data Mesh are, if not equal, at least very similar. But the paths there are completely different: Data Fabric does it by adding AI-driven insight on top of the existing architecture and infrastructure, while Data Mesh does it by redistributing data ownership itself.
So, would adding semantic understanding be enough to achieve AI at Scale?
No, it would not.
To truly scale things up, there needs to be ownership, incentives and initiative – all linked tightly with expertise within the Bounded Context of a business domain. Scale emerges from business domains becoming engines of digital innovation and value creation.
Scale emerges from Business Domains becoming engines of digital innovation and value creation
That is the solution proposed and assumed by Data Mesh – a solution that can lead to hundreds of sustainable AI use cases. The full solution description will be included in the article on the operating model.
To scale things up, semantic understanding needs to be accompanied with ownership, incentive and initiative.
Data storage
Using the AI technology evolution perspective discussed above, it is possible to assess how data storage solutions have evolved over the years to serve increasingly demanding AI needs.
As AI evolves from Traditional Analytics (TA) to Machine Learning (ML), and further to Deep Learning (DL) and Generative AI (GAI), each evolutionary step introduces new requirements for data storage solutions. In addition, real-time inference sets specific low-latency requirements for storage solutions.
Traditional Analytics and structured data storage
Traditional Analytics – which is primarily descriptive and diagnostic – relies heavily on structured data. It utilizes data sets that are often tabular and relational, such as those found in SQL databases. At this stage, AI models are relatively simple and data needs are straightforward.
Storage solutions include Relational Databases such as MySQL that are ideal for structured and relational data. They are designed to handle tabular data and enable efficient queries with SQL.
Data Warehouses like Amazon Redshift and Google BigQuery can be used to store massive amounts of structured data. They serve large-scale analytics tasks based on batch processing at scale.
Corresponding AI models cover linear regression, logistic regression and decision trees that utilize structured data efficiently.
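For concreteness, here is a toy example of descriptive analytics over structured, tabular data using SQL, with SQLite standing in for a relational database or warehouse. The schema and figures are made up.

```python
# Toy descriptive analytics over structured data; schema and numbers are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "2024-01", 120.0), ("EMEA", "2024-02", 135.0), ("APAC", "2024-01", 90.0)],
)

# A descriptive question: total revenue per region, i.e. interpreting past data.
for region, total in conn.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)
```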
Machine Learning and storage for semi-structured data
As AI models evolve, so does the complexity of the data they consume. Semi-structured data – such as customer reviews and logs – provides richer insights than relational tabular data, but requires more flexible storage solutions.
Storage solutions include NoSQL Databases such as MongoDB and Cassandra. These databases are designed to handle semi-structured data with high flexibility and scalability. They are particularly useful when data does not fit neatly into relational schemas.
NoSQL databases are highly scalable, making them suitable for AI models that utilize large datasets distributed across multiple servers.
ML with semi-structured data is good for things like classification, anomaly detection and recommendation systems.
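The sketch below illustrates the flexibility of semi-structured document storage, assuming a local MongoDB instance is available on the default port. The collection and fields are hypothetical; note how documents of different shapes can still be queried by a shared field.

```python
# Sketch of storing and querying semi-structured documents.
# Assumes a MongoDB instance at the default local port; fields are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
reviews = client["shop"]["reviews"]

reviews.insert_one({"product": "headphones", "rating": 5, "text": "Great sound"})
reviews.insert_one({"product": "kettle", "rating": 3, "tags": ["kitchen"]})  # different shape

# Query by a shared field even though the document structures differ.
for doc in reviews.find({"rating": {"$gte": 4}}):
    print(doc["product"], doc["rating"])
```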
Deep Learning and unstructured data storage
Deep Learning, enabled by neural networks, excels with large amounts of unstructured data such as images, audio and video. This type of data cannot be handled by traditional databases, pushing the need for more advanced storage solutions.
Solutions include Object Storage such as Amazon S3 and Azure Blob Storage that are designed to handle large volumes of unstructured data like images, videos, audio files and documents. They are scalable and optimized for storing binary large objects (BLOBs). In addition, Distributed File Systems can store unstructured data across multiple nodes, ensuring redundancy and scalability.
With the addition of unstructured data storage capability, the Data Warehouse evolves into the more versatile Data Lake.
Corresponding AI models involve things like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) used for computer vision, speech recognition and natural language processing.
Vector databases for Multimodal Generative AI
Multimodal Generative AI requires a new type of data storage: a vector database supporting vector embeddings that represent the semantic meaning of text, images, audio and video in a unified vector space. Representing data as high-dimensional vectors enables contextual understanding across these modalities.
Contextual semantic understanding and ability to associate meaning across different data modalities is both fascinating and revolutionary.
Such capability unlocks transformative AI use cases, from text-to-video tools to sophisticated virtual assistants interacting with the digital world, to humanoid-like robots operating efficiently and safely in the physical world. In all of these cases, multimodal AI models can interpret and act on inputs of varying modality with deep contextual awareness.
By converting each type of data into high-dimensional vectors, AI models can not only process but also associate meaning across different data modalities in a cohesive way. An AI system built on language models can seamlessly associate its understanding of text with what it sees via machine vision or what it hears through speech recognition.
This will eventually result in a humanoid-like robot, capable of interacting with digital and physical worlds alike, understanding instructions not only based on text or voice but also on visual cues from its surroundings. That is, a physical AI agent that can speak about what it sees and act on what it hears, all while maintaining consistent understanding of its environment.
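At its core, the mechanism is similarity search over embeddings. The sketch below illustrates it with random stand-in vectors and cosine similarity; in reality the embeddings would come from an embedding model, and a vector database would index and search millions of them efficiently.

```python
# Minimal sketch of semantic search over vector embeddings using cosine similarity.
# Embeddings are random stand-ins for outputs of a real embedding model.
import numpy as np

rng = np.random.default_rng(seed=42)
dim = 8  # real embeddings typically have hundreds or thousands of dimensions

# Embeddings of items of different modalities, keyed by a human-readable label.
items = {
    "image: cat on a sofa": rng.normal(size=dim),
    "text: 'a sleeping kitten'": rng.normal(size=dim),
    "audio: dog barking": rng.normal(size=dim),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = rng.normal(size=dim)  # embedding of e.g. the text query "cute cat"
best = max(items, key=lambda k: cosine(query, items[k]))
print("closest item:", best)
```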
Storage for real-time inference
For real-time inference, storage solutions need to support high-speed data access and streaming capabilities. AI models require quick and efficient access to data without the delays that can occur in traditional storage systems.
Storage solutions include In-Memory Databases, such as Redis, that store data in RAM, enabling ultra-low-latency access. In addition, Time-Series Databases, such as Prometheus, are optimized for storing and querying time-stamped data from e.g. IoT sensors.
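As a minimal illustration of low-latency feature access for inference, the sketch below assumes a Redis instance running locally; the keys and feature names are hypothetical.

```python
# Sketch of low-latency feature lookup for real-time inference.
# Assumes a local Redis instance; keys and feature names are hypothetical.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The feature pipeline writes the latest features for each user...
r.hset("features:user:42", mapping={"avg_basket_eur": "37.5", "visits_7d": "4"})

# ...and the inference service reads them with very low latency.
features = r.hgetall("features:user:42")
print(features)  # -> {'avg_basket_eur': '37.5', 'visits_7d': '4'}
```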
Graph database for relationship-oriented use cases
Finally, a Graph Database is designed to model and store complex relationships within data using a graph structure, with nodes (representing entities) and edges (representing relationships between those entities).
Graph databases such as Amazon Neptune efficiently handle highly interconnected data by directly representing relationships, making it easier to query and explore connections in large, complex datasets.
Graph databases are often used alongside other storage solutions, from structured-data-based machine learning all the way to Multimodal Knowledge Graphs, where the graph database stores cross-modal relationships while a vector database manages vector embeddings for semantic search.
Overall, graph databases are essential for specific AI use cases where relationship analysis drives value.
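The graph model itself is simple: nodes for entities, edges for relationships. The in-memory sketch below uses networkx to illustrate a typical relationship-oriented question from fraud detection; a graph database such as Amazon Neptune provides the same model at scale with its own query languages.

```python
# In-memory illustration of the graph model (nodes = entities, edges = relationships).
# Entities and relations are made up; a graph database would handle this at scale.
import networkx as nx

g = nx.Graph()
g.add_edge("account:A", "device:X", relation="logged_in_from")
g.add_edge("account:B", "device:X", relation="logged_in_from")
g.add_edge("account:B", "card:123", relation="paid_with")

# A relationship-oriented question typical of fraud detection:
# are two entities indirectly connected, e.g. through a shared device?
path = nx.shortest_path(g, "account:A", "card:123")
print(" -> ".join(path))
```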
Distributed inference at scale – case Nextdata
As discussed in Computing for AI at Scale, inference is fundamentally about distributed computing. In that context, AI model portability across a multitude of computing environments emerged as a key capability. The very same question applies to data: how to make data available for inference taking place potentially in thousands of edge devices?
Distributed inference manifests itself in two distinct ways:
The ideal connectivity capability is easy to describe: always available, no bandwidth limitations, and very low latency. However, the closer we move towards the edge, the less common such an ideal becomes. Consequently, data portability emerges as an alternative to API-based data access.
In system design terms, the key question then becomes: Do we rely on APIs to access inference data, or do we instead place the data physically at the point of inference?
The answer depends on several design parameters:
With good enough edge connectivity in place, and/or when AI use cases are not sensitive to connectivity limitations, traditional API-based access to inference data is the default choice. Having the data stored in a (more) centralized location makes maintaining and updating it easier.
When possible, choose API-based access to inference data. When not, look for alternative solution.
However, in the context of non-ideal connectivity and with more rigorous real-time use case requirements, inference data needs to be brought physically to the point of inference.
How to do that?
Nextdata has an intriguing solution to the problem, based on data products in portable containers. In that way, inference data accompanies AI models in the way discussed in Computing for AI at Scale – wherever inference is to take place.
Inference data accompanies AI models in portable containers wherever inference is to take place.
Containerized data products follow Data Mesh principles: Inference data is bundled with the necessary transformations and policies into modular, self-contained and highly portable units.
Embedding data with its own transformations and policies results in plug-and-play data products that are easy to deploy across all computing environments. Such modularity creates significant efficiencies for distributed inference scenarios, especially in the context of extensive AI-driven value creation at the edge. A true AI at Scale solution.
Plug-and-play inference data modularity enables AI at Scale
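To make the idea tangible, here is a hypothetical, highly simplified sketch of a self-contained data product that bundles inference data with its transformation and policy. It is meant only to illustrate the plug-and-play principle, not Nextdata's actual implementation.

```python
# Hypothetical sketch of a self-contained, portable data product for edge inference.
# Data, transformation and policy travel together; all names and values are made up.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class PortableDataProduct:
    name: str
    records: List[Dict]                  # the data shipped to the edge device
    transform: Callable[[Dict], Dict]    # transformation travels with the data
    policy: Callable[[Dict], bool]       # policy enforced locally at the point of inference

    def serve(self) -> List[Dict]:
        """Return transformed, policy-compliant records for local inference."""
        return [self.transform(r) for r in self.records if self.policy(r)]

product = PortableDataProduct(
    name="sensor_baselines",
    records=[{"sensor": "s1", "temp_f": 68.0}, {"sensor": "s2", "temp_f": None}],
    transform=lambda r: {"sensor": r["sensor"], "temp_c": (r["temp_f"] - 32) * 5 / 9},
    policy=lambda r: r["temp_f"] is not None,  # drop incomplete readings locally
)
print(product.serve())  # -> [{'sensor': 's1', 'temp_c': 20.0}]
```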
Constraints Assessment
Symmetrically to computing, Constraints Assessment on data integration is about two things: a) verifying whether and how AI data needs in terms of training and inference are being served, and b) assessing the completeness and maturity of AI data solutions and capabilities.
Constraints Assessment investigates how AI data needs are being served and how mature the AI data solutions are
Here are some broad examples of checks to be made in each key assessment area:
In practice, a Constraints Assessment of data capabilities needs to be significantly more detailed and extensive than the examples above. The main thing is to be systematic in order to gain a reliable and accurate view of the current state.
Constraints Assessment needs to be extensive and systematic to result in a reliable and accurate current-state view
Data Strategy for AI at Scale
AI builds on computing and data. Computing strategy was discussed last time; now it is time to explore the key aspects of a data strategy.
The single biggest driver behind data strategy is the competitive landscape becoming AI-defined, with industry leaders pushing the Productivity Frontier with cutting-edge AI use cases. Naturally, AI technology evolution is the underlying force shaping competitive landscapes across all industries. These dynamics were discussed in the article AI-defined competitive landscape.
In addition, the discussion on the technology evolution perspective above showed how AI data needs increase fast when moving from traditional machine learning towards multimodal generative AI. From a business-oriented perspective, generic use cases like Natural Language Processing become significantly more powerful with the latest AI technology, as discussed in the article on AI technology evolution.
Through those perspectives, we are witnessing how AI data needs become increasingly complex as AI models progress from traditional analytics to advanced deep learning and generative AI applications:
Each evolutionary step brings its own data requirements, shaping the infrastructure, governance, and integration capabilities needed to support AI at Scale. As the competitive landscape pushes towards more sophisticated AI use cases, data strategy must accommodate increasingly diverse and intensive data requirements.
To keep up with the competition, data strategy must accommodate increasingly diverse and intensive data requirements.
In essence, it boils down to three things: 1) strategic clarity, 2) decision-making, and 3) systematic data capability build-up. Decision-making in the Age of AI, marked by a high degree of complexity and uncertainty, was discussed in an earlier article, Decision-making in AI transformation, which explored ways to add strategic clarity.
In terms of systematic capability build-up, Constraints Assessment is the first step – to identify and understand shortcomings and bottlenecks in order to start eliminating them one by one.
For systematic data capability build-up, Constraints Assessment is the first step.
Strategic alternatives and options
Most of the data solutions discussed above are not optional. The targeted implementation level may vary depending on business needs, but eventually there needs to be a solution for data governance, batch and real-time data integration, data cataloging, and data storage.
However, in the case of DataOps, Data Fabric, Data Mesh and real-time inference, there are strategic choices to be made. Let’s have a closer look.
DataOps deployment
DataOps is a strategic option that has the potential to bring industrial-grade scale, quality, agility and speed to data operations. Combined with other digital capabilities – most of all AI Engineering and Software Engineering – it enables the Continuous Delivery that is at the heart of digital innovation.
DataOps is not mandatory in the early phases of AI transformation, but achieving AI at Scale does not appear viable without DataOps-level data capabilities.
Choice between Data Fabric and Data Mesh
Data Fabric is an excellent choice as an intermediate solution to boost data cataloging with a semantic layer over existing data infrastructure. However, achieving AI at Scale calls for more than that.
In terms of AI transformation and related change management, Data Mesh requires more effort to implement but – in contrast to Data Fabric – facilitates business domains emerging as engines of digital innovation and value creation. It can be argued that, all things considered, this is the single biggest differentiator in the Age of AI.
Business Domains as engines for digital innovation and value creation appear as the single biggest differentiator in the Age of AI.
Scalability of real-time inference
Distributed inference at scale is a significant system design and technology challenge. With less-than-ideal connectivity, real-time AI use cases at the edge remain unserved by API-based access to inference data.
Bringing inference data physically to the point of inference may not be that difficult – except when it needs to be done in a sustainable manner at scale. Containerization-enabled plug-and-play data products then start to appear as a lucrative strategic option.
Strategic objectives for digital capabilities
In the article on Alignment, five strategic objectives were set for digital capabilities: Scalability, Quality, Speed, Agility and Innovation. The previous article assessed those objectives from the AI computing perspective. Now is the time for a similar assessment from the perspective of data capabilities:
Conclusions
AI builds on data. With the competitive landscape becoming AI-defined, underlying data capabilities make all the difference in terms of competitiveness in the Age of AI.
However, AI data requirements are increasingly demanding, both in terms of AI model training and inference. Each AI technology evolutionary step raises the bar higher.
Keeping up with the competition calls for strategic management of data capabilities. It is critical not only to understand data capabilities’ impact on financial performance – things like pricing power, operational efficiency, margins and growth – but also to have an accurate view of the current state of those capabilities in order to improve.
The main purpose of this article has been to provide tools for strategic management – ultimately for better business performance in the Age of AI.
Constraints Assessment as a Service
Constraints Assessment as a Service covers all digital capability areas from Strategic Management to Data Culture. See the detailed Service Description.
AI at Scale Workshop
The AI at Scale workshop is a compact one-day event for business and technology executives and managers. The workshop seeks answers to the question: What should we as a business and as an organization do to secure our success in the Age of AI?