Data Integration for AI at Scale – Identification 4 of 8
AI at Scale sets the bar high on data capabilities

AI builds on computing and data. Computing was discussed last time; now it is time to explore data capabilities.

Rather than Data Management, the title Data Integration has been chosen to highlight the consumption perspective. The emphasis is on AI data needs rather than on other data consumers like BI or direct data asset monetization – not that those wouldn’t matter.

In reality, however, all data management best practices apply. AI as the primary customer does not change that. On the contrary, in order to get data successfully integrated to power hundreds of AI use cases, many things need to click. As the exploration of AI data needs reveals, the bar is set high for everything about data and related capabilities like data governance.

Because of the apparent symmetry between computing and data, the structure of this article follows that of the computing article: Needs, Solutions, Constraints and Strategy. The content is new but the contexts are largely the same.

The structure of the article follows that of computing: Needs, Solutions, Constraints and Strategy

AI data needs

Exploring AI data needs starts from identifying contexts. With computing, the primary contexts were AI model training and inference. Because of the symmetry between computing and data, there are no reasons to choose otherwise. In addition, computing environments of products, services and processes form the underlying contexts.

However, the symmetry between computing and data ends there. Due to the evolutionary steps of AI technology and the great diversity of AI use cases, the picture of AI data needs is more nuanced than that of AI computing needs. In other words, the full picture of AI data needs emerges from the synthesis of AI model training and inference as contexts, combined with the variance created by AI technology evolution steps and use-case-specific needs.

The picture of AI data needs is more nuanced compared to AI computing needs

AI at Scale has both a qualitative and a quantitative definition. The former is about embedding AI in all aspects of value creation; the latter entails tens or even hundreds of concurrent AI use cases. When combined with AI technology evolutionary steps, the total landscape of AI data needs starts to take form.

At the end of the day, understanding AI data needs is about strategic clarity. The complexities and uncertainties associated with the Age of AI make achieving clarity important. In reality, serving all AI data needs may take a long time due to capability build-up lead times. But there is no excuse for not understanding the needs themselves.

Understanding AI data needs adds to strategic clarity

The article What matters now: AI at Scale captures the essence: “Data, AI and software engineering practices are foundational in achieving AI at Scale. Rather than based on improvisation or artisan-like tailoring and tinkering, these operations need to be industrial-grade from design to development and from testing to deployment.”

That’s it in one word: industrial-grade. AI Engineering and Software Engineering will be explored in future articles; let’s start with data capabilities.

AI at Scale builds on industrial-grade data, AI and software capabilities.

AI model training

Creation and refinement of high-performance AI models require a lot from data. Here’s an outline of the key factors:

  • Large and diverse datasets – AI models, particularly deep learning models, require vast amounts of data to learn patterns and make accurate predictions. The volume of data is often directly correlated with model performance. Diversity of data is also critical, as training on homogeneous data can lead to models that overfit or fail to generalize in real-world scenarios. Overall, training data must represent the real-world complexity the model will encounter.
  • Labeled data – For supervised learning, which is one of the most common approaches in AI, labeled data is crucial. Each data point must be paired with the correct label, e.g. an image labeled "dog" or "cat". Labeled data is used to train the model to make the correct predictions. In some cases, semi-supervised or unsupervised learning can be used, but high-quality labeled data remains important for a wide range of AI models, particularly in industries with complex classification tasks.
  • Data quality and accuracy – Data cleanliness is paramount in AI model training. The presence of noise, inconsistencies or inaccuracies can significantly degrade model performance. Preprocessing steps like data normalization, handling missing values and data augmentation (creating synthetic data) are often necessary to ensure the data is usable for training – see the sketch after this list.
  • Structured and unstructured data – AI models consume both structured data (e.g. spreadsheets) and unstructured data (e.g. text, images, video, audio). Each type of data requires different preprocessing, transformation and integration techniques. Unstructured data presents challenges but is also highly valuable for AI use cases related to natural language processing (NLP), speech recognition, and image and video processing.
  • Historical and real-time data – Training often requires historical data to help models learn from past patterns. This is especially critical in time-series forecasting (e.g. sales, weather) or recommendation engines (e.g. user behavior over time). Access to real-time data streams may also be necessary for training models designed to operate in dynamic environments (e.g. autonomous vehicles or real-time fraud detection).
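
As a rough illustration of the preprocessing steps mentioned above (handling missing values, removing noise, normalization), here is a minimal sketch using pandas and scikit-learn; the file path and column names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load a (hypothetical) raw training dataset
df = pd.read_csv("sensor_readings.csv")

# Handle missing values: numeric gaps filled with column medians
numeric_cols = ["temperature", "vibration", "pressure"]   # assumed columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Remove obvious noise: readings outside a physically plausible range
df = df[(df["temperature"] > -50) & (df["temperature"] < 150)]

# Normalize numeric features into the 0..1 range expected by many models
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```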

Inference

Inference is about applying trained models to new, unseen data to create predictions and insights – or, in the case of generative AI, new data. While model training requires large, often historical datasets, inference has more specific and varied data needs depending on the use case. Here’s an outline of key factors:

  • Real-time data – In many AI applications, such as fraud detection, predictive maintenance or autonomous systems, the model must process real-time data streams. This demands low-latency, high-throughput data pipelines to ensure that the model can provide immediate results. Data for inference needs to be continuously available and accessible in real-time to ensure the model's effectiveness in dynamic environments.
  • Contextual data – For effective inference, the model often requires contextual data. For example, an AI system making recommendations may need access to a user's recent behavior, purchase history or environmental conditions. The ability to integrate such data in real-time or near real-time is critical. In some cases, additional metadata or auxiliary information might be needed to refine the model’s predictions.
  • Streaming data and batch data – While many AI inference systems work on real-time streaming data, others work on batch data. This can involve processing a set of inputs all at once, e.g. image recognition across a dataset of images or transactional analysis in financial systems. Data infrastructure must be able to manage both batch and real-time data, depending on the use case.
  • Data accuracy and timeliness – Unlike training, where large datasets may undergo thorough cleaning and preprocessing, data for inference must be timely and often more lightly processed to ensure that insights are generated in a reasonable time frame. Maintaining high data accuracy is still essential, but some preprocessing steps may be skipped in favor of speed and responsiveness.

Technology evolution perspective

As AI evolves from Traditional Analytics (TA) to Machine Learning (ML), and further to Deep Learning (DL) and Generative AI (GAI), the data needs become increasingly complex and demanding. Each evolutionary step brings its own data requirements, shaping the data solutions necessary to support AI at Scale.

Traditional Analytics

Analytics, often classified under descriptive or diagnostic analytics, focuses on interpreting past data to derive insights, typically through business intelligence tools or statistical methods. Data needs for analytics are less complex compared to ML, DL or GAI.

Training data needs in TA include things like:

  • Structured data – TA primarily operates on structured data from databases and spreadsheets. Clean, pre-processed data is essential.
  • Historical data – Descriptive analytics relies heavily on historical datasets for identifying trends, patterns, and correlations.
  • Aggregated data – High-level summaries of data like monthly sales are often used.

Correspondingly, inference data needs in TA:

  • Batch data – Inference in TA often happens in batch mode. Real-time data is less common.
  • Pre-aggregated data – Since TA typically involves running queries or reports, it uses pre-aggregated datasets rather than raw or streaming data.

Machine Learning

ML involves learning patterns from data using statistical models such as regression, decision trees or support vector machines (SVMs). It is widely used in structured data environments like finance and retail.

Training data needs in ML:

  • Structured data – Data is often structured and numerical, e.g. customer purchase history or sales data.
  • Labeled data – ML models, especially for supervised learning, require labeled datasets for training. This could involve anything from labeled customer data to labeled sensor readings.
  • Feature engineering – Human intervention in preparing and selecting features from structured datasets is a big focus in traditional ML. The data needs to be well-prepared and cleaned through feature engineering to highlight the right aspects for the model.
  • Moderate data volume – Compared to deep learning, traditional ML models can work with smaller datasets.

Correspondingly, inference data needs are as follows:

  • Batch or real-time – Inference for ML can be done in batch mode (e.g. credit scoring after loan applications) or real-time (e.g. fraud detection).
  • Contextual data – To make predictions, ML typically requires access to structured contextual data at inference time, such as a user’s profile or transaction details.

Deep Learning

DL involves neural networks that automatically learn complex patterns from vast amounts of data, significantly reducing the need for human-designed features. DL is the foundation for tasks like image recognition, natural language processing, and complex time-series analysis.

Training data needs in DL:

  • Unstructured data – DL excels on unstructured data consisting of free text documents, images and video.
  • Massive datasets – DL requires very large datasets to achieve good performance. For example, training a convolutional neural network (CNN) for image recognition may require millions of labeled images. Structured data also plays a role in certain applications like time-series analysis.
  • Labeled and unlabeled data – For supervised DL training, labeled datasets are essential. However, DL models are also applied in unsupervised and self-supervised learning contexts, meaning they can leverage large unlabeled datasets, e.g. by using autoencoders to learn representations.
  • Data augmentation – Since labeled data is hard to come by, data augmentation like rotating or cropping images or adding noise to audio is a common technique to artificially increase the size and diversity of training datasets, which enhances model generalization.
  • Complex data preprocessing – Normalization, standardization and other preprocessing techniques are critical to ensure that data fed into neural networks is in the appropriate range, e.g. images normalized to pixel values between 0 and 1 – see the sketch after this list.
  • High dimensionality – DL models work well with high-dimensional data. For example, text embeddings in NLP or high-resolution images can have millions of dimensions, making both data preparation and computational requirements intensive.
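
As a rough sketch of what DL-style augmentation and preprocessing can look like in practice, here is an image pipeline built with torchvision transforms; the dataset path and normalization statistics are illustrative assumptions:

```python
from torchvision import transforms

# Augmentation + preprocessing pipeline for image training data
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),        # crop/resize to a fixed input size
    transforms.RandomHorizontalFlip(),        # augmentation: random flips
    transforms.ColorJitter(brightness=0.2,    # augmentation: lighting variation
                           contrast=0.2),
    transforms.ToTensor(),                    # to tensor, pixel values scaled to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # commonly used ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Applied per image, e.g. via a torchvision dataset:
# dataset = torchvision.datasets.ImageFolder("images/train", transform=train_transform)
```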

DL inference data needs:

  • Real-time data – Many DL use cases, such as speech recognition, real-time translation, or autonomous driving, require access to real-time streaming data. Low-latency, high-throughput pipelines are necessary for DL inference to process and act on incoming data almost instantly.
  • Preprocessed data – Just like in training, data for DL inference often needs to be preprocessed before it is fed into the model, e.g. text tokenization for NLP or image resizing for CNNs.
  • Batch inference – For some use cases, e.g. image recognition on a dataset of thousands of photos, DL inference can also operate in batch mode, processing multiple inputs at once. However, real-time inference is more common.

Generative AI

Generative AI goes beyond recognizing patterns in data to creating new data: text, audio, images and video. This adds new dimensions to data needs.

Generative AI adds new dimensions to data needs

GAI training data needs:

  • Massive, diverse and high-quality datasets – GAI models require extremely large and diverse datasets. For instance, GPT-3 was trained on hundreds of billions of words across diverse text sources to generate human-like text. Similarly, Generative Adversarial Networks (GANs) used for image generation need vast sets of high-quality images.
  • Pretraining and fine-tuning data – GAI often benefits from transfer learning, where the model is pre-trained on a general, large dataset and then fine-tuned on a smaller, domain-specific dataset. This leads to a need to access both general-purpose and task-specific data.
  • Unstructured data – Like deep learning, GAI models thrive on unstructured data, whether it's raw text, images, video or audio files.
  • Contextual and metadata – Generative models, particularly in multimodal tasks (e.g. text-to-image generation), need access to contextual data like captions and tags. That kind of metadata is used to guide the generation process, making rich, labeled data critical for training.
  • Synthetic data – Interestingly, GAI can also create synthetic data for training itself or other models. For example, GANs can generate images that can then be used to augment the training data for other tasks, e.g. object detection models.

Correspondingly, GAI inference data needs are as follows:

  • Low-latency, high-throughput data – GAI models, especially for conversational agents, real-time translation, or interactive content creation (e.g. chatbots), require real-time data to generate immediate responses or outputs based on new inputs.
  • Contextual and auxiliary data – In GAI tasks like generating images from text descriptions, inference requires contextual data or auxiliary metadata to guide the generation. For instance, text-to-image models need precise inputs (descriptions) to generate coherent images.
  • Adaptive data – For tasks like interactive dialogue generation, GAI models must dynamically adjust their output based on changing input data (e.g. context in a conversation), meaning that data integration has to support continuous, adaptive data streams.
  • Preprocessing in multimodal applications – For multimodal applications that combine text, images and audio, GAI requires preprocessing pipelines for different data types, e.g. tokenizing text, resizing images, or converting audio to spectrograms.

In conclusion, advances in AI technology not only demand a lot more from computing but also from data. Powering state-of-the-art multimodal Generative AI use cases is a non-trivial data integration task.

Multimodal Generative AI use cases require a lot from data integration

Use cases perspective

Taking use cases perspective to AI data needs is a powerful way to complement perspectives discussed above. By looking at AI data needs through the lens of specific use cases, it is possible to highlight how different applications have distinct data requirements, even though they may share underlying AI techniques.

Rather than relying on the narrow focus of context-specific use cases, the granularity deployed here is the generic use case discussed in an earlier article on AI technology evolution. That gives a much better overview of the relationship between AI use cases and their typical data needs – even if detailed data needs ultimately remain context-specific.

Natural Language Processing (NLP)

Context-specific use case examples: Text classification, sentiment analysis, machine translation, chatbot.

Data needs:

  • Large amounts of unstructured text: Documents, chat logs, emails, social media posts, etc.
  • Labeled data: For supervised tasks like sentiment analysis, text must be labeled, e.g. "positive," "negative".
  • Diversity in text: Different languages, dialects, and contexts, e.g. informal social media vs. formal emails are necessary for models to generalize.
  • Preprocessing: Tokenization, stop word removal, stemming and lemmatization – see the sketch after this list.
  • Real-time data: For applications like real-time translation or chatbots, the model needs access to live, streaming data.
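
As a rough sketch of this kind of preprocessing, the following uses NLTK; the sample sentence is illustrative, and the referenced NLTK resources may need to be downloaded first:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources
for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource)

text = "The delivery was late, but the support team resolved the issue quickly."

tokens = word_tokenize(text.lower())                 # tokenization
tokens = [t for t in tokens if t.isalpha()]          # drop punctuation and numbers
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]  # stop word removal

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]   # lemmatization
print(lemmas)   # ['delivery', 'late', 'support', 'team', 'resolved', 'issue', 'quickly']
```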

Image and Video Processing

Context-specific use case examples: Object detection, image classification, facial recognition, video surveillance analysis.

Data needs:

  • High-resolution, labeled images or video: Data must be pre-annotated for tasks like object detection, e.g. bounding boxes for faces or objects.
  • Diverse data: Models need diverse datasets to generalize to different environments (lighting conditions, angles, backgrounds).
  • Data augmentation: To artificially increase the training set, techniques like flipping, cropping, or adding noise to images are often applied.
  • Batch and real-time processing: Video surveillance and self-driving cars, for instance, need to process data in real time, whereas other tasks like batch image classification can be done offline.

Speech Recognition

Context-specific use case examples: Voice-to-text, virtual assistants, real-time transcription, voice commands.

Data needs:

  • Labeled audio data: Voice recordings paired with text transcriptions are essential for training speech recognition models.
  • Diversity in accents and environments: Models need exposure to diverse accents, languages, and noisy environments (background noise, overlapping speech).
  • Real-time streaming data: Many applications like virtual assistants or real-time transcription systems require streaming data to generate immediate responses.
  • Multimodal data: In some cases (e.g. video calls), both audio and video data may be combined for better context understanding.

Automation

Context-specific use case examples: Workflow automation, intelligent assistants, autonomous vehicles.

Data needs:

  • Structured and unstructured data: Automation tasks often involve both (e.g., structured data from ERP systems and unstructured customer queries in chatbot scenarios).
  • Real-time decision data: Autonomous vehicles, for example, need constant real-time updates from sensors (cameras, LiDAR, etc.) to make split-second decisions.
  • Pre-labeled training data: For chatbots or virtual assistants, extensive interaction data (paired with expected responses) is needed for training.
  • Operational data: Automation systems in manufacturing or robotics may require access to real-time sensor data for optimal operation.

Multimodal User Interface

Context-specific use case examples: UI that combines text, voice and visual inputs, e.g. AI assistants with visual components.

Data needs:

  • Multimodal datasets: The system needs data that spans text, images, video, and speech all in one instance, e.g. voice command paired with a video or image response.
  • Synchronization of data streams: Real-time integration of multiple data types, e.g. processing voice input while visually analyzing an image.
  • Contextual and auxiliary data: The system needs to use metadata or user context to provide more accurate, context-aware responses.

Recommendation Systems

Context-specific use case examples: E-commerce product recommendations, content streaming platforms, personalized ads.

Data needs:

  • User interaction data: Clicks, search history, purchase records, viewing history, and ratings are necessary for training models to make personalized recommendations.
  • Real-time data: For dynamic recommendation systems, user interaction data must be processed in real time to update recommendations.
  • Collaborative filtering data: Data about user similarities (users who watched similar shows, bought similar products) is essential to recommendation engines.
  • Batch processing: Periodic updates to user profiles based on new user behavior data can be processed in batches.

Anomaly Detection

Context-specific use case examples: Fraud detection, network intrusion detection, manufacturing defect detection.

Data needs:

  • Labeled and unlabeled data: Many anomaly detection systems use unsupervised learning, so they work with unlabeled datasets. However, labeled data is still important for fine-tuning models.
  • Historical data: To detect anomalies, systems need a baseline of normal behavior across time.
  • Real-time streaming data: Many applications, such as fraud detection in financial transactions, require real-time analysis to catch anomalies as they happen.
  • Imbalanced datasets: Anomalies are rare, so models often deal with imbalanced datasets (where the “normal” class dominates). Data balancing techniques may be necessary.

Robotics

Context-specific use case examples: Industrial robots, autonomous drones, robotic surgery.

Data needs:

  • Sensor data: Real-time data from various sensors (e.g. LiDAR, cameras, accelerometers, gyroscopes) is essential for robotic decision-making and navigation.
  • Time-series data: Robots often rely on time-series data to track movement or environmental changes over time.
  • Pre-labeled data: For training autonomous navigation or object recognition models, large volumes of annotated video or LiDAR data are required.

Digital Twins

Context-specific use case examples: Virtual models of physical assets for real-time monitoring, predictive maintenance, simulations.

Data needs:

  • Sensor data: Digital twins rely on real-time sensor data from the physical asset to mirror its behavior in the digital environment.
  • Historical data: To build accurate models of physical assets, historical operational data is needed to train predictive models.
  • Batch and real-time data: While real-time data is necessary for live updates, batch data processing might be used for retrospective analyses or simulations.
  • Multimodal data: Depending on the complexity of the physical system, data from multiple types of sensors (e.g. temperature, vibration, pressure) might need to be integrated into the twin.

Predictive Analytics

Context-specific use case examples: Sales forecasting, customer churn prediction, supply chain optimization.

Data needs:

  • Structured historical data: Predictive models require structured historical datasets, such as sales data, customer behavior data, or transactional data, for accurate forecasting.
  • Labeled data: Labeled datasets are crucial for supervised learning tasks, e.g. whether a customer churned or not.
  • Time-series data: For forecasting tasks, time-series data with temporal patterns and trends is critical.
  • Batch processing: Predictive analytics often processes data in batch mode (e.g. quarterly sales forecasts) but may incorporate real-time data for near real-time predictions in some cases.

Operational Process Optimization

Context-specific use case examples: Manufacturing line optimization, resource allocation in logistics, scheduling.

Data needs:

  • Operational data: Process optimization relies on vast amounts of real-time operational data from sensors, machines, or business processes.
  • Structured data: Historical structured data is used to model and optimize operations, e.g. manufacturing times, equipment uptime.
  • Batch and real-time data: Some processes require batch data processing, e.g. optimizing the next production cycle, while others, like dynamic scheduling or resource allocation, may need real-time adjustments.

Resource Usage Optimization

Context-specific use case examples: Energy consumption optimization, water usage management, cloud resource optimization.

Data needs:

  • Time-series data: Resource optimization often requires detailed, timestamped data on resource usage, e.g. energy usage over time.
  • Sensor data: In fields like energy and utilities, sensor data from smart meters or IoT devices is necessary to track consumption patterns in real-time.
  • Batch and real-time data: Predictions can be made based on historical data, but real-time data is also needed for dynamic optimization, e.g. adjusting energy consumption based on peak times.

AI data solutions

Put simply, AI data solutions are about serving AI data needs across all contexts from training to inference, and across all technology evolutionary steps and AI use cases. As we will shortly discover, the ways data solutions answer the call are diverse and may appear complex if not addressed with rigor.

AI data solutions are about serving AI data needs outlined above

Some solutions are foundational without realistic alternatives, some come with alternatives and options. Some solutions operate more in the background, others in the frontline with direct involvement in operationalizing AI training and inference.

In the context of AI at Scale, data solutions’ scalability is naturally in focus. It turns out that there are fundamental differences in that respect – to be taken into consideration when AI at Scale is set as a strategic objective.

To keep this article reasonably compact, solutions are merely introduced. That is, they are Identified. It is for later articles to explore their Acquisition (how to obtain the digital capability in question), Configuration (how to organize and structure the capability), and Management (how to maintain the capability for optimum long-term results).

This article is about Identification. It is for later articles to explore the aspects of Acquisition, Configuration and Management of data solutions alongside other digital capabilities needed for AI at Scale.

Data Governance

Data Governance is about those underlying and often invisible structures, processes and practices without which the data integration frontline would soon collapse. AI at Scale relies on rock-solid data governance practices and there is little room for compromise.

AI at Scale relies on rock solid data governance

As in the case of corporate governance, significant failures in data governance may fester into a catastrophe. Less dramatic outcomes include data governance shortcomings leading to unreliable data for AI models, compliance risks and operational inefficiencies.

Data Governance is an enabler for a multitude of things including data quality, consistency, security, privacy and regulatory compliance. However, through ownership, accountability and initiative, data governance also connects to data relevance and business value.

It turns out that there is a superstructure beyond data governance: the company operating model. It goes without saying that these structural elements need to be carefully aligned. The operating model and related structural elements will be discussed in later articles; for now, they are assumed rather than directly observed.

Data governance is to enable data quality, consistency, security, privacy and compliance but also ownership, accountability and initiative.

Data governance is best seen as a framework encompassing policies, standards and practices that ensure effective and efficient use of data within an organization.

The key data governance design parameter is this: it should facilitate innovation and progress rather than impede them. Poorly designed or implemented data governance may result in perfect regulatory compliance but with excessive harm to the business. Therefore, the data governance framework needs to be agile, with streamlined operations and a maximum amount of automation. All practices and standards must work to improve efficiency rather than add control for control’s sake.

Data governance implementation in the absence of business acumen may lead to serious shortcomings. Therefore, data governance implementation in the context of AI at Scale connects to strategic management.

Data governance is to facilitate rather than impede digital innovation

Overall, a data governance framework consists of the following key elements:

  • Data ownership and responsibilities – Data ownership assigned as per business needs. Key data governance responsibilities defined, e.g. data stewardship.
  • Governance structure – Committees, steering groups and working groups defined for effective policy definition and enforcement, and for overall data related decision-making. Governance bodies with clear charters and mandates, including participating roles.
  • Policy definition – Establishing rules, standards and practices for effective data management. Covers things like access rights, usage guidelines, data lifecycle management guidelines, quality standards, security and privacy policies, and compliance requirements.
  • Policy enforcement – Ensuring that data policies are adhered to across the organization. Policy enforcement can be manual, through decision-making within the governance structure and through daily data management operations, or automated with policy-as-code mechanisms (see the sketch after this list). Policy enforcement covers a multitude of aspects including data classification and setting policies in the data catalog, encryption and anonymization, regulatory compliance monitoring, oversight of ethical use of data, and defining processes for data lifecycle management.
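
As a rough illustration of what policy-as-code can mean in practice, here is a minimal sketch that validates a data asset’s declared metadata against centrally defined rules before publication; the policy rules and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str
    classification: str      # e.g. "public", "internal", "confidential"
    pii: bool                # contains personally identifiable information?
    encrypted_at_rest: bool
    retention_days: int

def validate(asset: DataAsset) -> list[str]:
    """Centrally defined policy rules, enforced automatically at publish time."""
    violations = []
    if asset.pii and not asset.encrypted_at_rest:
        violations.append("PII data must be encrypted at rest")
    if asset.pii and asset.classification == "public":
        violations.append("PII data cannot be classified as public")
    if asset.retention_days > 365 * 5:
        violations.append("Retention exceeds the 5-year maximum")
    return violations

asset = DataAsset("customer_orders", "internal", pii=True,
                  encrypted_at_rest=True, retention_days=730)
print(validate(asset) or "Policy checks passed")
```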

An ideal data governance implementation scales through decentralization and deploys a high degree of automation. For federated data governance with automated policy enforcement, see Data Mesh below.

High performance scalable data governance utilizes decentralization and automation

Batch data integration

Batch data integration is the process of consolidating data from different sources to provide a unified and coherent view of the data. Integration is basically about breaking data silos by bringing data together from multiple disparate sources. The sources can be anything from legacy databases to IoT sensors and from operational ERP modules to commercial third-party data.

Batch data integration deals with large datasets. As discussed above, AI model training is largely based on batch data rather than real-time data. Therefore, batch data integration is a critical enabler for virtually all AI model training. Somewhat depending on the technology evolutionary step and use case, AI model training may require enormous volumes of highly diverse data, including unstructured data like images, videos and text. Batch data integration solutions must handle this scale and diversity efficiently.

Batch data integration is a critical enabler for AI model training

In addition to a unified view, other key batch data integration objectives include accuracy, consistency, timeliness and efficiency. Or to put it another way: making sure that AI training gets high-quality data in large enough quantities when it needs it.

Batch data integration can be depicted as a data pipeline with raw data extracted from multiple data sources, transformed to suit model training needs, and then loaded into a data repository to be accessed later – the so-called Extract-Transform-Load (ETL) pipeline.

There are several solutions available for each of those stages – for example, Apache Spark for data transformation, and various data warehouse and data lake solutions to store data once it has been processed and transformed. A data warehouse is used for traditional structured data whereas a data lake can store unstructured data as well. The combination of data warehouse and data lake is sometimes called a lakehouse.

ELT is another pipeline variant where data transformation happens after the data has been stored in a repository. Of the two, ETL is traditionally used when the transformation logic is complex and the volume of data is manageable, whereas ELT is better suited for large datasets.
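
As a rough sketch of an ETL-style batch job, the following uses pandas; a production pipeline would typically run on Spark or a similar engine and be orchestrated by a scheduler, and the file paths and column names here are illustrative assumptions:

```python
import pandas as pd

# Extract: pull raw data from (hypothetical) source exports
orders = pd.read_csv("exports/orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("exports/customers.csv")

# Transform: join sources, clean, and aggregate into a training-friendly shape
df = orders.merge(customers, on="customer_id", how="left")
df = df.dropna(subset=["order_value"])
df["month"] = df["order_date"].dt.strftime("%Y-%m")
monthly = (
    df.groupby(["month", "segment"], as_index=False)["order_value"]
      .sum()
      .rename(columns={"order_value": "monthly_revenue"})
)

# Load: write the result into the analytics repository (here, a parquet file)
monthly.to_parquet("warehouse/monthly_revenue.parquet", index=False)
```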

Interestingly, more complex transformations can even utilize AI models themselves. In such a scenario, an AI model would act both as a data consumer and as a data source.

Complex data transformations may utilize AI models themselves

Real-time data integration

Real-time data integration and batch data integration differ significantly in their architecture, methods, objectives and often also in contexts. The core objective of real-time data integration is to enable low-latency data ingestion, transformation and processing to meet real-time AI inference needs. In comparison, AI model training relies mostly, although not exclusively, on historical batch data.

The need to minimize latency is what defines real-time data integration. That is, real-time data integration solutions are optimized for low latency rather than for large amounts of historical data going through elaborate transformations. While batch data integration can take the time needed for complex data transformations, real-time integration targets simpler and quicker action.

Need to minimize latency defines real-time data integration

The use cases perspective discussed above recognized the need for real-time data across the board: virtually all listed use cases require real-time data – if not for training, then at the latest for inference.

However, also in the context of inference there is a significant amount of variance. AI use cases designed for immediate responses require real-time data, whereas use cases such as predictive analytics can be based on historical data and batch data integration.

An overview of key differences between batch and real-time data integration looks like this:

  • Objectives and expected results – Real-time processing aims to provide immediate insights to enable quick decision-making. Real-time output is used for e.g. interaction with human user, alerts, or immediate input to operational systems. In comparison, results of batch processing are typically more comprehensive and used for in-depth analysis and historical insights.
  • Input data – Real-time processing relies on continuous stream of data that is processed immediately on arrival. Batch data integration deals with large volumes of data collected over a period of time and then processed later at a scheduled time.
  • Data transformation – Real-time processing transforms data on the fly, in milliseconds or at most within seconds, with emphasis on speed, efficiency and simplicity. Batch processing involves more comprehensive and complex transformations since there is more time to spend.
  • State management – Real-time AI models often need to manage state over a stream of data, handling state changes dynamically. Batch data based models can work with a static dataset without state information.

Traditional ETL/ELT pipelines are generally not used for real-time processing, primarily because the transform stage in them takes too much time. Instead, stream processing techniques and tools like event streaming platforms are employed to handle continuous data streams from e.g. IoT sensors in edge devices or online customer interactions. Low-latency data APIs can be deployed to access data needed for real-time inference.
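
As a rough sketch of stream processing for real-time inference, the following consumes events from a Kafka topic using the kafka-python client and scores each event as it arrives; the topic name, broker address and predict function are illustrative assumptions:

```python
import json
from kafka import KafkaConsumer   # kafka-python client (assumed; any streaming client would do)

def predict(event: dict) -> float:
    # Placeholder for a real model call, e.g. a loaded scikit-learn or ONNX model
    return 1.0 if event.get("amount", 0) > 10_000 else 0.0

consumer = KafkaConsumer(
    "transactions",                              # assumed topic name
    bootstrap_servers="localhost:9092",          # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Continuous, low-latency loop: score each event as soon as it arrives
for message in consumer:
    event = message.value
    if predict(event) > 0.5:
        print(f"Possible fraud: transaction {event.get('id')}")
```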

DataOps

DevOps has firmly established itself as a software development practice that brings development and operations together to improve speed, agility and quality. DataOps extends many DevOps practices and methods to data. Further, MLOps builds on both of them to operationalize and scale up AI model development and deployment.

Although DevOps, DataOps and MLOps are conceptually separate, they are closely interlinked in practice. Therefore, the full “XOps” discussion will have to wait until the article on the operating model, where the trio can be treated in a unified and consistent manner. In the meantime, a short introduction to DataOps will have to suffice.

The core DataOps objective is to orchestrate data workflows and automate pipelines, ensuring industrial-grade scale, quality, agility and speed across data operations – enabling innovative data solutions and AI use cases. AI at Scale cannot be based on artisan-like ad-hoc tailoring and tinkering.

DataOps brings industrial-grade scale, quality, agility and speed to data operations – enabling AI at Scale.

Just like with DevOps, Continuous Delivery is the central element of DataOps. Continuous Delivery builds on a set of interlinked activities including automation, computing environment configuration, version control of everything, operations monitoring, and rollback. Continuous Delivery enables the repeated experimentation and learning that are at the heart of digital innovation.

Continuous Delivery with DataOps is key enabler for digital innovation

Data Mesh

Data Mesh signifies a paradigm shift in data management and integration. By allocating data ownership to business domains, Data Mesh facilitates data-driven initiative and innovation like nothing else before. With decentralization at its core, Data Mesh has scalability built in. For in-depth background, see Data Mesh book review and beyond.

Data Mesh supports AI at Scale natively – not only as a data management solution but also indirectly through the decentralized operating model it implies. A full discussion on the operating model needs to wait for a later article. For now, it is sufficient to note that the decentralized operating model for AI at Scale and Data Mesh are not only fully aligned – they are also highly synergistic, sharing many key concepts from ownership to shared semantic understanding.

Data Mesh and decentralized operating model for AI at Scale are highly synergistic

Achieving AI at Scale without Data Mesh is possible but neither efficient nor effective in the long term. See the discussion on Data Fabric and Data Mesh below.

Data Mesh builds on four cornerstones:

  • Business Domains – Domains are the key mechanism to implement decentralized data ownership. Domains can be business functions like Marketing but they can also be defined in other ways. Key principle is to tie data ownership together with business know-how in order to create tight-knit operational teams with shared semantic understanding.
  • Data Products – Data products provide the means to facilitate data ownership, sharing, discovery and utilization. By definition, data products are owned and maintained by business domains. Data product design makes them easy to develop, maintain and share. Data product portability is essential in the context of distributed inference at scale, see Case Nextdata below.
  • Self-serve platform – Self-serve platform enables domains to take ownership over data products by hiding most of the related engineering complexity. By reducing cognitive load on the domain team, self-service platform minimizes operational costs. An extension of self-service platform – i.e. extensive platform engineering capabilities – is an essential enabler for decentralized operating model overall.
  • Federated computational governance – The fourth Data Mesh cornerstone enables efficient data governance in highly distributed environment where data is being maintained by multiple business domains rather than single centralized authority. Enter high degree of automation. With Data Mesh, data policies are still created (for most part) centrally but enforced in distributed and automated manner. In essence, Business Domains are responsible for enforcing policies locally but within global governance guidelines.

In summary, Data Mesh is a combination of 1) a decentralized operating model consisting of domain-driven organizational structure, distributed governance and policy enforcement, platform engineering, and practices for data product lifecycle management, 2) architectural design for domain structure, data products and the data mesh platform, and 3) platform technology, including modern software development methodology enabling data product development, sharing and portability.

Data Mesh is a combination of decentralized operating model, architectural design, and platform technology.

Data cataloging and metadata

Data cataloging is an essential enabler for data organization, governance and discovery. It serves multiple purposes, including:

  • Data Governance – By associating governance rules like access control, privacy and compliance requirements with data assets, data catalog ensures that only authorized users can access specific data.
  • Data discovery – Provide data users with searchable inventory of available data assets within the company, so that they can quickly find, understand and utilize data.
  • Metadata management – Create a central repository of metadata that describes each data asset’s origin, structure, quality, relationships and governance policies.
  • Enhance collaboration – Foster collaborative environment where data teams can share best practises, insights and documentation about data, creating a shared knowledge base.
  • Data Quality – Ensure that users can trust the data they find by maintaining and exposing metadata on data lineage, source information and data quality metrics.

Metadata, i.e. data about data, plays a central role in data cataloging and in data management and integration in general. Metadata falls into several categories (a small illustrative sketch follows the list):

  • Descriptive metadata provides detailed description of the data, e.g. field names, data types, size, and dependencies.
  • Structural metadata describes how the data is organized and stored, e.g. database schema and file formats, thus providing critical information for data integration tasks.
  • Administrative metadata covers governance policies, data security, access control, and compliance rules.
  • Lineage metadata documents the history of data, including where it was originated and how it has been transformed. This enables data traceability and auditability.
  • Operational metadata tracks data usage patterns, quality metrics, and operational performance.
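
As a rough sketch of how these metadata categories might be captured for a single data asset, here is a hypothetical catalog entry expressed as a Python data structure; the field names and values are purely illustrative:

```python
catalog_entry = {
    "descriptive": {
        "name": "customer_orders",
        "fields": {"order_id": "string", "order_value": "decimal", "order_date": "date"},
    },
    "structural": {
        "format": "parquet",
        "location": "warehouse/customer_orders/",
        "partitioned_by": ["order_date"],
    },
    "administrative": {
        "owner": "sales-domain",
        "classification": "internal",
        "access_policy": "sales-analysts-read",
    },
    "lineage": {
        "sources": ["crm.orders", "erp.invoices"],
        "transformation": "jobs/build_customer_orders.py",
    },
    "operational": {
        "last_updated": "2024-09-30",
        "quality_score": 0.97,
        "queries_last_month": 1240,
    },
}
```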

In traditional contexts predating Data Mesh, a data catalog is typically associated with data warehouses and data lakes storing structured and unstructured data in centralized repositories. Organizationally, data catalog maintenance is then performed by centralized IT and data teams.

Data Mesh fundamentally changes the way data cataloging is done by shifting from a centralized to a decentralized operating model. With distributed data ownership, responsibility for data asset maintenance and data cataloging moves to business domains. Metadata management remains critical but is now done in a federated manner. Data products carry their own metadata and must comply with overarching governance policies that are set centrally but applied locally – through Federated Computational Governance, as discussed above.

While Data Mesh decentralizes data ownership, data discovery across all company data assets remains essential. Therefore, a centralized data catalog would still exist, but its role shifts to enabling a federated discovery mechanism where data products registered by each business domain are discoverable enterprise-wide.

Data Mesh decentralizes data ownership but continues to rely on centralized catalog for data products to be discoverable enterprise-wide.

Data Fabric vs. Data Mesh

Data Fabric builds on traditional data cataloging by applying automation and AI-driven insight to make data discovery and access even easier. In effect, Data Fabric adds a semantic layer over the existing data architecture consisting of data warehouses, lakes and catalogs. In this way, it aims to address the cardinal sin of centralized data storage solutions: semantic understanding lost forever – with data lakes turning into data swamps nobody wants to touch with a ten-foot pole.

Data Fabric addresses the cardinal sin of centralized data storage solutions: semantic understanding forever lost

While representing a significant improvement, from the AI at Scale perspective Data Fabric remains only a partial solution – it has high intermediate potential but lacks the characteristics of an ultimate solution.

Why is this?

Because Data Fabric addresses only a relatively small subset of the overall AI at Scale challenge on digital capabilities. In order to achieve true AI at Scale, digital capabilities have to be perceived holistically rather than from a data management perspective only. Because of that limited perspective, Data Fabric remains a point solution – valuable and effective but not sufficient.

Because of its limited data management perspective, in the context of AI at Scale Data Fabric remains a point solution.

Conversely, while Data Mesh is not a complete solution either, it aligns with a complete AI at Scale solution in ways that Data Fabric does not. Rather than starting from data management, Data Mesh – by assigning data ownership to business domains – in effect starts from the operating model. And that makes all the difference.

Data Mesh aligns with complete AI at Scale solution

In terms of adding semantic understanding, the goal and outcome of Data Fabric and Data Mesh are, if not equal, at least very similar. But the path there is completely different: Data Fabric adds AI-driven insight on top of existing architecture and infrastructure, whereas Data Mesh redistributes data ownership itself.

So, would adding semantic understanding be enough to achieve AI at Scale?

No, it would not.

To truly scale things up, there needs to be ownership, incentives and initiative – all linked tightly with expertise within the Bounded Context of a business domain. Scale emerges from business domains becoming engines of digital innovation and value creation.

Scale emerges from Business Domains becoming engines of digital innovation and value creation

That is the solution proposed and assumed by Data Mesh – a solution that can lead to hundreds of sustainable AI use cases. A full solution description will be included in the article on the operating model.

To scale things up, semantic understanding needs to be accompanied with ownership, incentive and initiative.

Data storage

Using the AI technology evolution perspective discussed above, it is possible to assess how data storage solutions have evolved over the years to serve increasingly demanding AI needs.

As AI evolves from Traditional Analytics (TA) to Machine Learning (ML), and further to Deep Learning (DL) and Generative AI (GAI), each evolutionary step introduces more requirements on data storage solutions. In addition, real-time inference sets specific low-latency requirements to storage solutions.

Traditional Analytics and structured data storage

Traditional Analytics – which is primarily descriptive and diagnostic – relies heavily on structured data. It utilizes data sets that are often tabular and relational, such as those found in SQL databases. At this stage, AI models are relatively simple and data needs are straightforward.

Storage solutions include Relational Databases such as MySQL that are ideal for structured and relational data. They are designed to handle tabular data and enable efficient queries with SQL.

Data Warehouses like Amazon Redshift and Google BigQuery can be used to store massive amounts of structured data. They serve large-scale analytics tasks based on batch processing at scale.

Corresponding AI models include linear regression, logistic regression and decision trees, which utilize structured data efficiently.

Machine Learning and storage for semi-structured data

As AI models evolve, so does the complexity of the data they consume. Semi-structured data – such as customer reviews and logs – provides richer insights than relational tabular data alone, but requires more flexible storage solutions.

Storage solutions include NoSQL Databases such as MongoDB and Cassandra. These databases are designed to handle semi-structured data with high flexibility and scalability. They are particularly useful when data does not fit neatly into relational schemas.

NoSQL databases are highly scalable, making them suitable for AI models that utilize large datasets distributed across multiple servers.

ML with semi-structured data is good for things like classification, anomaly detection and recommendation systems.

Deep Learning and unstructured data storage

Deep Learning, enabled by neural networks, excels on large amounts of unstructured data such as images, audio and video. This type of data cannot be handled by traditional databases, pushing the need for more advanced storage solutions.

Solutions include Object Storage such as Amazon S3 and Azure Blob Storage that are designed to handle large volumes of unstructured data like images, videos, audio files and documents. They are scalable and optimized for storing binary large objects (BLOBs). In addition, Distributed File Systems can store unstructured data across multiple nodes, ensuring redundancy and scalability.

With the addition of unstructured data storage capability, the Data Warehouse evolves into the more versatile Data Lake.

Corresponding AI models involve things like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) used for computer vision, speech recognition and natural language processing.

Vector databases for Multimodal Generative AI

Multimodal Generative AI requires a new type of data storage: a vector database supporting vector embeddings that represent the semantic meaning of text, images, audio and video in a unified (vector) space. Representing data as high-dimensional vectors enables contextual understanding across all of these modalities.

Contextual semantic understanding and ability to associate meaning across different data modalities is both fascinating and revolutionary.

Such capability unlocks transformative AI use cases, from text-to-video tools to sophisticated virtual assistants interacting with the digital world, to humanoid-like robots operating efficiently and safely in the physical world. In all of these cases, multimodal AI models can interpret and act on inputs of varying modality with deep contextual awareness.

By converting each type of data into high-dimensional vectors, AI models can not only process but also associate meaning across different data modalities in a cohesive way. An AI system trained on language models can seamlessly associate its understanding of text with what it sees via machine vision or what it hears through speech recognition.
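
As a rough sketch of the underlying mechanism, the following compares embeddings with cosine similarity using numpy; the vectors are toy stand-ins for embeddings a real multimodal model would produce, and a production system would delegate this nearest-neighbor search to a vector database:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings produced by a multimodal model
text_query = np.array([0.9, 0.1, 0.3])   # e.g. embedding of "a dog playing fetch"
image_a = np.array([0.8, 0.2, 0.4])      # e.g. embedding of a dog photo
image_b = np.array([0.1, 0.9, 0.2])      # e.g. embedding of a city skyline photo

# Nearest-neighbor search over a (tiny) index
candidates = {"dog_photo": image_a, "skyline_photo": image_b}
best = max(candidates, key=lambda k: cosine_similarity(text_query, candidates[k]))
print(best)   # the image whose meaning is closest to the text query
```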

This will eventually result in a humanoid-like robot, capable of interacting with digital and physical worlds alike, understanding instructions not only based on text or voice but also on visual cues from its surroundings. That is, a physical AI agent that can speak about what it sees and act on what it hears, all while maintaining consistent understanding of its environment.

Storage for real-time inference

For real-time inference, storage solutions need to support high-speed data access and streaming capabilities. AI models require quick and efficient access to data without the delays that can occur in traditional storage systems.

Storage solutions include In-Memory Databases, such as Redis, that store data in RAM, enabling ultra-low-latency access. In addition, Time-Series Databases, such as Prometheus, are optimized for storing and querying time-stamped data from e.g. IoT sensors.
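
As a rough sketch of low-latency feature access for real-time inference, the following uses the redis-py client to cache and fetch precomputed features; the key naming and feature payload are hypothetical:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# A batch job or stream processor writes precomputed features per customer
features = {"avg_order_value": 87.5, "orders_last_30d": 4, "churn_risk_segment": "B"}
r.set("features:customer:12345", json.dumps(features), ex=3600)   # expire after an hour

# At inference time, the model service reads the features with very low latency
raw = r.get("features:customer:12345")
if raw is not None:
    model_input = json.loads(raw)
    print(model_input["churn_risk_segment"])
```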

Graph database for relationships-oriented use cases

Finally, a Graph Database is designed to model and store complex relationships within data using a graph structure, with nodes (representing entities) and edges (representing relationships between those entities).

Graph databases such as Amazon Neptune efficiently handle highly interconnected data by directly representing relationships, making it easier to query and explore connections in large, complex datasets.

Graph databases are often used alongside other storage solutions, from structured-data-based machine learning all the way to Multimodal Knowledge Graphs, where a graph database stores cross-modal relationships while a vector database manages vector embeddings for semantic search.
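
As a rough sketch of relationship-oriented querying, the following builds a tiny purchase graph with networkx; in production this role would typically be played by a dedicated graph database such as Amazon Neptune, and the entities and relations here are illustrative:

```python
import networkx as nx

# Small relationship graph: customers, products and their connections
G = nx.MultiDiGraph()
G.add_edge("alice", "laptop_x", relation="purchased")
G.add_edge("bob", "laptop_x", relation="purchased")
G.add_edge("bob", "dock_y", relation="purchased")
G.add_edge("laptop_x", "dock_y", relation="frequently_bought_with")

def co_purchases(customer: str) -> set[str]:
    """What else did people who bought the same products buy?"""
    bought = {v for _, v, d in G.out_edges(customer, data=True) if d["relation"] == "purchased"}
    peers = {u for item in bought for u, _, d in G.in_edges(item, data=True)
             if d["relation"] == "purchased" and u != customer}
    peer_items = {v for peer in peers for _, v, d in G.out_edges(peer, data=True)
                  if d["relation"] == "purchased"}
    return peer_items - bought

print(co_purchases("alice"))   # {'dock_y'}
```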

Overall, graph databases are essential for specific AI use cases where relationship analysis drives value.

Distributed inference at scale – case Nextdata

As discussed in Computing for AI at Scale, inference is fundamentally about distributed computing. In that context, AI model portability across a multitude of computing environments emerged as a key capability. The very same question applies to data: how to make data available for inference taking place potentially in thousands of edge devices?

Distributed inference manifests itself in two distinct ways:

  • Architectural decentralization – Inference typically needs to happen in different computing environments from cloud to on-premises data centers to edge devices. Each environment has different computing, storage and connectivity capabilities.
  • Physical distribution – Inference may need to take place in thousands of edge devices, often with limited connectivity capabilities in terms of continuous availability, bandwidth and latency. The data needed for inference may not be available in real-time via an API due to these connectivity limitations.

The ideal connectivity capability is easy to describe: always available, no bandwidth limitations, and very low latency. However, the closer we move towards the edge, the less common such an ideal becomes. Consequently, data portability emerges as an alternative to API-based data access.

In system design terms, the key question then becomes: Do we rely on APIs to access inference data, or do we instead place the data physically at the point of inference?

The answer depends on several design parameters (a simple decision sketch follows the list):

  • Connection reliability – Can the network guarantee continuous access to inference data via APIs? If not, how sensitive is the AI use case to intermittent connectivity? Are the related tasks mission-critical?
  • Connection latency – What is the network-induced latency? Is it too high for real-time use cases and tasks?
  • Data update frequency – How often does the inference data change?
  • Storage constraints – Do edge devices have enough storage capacity for local inference data?
  • Privacy and security – Are there legal or security constraints that require data to remain on the edge device or in a specific physical location?
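
As a rough sketch, the decision could be encoded along these lines; the thresholds, parameter names and example values are hypothetical and would need to be tuned per use case:

```python
from dataclasses import dataclass

@dataclass
class EdgeProfile:
    connection_uptime: float       # fraction of time the device is online, 0..1
    latency_ms: float              # typical network round-trip latency
    max_latency_ms: float          # latency budget of the AI use case
    local_storage_gb: float        # free storage on the device
    dataset_size_gb: float         # size of the inference data
    data_must_stay_local: bool     # privacy or regulatory constraint

def inference_data_strategy(p: EdgeProfile) -> str:
    if p.data_must_stay_local:
        return "local data"
    if p.connection_uptime > 0.99 and p.latency_ms < p.max_latency_ms:
        return "API access"        # default when connectivity is good enough
    if p.local_storage_gb >= p.dataset_size_gb:
        return "local data"        # ship the data to the point of inference
    return "hybrid (local cache with periodic sync)"

profile = EdgeProfile(connection_uptime=0.95, latency_ms=120, max_latency_ms=50,
                      local_storage_gb=32, dataset_size_gb=4, data_must_stay_local=False)
print(inference_data_strategy(profile))   # -> local data
```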

With good enough edge connectivity in place, and/or when AI use cases are not sensitive to connectivity limitations, traditional API-based access to inference data is the default choice. Having the data stored in a (more) centralized location makes maintaining and updating it easier.

When possible, choose API-based access to inference data. When not, look for alternative solution.

However, in the context of non-ideal connectivity and with more rigorous real-time use case requirements, inference data needs to be brought physically to the point of inference.

How to do that?

Nextdata has an intriguing solution to the problem, based on data products in portable containers. In that way, inference data accompanies AI models in the way discussed in Computing for AI at Scale – wherever inference is to take place.

Inference data accompanies AI models in portable containers wherever inference is to take place.

Containerized data products follow Data Mesh principles: Inference data is bundled with the necessary transformations and policies into modular, self-contained and highly portable units.

Embedding data with its own transformations and policies results in plug-and-play data products that are easy to deploy across all computing environments. Such modularity creates significant efficiencies for distributed inference scenarios, especially in the context of extensive AI-driven value creation at the edge. A true AI at Scale solution.
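
As a rough, hypothetical illustration of what such a self-contained, portable data product could declare, here is a sketch of a manifest; the structure and fields are purely illustrative and not Nextdata’s actual format:

```python
# Hypothetical manifest for a containerized data product shipped to edge devices
data_product = {
    "name": "store-demand-forecast-inputs",
    "owner_domain": "supply-chain",
    "version": "1.4.2",
    "data": {
        "format": "parquet",
        "path": "data/demand_inputs.parquet",
        "refresh_policy": "sync when connectivity is available",
    },
    "transformations": [
        "normalize_units.py",          # bundled transformation code
        "fill_missing_values.py",
    ],
    "policies": {
        "classification": "internal",
        "retention_days": 90,
        "allowed_environments": ["edge", "on-prem", "cloud"],
    },
    "serves_models": ["demand-forecast-v7"],   # models this data accompanies at inference
}
```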

Plug-and-play inference data modularity enables AI at Scale

Constraints Assessment

Symmetrically to computing, Constraints Assessment on data integration is about two things: a) verifying whether and how AI data needs in terms of training and inference are being served, and b) assessing the completeness and maturity of AI data solutions and capabilities.

Constraints Assessment investigates how AI data needs are being served and what is the level of AI data solutions’ maturity

Here are some broad examples of checks to be made in each key assessment area:

  • AI model training – Do we have access to the necessary training data in terms of large and diverse enough datasets? Do we have access to labeled data for supervised learning? Is the available data reliable in terms of quality and accuracy? Can we support the necessary AI use cases across all AI technology evolutionary steps?
  • Inference – Do we have access to data in real time in order to support use cases and tasks that require immediate response? Is the data accurate and timely? How well does our real-time inference solution scale across computing environments, especially at the edge?
  • Data Governance and Management – How clear is data ownership within the organization? How mature are data governance structures and practices? How well does data management scale to serve tens or even hundreds of concurrent AI use cases? If applicable, what is the status of Data Mesh deployment?
  • Batch data integration – What is the maturity level of the implemented integration architecture? How well does the architecture address key ETL/ELT functionality? How well have potential data sources been utilized and integrated? How scalable is the batch integration solution?
  • Real-time data integration – What are the real-time solution's capabilities in terms of latency and throughput? What is the data quality in real-time streams in terms of consistency and noisiness? How well does the solution scale to meet future needs?
  • DataOps – If applicable, what is the status of DataOps deployment? What is the overall maturity of industrial-grade data operations in terms of quality, scale and responsiveness? What is the level of automation of data workflows? How close are we to ideal Continuous Delivery of datasets (or data products) serving various data user needs?
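
One way to keep the assessment systematic is to record findings as simple maturity scores per area and tackle the weakest areas first. The snippet below is only an illustration; the 1–5 scale and the example scores are assumptions, not results from any actual assessment.

```python
# Illustrative maturity scores per assessment area (1 = ad hoc ... 5 = industrial-grade).
assessment = {
    "AI model training": 3,
    "Inference": 2,
    "Data Governance and Management": 3,
    "Batch data integration": 4,
    "Real-time data integration": 2,
    "DataOps": 1,
}

# The constraints to eliminate first are the lowest-scoring areas.
for area, score in sorted(assessment.items(), key=lambda item: item[1]):
    print(f"{area}: {score}/5")
```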

In practice, Constraints Assessment on data capabilities needs to be significantly more detailed and extensive than the examples above. The main thing is to be systematic in order to gain a reliable and accurate view of the current state.

Constraints Assessment needs to be extensive and systematic to result in a reliable and accurate current-state view

Data Strategy for AI at Scale

AI builds on computing and data. Computing strategy was discussed last time; now it is time to explore the key aspects of a data strategy.

The single biggest driver behind data strategy is the competitive landscape becoming AI-defined, with industry leaders pushing the Productivity Frontier with cutting-edge AI use cases. Naturally, AI technology evolution is the underlying force shaping competitive landscapes across all industries. Those dynamics were discussed in the article AI-defined competitive landscape.

In addition, the discussion on the Technology evolution perspective above showed how AI data needs increase fast when moving from traditional machine learning towards multimodal generative AI. From a business-oriented perspective, generic use cases like Natural Language Processing become significantly more powerful with the latest AI technology, as discussed in the article on AI technology evolution.

Through those perspectives, we are witnessing how AI data needs become increasingly complex as AI models progress from traditional analytics to advanced deep learning and generative AI applications:

  • Training data needs grow from structured, low-volume datasets to vast, diverse, unstructured and multimodal data sources.
  • Inference data needs evolve from batch processing to highly dynamic, real-time and context-aware data flows that support instantaneous AI decision-making and content generation.

Each evolutionary step brings its own data requirements, shaping the infrastructure, governance, and integration capabilities needed to support AI at Scale. As the competitive landscape pushes towards more sophisticated AI use cases, data strategy must accommodate increasingly diverse and intensive data requirements.

To keep up with the competition, data strategy must accommodate increasingly diverse and intensive data requirements.

In essence, it boils down to three things: 1) Strategic clarity, 2) Decision-making, and 3) Systematic data capability build-up. Decision-making in the Age of AI, marked by a high degree of complexity and uncertainty, was discussed in the earlier article Decision-making in AI transformation, which explored ways to add strategic clarity.

In terms of systematic capability build-up, Constraints Assessment is the first step – to identify and understand shortcomings and bottlenecks in order to start eliminating them one by one.

For systematic data capability build-up, Constraints Assessment is the first step.

Strategic alternatives and options

Most of the data solutions discussed above are not optional. The targeted implementation level may vary depending on business needs, but eventually there needs to be a solution for data governance, batch and real-time data integration, data cataloging, and data storage.

However, in the case of DataOps, Data Fabric, Data Mesh and real-time inference, there are strategic choices to be made. Let's have a closer look.

DataOps deployment

DataOps is a strategic option that has the potential to bring industrial-grade scale, quality, agility and speed to data operations. Combined with other digital capabilities – most of all AI Engineering and Software Engineering – it enables the Continuous Delivery that is at the heart of digital innovation.

DataOps is not mandatory in the early phases of AI transformation, but achieving AI at Scale does not appear viable without DataOps-level data capabilities.
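
To make industrial-grade data operations more tangible, the sketch below shows a minimal DataOps-style quality gate that could run in a Continuous Delivery pipeline before a dataset is released to downstream AI use cases. The check names and thresholds are illustrative assumptions, not a reference implementation of any specific tool.

```python
from datetime import datetime, timedelta, timezone

def quality_gate(rows: list[dict]) -> list[str]:
    """Return a list of failed checks; an empty list means the dataset may be released."""
    failures = []
    if not rows:
        failures.append("dataset is empty")
        return failures
    # Completeness: no missing customer identifiers.
    if any(r.get("customer_id") in (None, "") for r in rows):
        failures.append("missing customer_id values")
    # Freshness: the newest record must be less than 24 hours old.
    newest = max(r["event_time"] for r in rows)
    if datetime.now(timezone.utc) - newest > timedelta(hours=24):
        failures.append("data older than 24 hours")
    return failures

# Example run inside a pipeline step: block the release if any check fails.
rows = [{"customer_id": "c-1", "event_time": datetime.now(timezone.utc)}]
failed = quality_gate(rows)
print("release" if not failed else f"block: {failed}")
```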

Choice between Data Fabric and Data Mesh

Data Fabric is an excellent choice as an intermediate solution to boost data cataloging with a semantic layer over existing data infrastructure. However, achieving AI at Scale calls for more than that.

In terms of AI transformation and related change management, Data Mesh requires more effort to implement but – contrary to Data Fabric – facilitates business domains emerging as engines for digital innovation and value creation. It can be argued that, all things considered, this is the single biggest differentiator in the Age of AI.

Business Domains as engines for digital innovation and value creation appear to be the single biggest differentiator in the Age of AI.

Scalability of real-time inference

Distributed inference at scale is a significant system design and technology challenge. With less-than-ideal connectivity, real-time AI use cases at the edge remain unserved by API-based access to inference data.

Bringing inference data physically to the point of inference may not be that difficult – except when it needs to be done in a sustainable manner at scale. Containerization-enabled plug-and-play data products then start to appear as a lucrative strategic option.

Strategic objectives for digital capabilities

In the article on Alignment, five strategic objectives were set for digital capabilities: Scalability, Quality, Speed, Agility and Innovation. The previous article assessed those objectives from the AI computing perspective. Now is the time for a similar assessment from the perspective of data capabilities:

  • Scalability – Data capabilities have multiple characteristics that impact scalability. The two most significant ones have already been pointed out: Data Mesh and DataOps. However, it is clear that the difference between poor and well-functioning Data Governance is also vast in terms of the scale achieved. On a more technical layer, choices on data storage solutions may have a decisive impact too.
  • Quality – To some extent, quality follows the same pattern as scalability. Again, well-functioning Data Governance and DataOps with extensive automated testing tend to lead to good-quality data. However, as always, good quality is built into the way the organization operates overall, including decisions around data cataloging and metadata management.
  • Speed – Speed in terms of getting relevant, high-quality data into the hands of data users is influenced by many factors. In an ideal situation, well-established DataOps practices lead not only to scale but also to speed – both characteristics of industrial-grade data operations.
  • Agility – Agility builds on modularity. In the case of real-time inference, that might mean data products in containers. However, when agility is defined as responsiveness to changing market needs and opportunities, nothing beats a decentralized operating model, i.e. Data Mesh, with data colocated with the best possible domain expertise.
  • Innovation – An organization's ability to innovate builds on its way of working. In the context of AI at Scale, that is fundamentally about how well the organization is able to discover and integrate cutting-edge AI use cases. This cuts across virtually all data capabilities discussed here, from Data Governance for Quality Assurance to Data Mesh-enabled Business Domains working as engines for digital innovation and value capture.

Conclusions

AI builds on data. With the competitive landscape becoming AI-defined, underlying data capabilities make all the difference in terms of competitiveness in the Age of AI.

However, AI data requirements are increasingly demanding, both in terms of AI model training and inference. Each AI technology evolutionary step raises the bar higher.

Keeping up with the competition calls for strategic management of data capabilities. Not only is it critical to understand data capabilities' impact on financial performance – things like pricing power, operational efficiency, margins and growth – it is equally important to have an accurate view of the current state of those capabilities in order to improve.

The main purpose of this article has been to provide tools for strategic management – ultimately for better business performance in the Age of AI.

Constraints Assessment as a Service

Constraints Assessment as a Service covers all digital capability areas from Strategic Management to Data Culture. See the detailed Service Description.

AI at Scale Workshop

The AI at Scale workshop is a compact one-day event for business and technology executives and managers. The workshop seeks answers to the question: What should we as a business and as an organization do to secure our success in the Age of AI?

