Building an Efficient AI Ingestion Pipeline: Data Ingestion Strategies
Frank Denneman
Distinguished Engineer | Chief Technologist for AI | AI and Advanced Services | VMware Cloud Foundation Division | Broadcom
Traditionally, application deployment follows a straightforward path from development to production. For example, traditional enterprise applications interact with databases to perform standardized operations, requiring predictable resource management and maintenance tasks. However, Generative AI (Gen-AI) applications bring a new paradigm that is more flexible and complex. Unlike traditional applications, Gen-AI workloads require dynamic adaptability, as they interact with continuously changing data and must evolve to meet diverse and unpredictable demands. Gen-AI applications, especially those using Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) architectures, do not fit neatly into the linear path seen in traditional workloads. Instead, they follow a circular, adaptive lifecycle consisting of two main stages: the research phase and the production phase.
Understanding these two phases is crucial for infrastructure administrators managing and scaling AI-driven applications. This requires an approach that supports continuous change and adaptation rather than a single, linear deployment.
Research phase
The research phase in Gen-AI workloads resembles an ongoing lab experiment. During this phase, data scientists continuously test different models, tweak algorithms, and explore configurations to find the optimal setup that balances accuracy, efficiency, and adaptability. This environment is characterized by rapid iteration and change, prioritizing flexibility over stability. A data scientist might adjust the chunking size or test different indexing techniques for a vector database to improve retrieval speed. This process often involves multiple rounds of experimentation to determine which approach works best for a specific data set.
To enable this flexibility, data scientists need to be able to self-manage the resources required for their work. They should be able to easily access and configure infrastructure components like Deep Learning VMs, AI Kubernetes clusters, and Vector Databases. This self-service capability allows them to quickly adapt to new findings and iterate without waiting for manual intervention from infrastructure teams.
Production phase
The production phase, in contrast, focuses on operationalizing the Gen-AI application for real-world use. The primary goals of this phase are to ensure scalability, reliability, and consistent performance, enabling the system to meet user demands effectively. During this stage, the infrastructure must support stable and efficient operation as the application shifts from an experimental environment to one that serves end-users at scale.
In this phase, the insights gained during research are used to develop production-ready deployments. Unlike the research phase, where flexibility is key, the production phase emphasizes reliability and predictability. The workloads are designed for speed and stability, allowing the application to handle large-scale or numerous user queries, adapt to new incoming data in a controlled manner, and consistently provide accurate responses.
Containerized platforms play a central role in this phase, as they provide the ability to package the application in a consistent environment that can be easily scaled. These platforms, such as Kubernetes, enable horizontal scaling either automatically or through the assistance of DevOps teams, ensuring that workloads can expand in response to increased demand. For instance, new containers can be spun up to handle the additional load if user traffic spikes, maintaining a smooth user experience.
Data ingestion strategies
Data ingestion is a critical component of Gen-AI workflows, feeding the RAG system with the data it needs to generate relevant responses. Depending on the operational requirements—whether in the research phase or in production—different ingestion strategies are used. Here are the three primary data ingestion strategies:
Automated batch ingestion efficiently handles large volumes of data, running on a set schedule without manual intervention. This approach is ideal for processing substantial amounts of information during off-peak hours, allowing for consistent and predictable updates to the knowledge base hosted by the vector database. For example, a company with a large, static dataset, such as product catalogs or research papers, can use batch ingestion to update its system nightly. Automated batch ingestion helps ensure that resource use is predictable, allowing infrastructure administrators to allocate CPU, memory, and storage resources effectively, especially during scheduled off-peak hours.
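To make this concrete, here is a minimal sketch of what a nightly batch job can look like in Python. The source directory, the 02:00 start time, and the chunk_text, embed, and upsert_vectors helpers are all hypothetical stand-ins for whatever extraction tooling, embedding model, and vector database the pipeline actually uses.

```python
import time
from datetime import datetime, timedelta
from pathlib import Path

def chunk_text(text: str, size: int = 512) -> list[str]:
    """Split a document into fixed-size chunks (placeholder chunking logic)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Placeholder: replace with a call to the embedding model endpoint."""
    return [[0.0] * 768 for _ in chunks]

def upsert_vectors(vectors: list[list[float]], metadata: dict) -> None:
    """Placeholder: replace with the vector database client's upsert call."""
    print(f"stored {len(vectors)} vectors for {metadata['source']}")

def run_batch(source_dir: Path) -> None:
    """One batch pass: read every document, chunk, embed, and store it."""
    for doc in source_dir.glob("*.txt"):
        chunks = chunk_text(doc.read_text(encoding="utf-8", errors="ignore"))
        upsert_vectors(embed(chunks), metadata={"source": doc.name})

def next_run(hour: int = 2) -> datetime:
    """Next occurrence of the off-peak start time (02:00 by default)."""
    now = datetime.now()
    run = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    return run if run > now else run + timedelta(days=1)

if __name__ == "__main__":
    while True:
        time.sleep(max(0.0, (next_run() - datetime.now()).total_seconds()))
        run_batch(Path("/data/product-catalog"))  # hypothetical corpus location
```

In practice the scheduling loop is usually replaced by a cron entry, a Kubernetes CronJob, or an orchestrator task, but the resource profile is the same: one predictable burst of CPU, GPU, and I/O inside the agreed off-peak window.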
Streaming ingestion operates continuously, incorporating real-time changes to ensure access to current information. This method is invaluable for time-sensitive applications where minimizing latency between data creation and availability is crucial. From an infrastructure perspective, streaming ingestion can create unpredictable resource demands. ML Ops teams must monitor and manage resources closely to avoid bottlenecks, especially during high-volume data periods, which can impact computing, memory, and network usage. Cluster services like DRS and appropriate VM class selection help ensure resource availability.
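A streaming path differs mainly in its trigger: instead of waiting for a fixed window, the pipeline reacts to events as they arrive, often buffering a handful of records so the embedding model is not invoked one item at a time. The sketch below is a rough illustration that uses Python's in-process queue purely as a stand-in for a real message bus; the embed and upsert_vectors helpers are again hypothetical placeholders.

```python
import queue
import threading
import time

events: "queue.Queue[str]" = queue.Queue()   # stand-in for a message bus topic

def embed(chunks):                 # placeholder: call the embedding endpoint here
    return [[0.0] * 768 for _ in chunks]

def upsert_vectors(vectors):       # placeholder: write to the vector database here
    print(f"stored {len(vectors)} vectors")

def consume(batch_size: int = 32, max_wait_s: float = 2.0) -> None:
    """Drain the stream continuously, micro-batching to smooth resource spikes."""
    buffer: list[str] = []
    deadline = time.monotonic() + max_wait_s
    while True:
        try:
            buffer.append(events.get(timeout=0.5))
        except queue.Empty:
            pass
        # Flush when the buffer is full or the latency budget expires.
        if buffer and (len(buffer) >= batch_size or time.monotonic() >= deadline):
            upsert_vectors(embed(buffer))
            buffer.clear()
            deadline = time.monotonic() + max_wait_s

if __name__ == "__main__":
    threading.Thread(target=consume, daemon=True).start()
    for i in range(100):                       # simulated event producers
        events.put(f"document update {i}")
        time.sleep(0.05)
    time.sleep(3)                              # let the consumer drain the queue
```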
Manual ad-hoc ingestion allows on-demand processing of specific data pieces, offering great control and flexibility. This approach is valuable when critical information needs to be added immediately, such as incorporating new legal regulations or emergency advisories into a RAG system. While it provides precise control, it does not scale well to large or frequent updates and always requires human intervention. Infrastructure administrators must be aware that ad-hoc ingestion can result in sudden, unplanned resource utilization spikes, which could affect other ongoing processes if the environment is not sufficiently flexible.
Production data ingestion strategies
Automated batch and streaming ingestion are predominantly used in production workflows, where a steady, reliable flow of data into the system is required to keep the model updated and accurate.
When considering the infrastructure impact and resource consumption of an ingestion process, the data window and peak throughput are critical factors determining how effectively resources are utilized and how well the system can maintain data freshness.
Data Window
The data window represents the time frame during which data is collected, processed, and ingested into the system. A well-defined data window is crucial for managing batch jobs efficiently. The data window's length can significantly influence infrastructure requirements:
Short Data Window: A short data window means that ingestion jobs are more frequent, leading to smaller batches of data being processed at a time. This reduces latency, allowing for near real-time updates, but can increase the overhead of managing more frequent ingestion events. Infrastructure needs to be flexible and capable of handling consistent bursts of activity, ensuring that compute, memory, and network resources are dynamically available when needed.
Long Data Window: A longer data window implies larger batches processed less frequently, which can reduce the overhead associated with frequent ingestion operations. However, the larger volume of data processed at once can place significant demands on storage, compute, and I/O resources during those ingestion cycles. The infrastructure must be capable of efficiently managing peak resource demands during these larger batch processes.
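A back-of-the-envelope calculation makes the trade-off concrete. The numbers below are illustrative assumptions, not measurements; they simply show how the same daily data volume produces very different per-run peaks depending on the window length.

```python
# Illustrative assumption: 48 GB of new source data arrives per day.
daily_volume_gb = 48

for window_hours in (1, 6, 24):
    runs_per_day = 24 / window_hours
    batch_gb = daily_volume_gb / runs_per_day
    print(f"{window_hours:>2}h window: {runs_per_day:4.0f} runs/day, "
          f"~{batch_gb:5.1f} GB per run")

# A 1h window means 24 small ~2 GB bursts a day; a 24h window means a single
# ~48 GB peak that storage, compute, and I/O must be sized to absorb.
```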
One key aspect that must be considered is the data volume multiplication factor during ingestion: as data is extracted, transformed, and loaded into the vector database, its intermediate state is typically held in memory or written to storage. Proper cleanup is required when dealing with higher-volume datasets and long data windows. The next article in this series covers data volume multiplication in detail.
Peak Throughput
Peak throughput refers to the maximum data volume the system must handle during ingestion. Understanding and managing peak throughput is essential for maintaining system performance and avoiding bottlenecks:
Compute Resources: During peak throughput periods, CPU utilization can be high but often inconsistent, mainly due to the nature of data transfer and processing. As data moves from various sources into interim storage or memory, the load on CPU resources tends to spike, driven by the demands of the various ingestion and transformation tools. These tools may not be fully optimized for efficient resource management, resulting in fluctuating levels of CPU consumption. For example, a performance graph of the data-loading process showed that, although the machine has sixteen cores available, only four were intermittently utilized, even though the Python process was explicitly configured to use all sixteen cores.
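One common reason for such a gap is that only part of the pipeline is truly CPU-bound; stages that wait on disk, the network, or an external embedding endpoint leave cores idle no matter how many workers are configured. The sketch below illustrates process-based parallelism for the CPU-bound parsing and chunking step, using a hypothetical parse_and_chunk function and corpus path.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def parse_and_chunk(path: Path) -> int:
    """Hypothetical CPU-bound step: parse a document and split it into chunks."""
    text = path.read_text(encoding="utf-8", errors="ignore")
    chunks = [text[i:i + 512] for i in range(0, len(text), 512)]
    return len(chunks)

if __name__ == "__main__":
    files = sorted(Path("/data/source").glob("*.txt"))   # hypothetical corpus
    # max_workers only caps parallelism for this CPU-bound step; stages that
    # wait on storage or an external embedding service will not fill 16 cores.
    with ProcessPoolExecutor(max_workers=16) as pool:
        total = sum(pool.map(parse_and_chunk, files, chunksize=8))
    print(f"produced {total} chunks from {len(files)} files")
```

Profiling each stage separately (read, parse, chunk, embed, write) typically explains the difference between configured and observed core usage better than simply raising the worker count.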
There are several strategies to handle batch ingestion, each with distinct impacts on CPU usage and resource patterns. One approach involves collecting all new data from all sources before proceeding to the next step in the pipeline. This strategy concentrates the initial data gathering into a defined time frame, leading to high CPU and I/O activity, followed by the processing stages, which ensures that the embedding model running on the GPU can fully utilize its parallel processing power. This workflow makes resource usage spiky yet manageable across distinct phases.
An alternative strategy is to process data source by source, completing the entire ingestion pipeline for each dataset before moving on to the next one. This serial approach allows each source to be fully ingested, transformed, and stored before the next one starts, often resulting in shorter periods of intensive resource use followed by intervals of low activity. While each source is processed, CPU and GPU utilization can vary significantly, depending on the specific characteristics of the data and the quality of the tools used.
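The difference between the two orderings is easiest to see side by side. Both variants in the sketch below use the same hypothetical collect, transform, and embed_and_store helpers; only the loop structure, and therefore the shape of the resource peaks, changes.

```python
SOURCES = ["confluence", "sharepoint", "ticketing"]   # hypothetical data sources

def collect(source):            # I/O-heavy: pull new documents from one source
    return [f"{source}-doc-{i}" for i in range(3)]

def transform(docs):            # CPU-heavy: parse and chunk the documents
    return [f"chunk({d})" for d in docs]

def embed_and_store(chunks):    # GPU-heavy: embed and write to the vector DB
    print(f"embedded {len(chunks)} chunks")

def gather_all_then_process():
    """Strategy 1: one big collection phase, then one big GPU phase."""
    docs = [d for s in SOURCES for d in collect(s)]    # concentrated CPU/I-O spike
    embed_and_store(transform(docs))                   # single large GPU burst

def source_by_source():
    """Strategy 2: run the full pipeline per source, serially."""
    for s in SOURCES:
        embed_and_store(transform(collect(s)))         # shorter, repeated bursts

if __name__ == "__main__":
    gather_all_then_process()
    source_by_source()
```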
In both scenarios, the overall resource utilization follows a spiky pattern, with short bursts of high CPU and memory use alternating with idle periods. The duration of these idle periods varies depending on the strategy employed, data volume, and processing speed. For infrastructure administrators, understanding this spiky nature of resource consumption is crucial for optimizing resource allocation, whether by reserving CPU capacity to handle peak demands or by relying on DRS and the overall idleness of the cluster to adapt dynamically as resource needs change.
To efficiently support these types of workflows, infrastructure planning should account for the irregular CPU and memory usage peaks during ingestion. This includes provisioning adequate burst capacity, balancing workload timing to avoid overlap with other critical processes, and considering whether dynamic scaling or resource quotas might help minimize inefficiencies while maintaining flexibility.
By recognizing the uneven, high-demand resource pattern during the data window, infrastructure administrators can make informed decisions to ensure a smooth data pipeline.
Storage I/O: Peak throughput impacts storage requirements, particularly regarding I/O operations per second (IOPS). High-throughput ingestion generates a large number of read/write operations, especially if embeddings are calculated and saved in real time. To avoid I/O bottlenecks, it’s important to have high IOPS capabilities, such as an all-flash storage platform that can handle concurrent access efficiently.
Network Bandwidth: The ingestion process often involves transferring data between storage, processing nodes, and potentially external sources. Peak throughput pressures network bandwidth, especially if the ingestion involves high volumes of data from different locations. Ensuring that the infrastructure has sufficient network capacity or that ingestion windows are timed to avoid competing network traffic can prevent bottlenecks and maintain overall performance.
Memory Utilization: Memory usage can spike during peak ingestion periods, particularly if data transformations, such as chunking or embedding generation, require large in-memory operations on both CPU and GPU. The infrastructure must have sufficient RAM available to handle these processes without triggering excessive swapping or memory thrashing, which would degrade performance.
Data Freshness
With production ingestion strategies for a RAG data pipeline, data freshness and timing are key considerations, especially when drawing from multiple sources and possibly supporting a global workforce.
Data freshness refers to how quickly updates from data sources are reflected in the system, directly impacting the relevance and accuracy of Gen-AI applications in production. For cases where the latest information is critical, such as customer support, the acceptable lag between data updates and ingestion must be minimized. Frequent data updates may require shorter ingestion intervals to maintain freshness, while less critical updates can be handled in off-peak hours to reduce load.
The timing of data ingestion must also account for the working patterns of (global) teams. When ingestion windows are planned around typical after-office hours, such as ingesting Confluence pages at night, the Gen-AI application contains all the new data when a new office day starts. However, in a global company, work is happening across all time zones, making it impractical to define a single period of low activity. To address this, staggered ingestion schedules may be employed, with updates occurring at different times based on regional work patterns. Alternatively, micro-batching (running smaller, more frequent ingestion processes throughout the day) distributes the load evenly and helps maintain continuous data freshness while balancing system resource demands.
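As an illustration, a staggered schedule can be expressed as a small table of regions and their local off-peak windows, which a scheduler then translates into UTC slots. The regions, offsets, and times below are invented for the example.

```python
from datetime import time

# Hypothetical regions with their UTC offsets and preferred local off-peak start.
REGIONS = {
    "EMEA": {"utc_offset": 1,  "local_start": time(1, 0)},
    "APAC": {"utc_offset": 8,  "local_start": time(1, 0)},
    "AMER": {"utc_offset": -5, "local_start": time(1, 0)},
}

def utc_start(region: str) -> time:
    """Translate a region's local off-peak start into a UTC ingestion slot."""
    cfg = REGIONS[region]
    hour = (cfg["local_start"].hour - cfg["utc_offset"]) % 24
    return time(hour, cfg["local_start"].minute)

for region in REGIONS:
    print(f"{region}: ingest regional sources at {utc_start(region)} UTC")
# EMEA at 00:00 UTC, APAC at 17:00 UTC, AMER at 06:00 UTC: three smaller,
# staggered batch runs instead of one global peak.
```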
Infrastructure load must also be managed carefully, ensuring sufficient capacity is available for ingestion tasks without disrupting other services. Scheduling ingestion during periods of lower system load, ideally when computational and storage demands are at their lowest, prevents resource contention and ensures optimal performance. This may involve leveraging cloud infrastructure to ingest data closer to its source in a global scenario, optimizing throughput and reducing latency.
Considering the nature of different data sources is also crucial. Not all sources change with the same frequency or importance. Internal knowledge documents may need frequent updates, while external datasets or archival records may only require occasional ingestion. Assigning ingestion schedules based on the update frequency of each data source helps maintain efficiency without compromising on freshness. Additionally, prioritizing critical content, such as customer-facing documentation, ensures that the most relevant information is always up-to-date.
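One lightweight way to encode this policy is a per-source schedule that records how often each source should be re-ingested and how critical it is, so the scheduler always picks the most overdue, highest-priority source first. The source names and intervals below are examples, not recommendations.

```python
from datetime import datetime, timedelta

# Example policy: refresh interval and priority per source (1 = most critical).
SOURCE_POLICY = {
    "customer-facing-docs": {"interval": timedelta(hours=4), "priority": 1},
    "internal-wiki":        {"interval": timedelta(days=1),  "priority": 2},
    "archived-reports":     {"interval": timedelta(days=30), "priority": 3},
}

last_ingested = {name: datetime.min for name in SOURCE_POLICY}

def due_sources(now: datetime) -> list[str]:
    """Return the sources whose refresh interval has elapsed, most urgent first."""
    due = [
        name for name, policy in SOURCE_POLICY.items()
        if now - last_ingested[name] >= policy["interval"]
    ]
    return sorted(due, key=lambda name: SOURCE_POLICY[name]["priority"])

print(due_sources(datetime.now()))
# All three sources are due on the first run; afterwards each one reappears
# only when its own interval has elapsed.
```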
Ultimately, successful data ingestion for a RAG pipeline relies on a careful balance between data freshness, infrastructure efficiency, and a clear understanding of user needs. Scheduling must be flexible enough to accommodate global usage patterns while adaptive enough to respond to changing requirements. By continually monitoring activity and adapting batch schedules as needed, infrastructure teams can ensure that data is ingested optimally, supporting operational efficiency and an up-to-date knowledge base in a rapidly evolving environment.
Research data ingestion strategies
In the research phase, data ingestion tends to be more exploratory and flexible, focusing on manual ad-hoc ingestion. Here, data scientists experiment with different data sources, formats, and ingestion methods to evaluate their impact on model performance.
However, some elements of automated batch processing can be mimicked in the research phase by using scripts in (Jupyter) notebooks to perform bulk data loads on a trial basis. These ad-hoc batch loads provide insights into resource needs, data processing times, and scalability considerations, helping data scientists refine ingestion methods before implementing them in production. In this way, the research phase serves as a proving ground where different ingestion strategies are tested and tuned before they are transitioned to production.
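In a notebook, the same trial batch load can be wrapped in simple timing and memory probes to produce a first estimate of production resource needs. The sketch below uses only the standard library; run_batch stands in for whatever ingestion routine is being evaluated.

```python
import time
import tracemalloc

def run_batch(n_docs: int = 1000) -> int:
    """Stand-in for the trial ingestion routine being evaluated in the notebook."""
    chunks = [f"chunk-{i}" for i in range(n_docs * 10)]
    return len(chunks)

tracemalloc.start()
start = time.perf_counter()

processed = run_batch()

elapsed = time.perf_counter() - start
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{processed} chunks in {elapsed:.2f}s, "
      f"peak Python heap ~{peak_bytes / 1_048_576:.1f} MiB")
# Numbers like these, gathered per data source, feed directly into capacity
# planning for the production batch windows described earlier.
```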
Conclusion
In conclusion, managing resource utilization and ensuring peak throughput during the data window of an ingestion pipeline involves understanding the spiky and inconsistent nature of resource demands. Infrastructure administrators must consider strategic scheduling, dynamic scaling, and appropriate capacity planning to navigate these challenges effectively.
In the following article, we will explore the impact of data volume multiplication and how data growth across multiple versions, transformations, and sources can amplify infrastructure demands and complicate resource planning. We will also explore strategies to manage this growing volume efficiently and maintain a scalable, cost-effective system for supporting RAG workloads.