Continuous Streaming Data Ingestion in AI Models

In this second article about AI Architecture, I'm going to cover the concept of continuous data streaming in the context of AI Model Inference and Processing.

As AI adoption grows quickly, as cloud providers mature their AI platform offerings - Azure AI and AWS Bedrock being two examples - and as open-source initiatives such as FloWise make it easier to experiment with models, the term "AI" pops up everywhere, with new applications appearing every day.

AI Platforms heavily rely on data ingestion, data curation and data embedding, as these are the "foundational" capabilities that ultimately put the source pieces of information in front of the AI Model, so that AI Models - specifically LLMs - can be trained and evaluated using the information provided. As models get tailored with specific information, they become "experts" in a certain field and can formulate responses to queries about concepts they have previously "learnt".

I'm going to refer to this three-stage data processing pipeline as the "Content Pipeline". A Content Pipeline encompasses all tasks that ingest the source data, transform and label it (curation), and convert the curated data into a format that is understandable by an AI model - a step known as "embedding". Embedding transforms curated data into a multi-dimensional vector representation with hundreds or thousands of dimensions (for example, 1536 dimensions for OpenAI's text-embedding-ada-002, the embedding model commonly paired with GPT-4). This vector contains numeric values that represent the semantic "meaning" of the individual pieces of text captured from the curated data.
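To make the embedding step concrete, below is a minimal sketch of turning one curated piece of text into its vector representation. It assumes the OpenAI Python client and the text-embedding-ada-002 model; any other embedding provider would follow the same pattern, and the embed() helper is purely illustrative.

```python
# Minimal embedding sketch - assumes the `openai` Python package and an
# API key exposed through the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    """Transform a curated piece of text into its vector representation."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # produces 1536-dimensional vectors
        input=text,
    )
    return response.data[0].embedding

vector = embed("Streaming data sources are meant to be consumed continuously.")
print(len(vector))  # 1536
```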

Below is a graphical representation of what a real Content Pipeline looks like, oriented to non-streaming (discrete) processing.

Discrete-processing oriented AI Content Pipeline


One important component within this Content Pipeline architecture is the data ingestion component. Data ingestion pulls from various sources, including streaming data sources; streaming data sources are sources of data that can be queried continuously and are, in fact, intended to be consumed continuously, as a linear continuum of information.

Streamed data differs from non-streamed data sources in that the data is ingested in a continuous manner. While you can implement a content pipeline oriented to discrete processing - such as PDF documents uploaded from time to time - and pull data following a cadence-oriented approach, non-discrete or continuous streaming has quite different requirements in terms of how a content pipeline must be architected in order to obtain a fully functional solution that can hold up over time:

  • Data Ingestion: First of all, data must flow in continuously, so data connectors must be able to stream the data and feed it in continuously; using a landing data store - a special type of document store - will not work, because the amount of data ingested is not compatible with a DB-centric processing approach. Data must typically be split into individual chunks and transmitted through a data fabric in the form of data events (see the ingestion sketch after this list).
  • Data Curation: typical data curation happens by reading documents already validated and ingested from a Landing Store, and that is valid for non-streamed data processing. For streaming data processing, validated data must be consumed from a data fabric in the form of events, and that consumption must be able to scale horizontally as needed, since - remember - data is flowing continuously. Curation of the data, involving annotation and contextualization, must happen on-the-fly as events flow through. Along the same lines, the resulting curated pieces of data cannot be stored in a curated data store, but are instead injected as "curated events" into the data fabric so that events can continue their flow through the pipeline.
  • Data Embedding: in the same way, embedding typically happens by reading curated documents from a Curated data store. In a streaming-oriented content pipeline, the curated data must instead be read from the data fabric and fed into the embedding process, which must be able to scale horizontally depending on the streaming volume. The resulting embedded data must be - following the same approach - fed back into the data fabric via events (see the embedding worker sketch after this list).
  • Ingestion of embedded data into Vector DB: ingesting embedded data events from the data fabric must occur at a much higher rate, so this stage necessarily needs to be designed for high performance. The Vector Database design must also be able to handle high volume and concurrency (see the Vector DB ingestion sketch after this list).
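To illustrate the ingestion requirement above, here is a minimal sketch of a connector that splits incoming source content into chunks and publishes each chunk as an event onto the data fabric. It assumes Apache Kafka (via the kafka-python client) as the data fabric; the topic name, chunk size and event schema are illustrative assumptions rather than part of the reference architecture.

```python
# Ingestion-stage sketch: split source content into chunks and publish each
# chunk as an event onto the data fabric.
# Assumes Kafka via kafka-python; topic, chunk size and schema are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def split_into_chunks(text: str, chunk_size: int = 1000) -> list[str]:
    """Naive fixed-size chunking; a real pipeline would split on semantic boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def ingest(document_id: str, text: str) -> None:
    """Publish every chunk of a source document as an 'ingested chunk' event."""
    for index, chunk in enumerate(split_into_chunks(text)):
        producer.send("ingested-chunks", {
            "document_id": document_id,
            "chunk_index": index,
            "content": chunk,
        })
    producer.flush()  # make sure all chunk events actually reach the fabric
```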
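The curation and embedding stages then follow the same consume, transform and re-publish pattern against the fabric. The sketch below shows an embedding worker that reads curated-chunk events, embeds them on-the-fly and emits embedded events for the next stage; the topic names, consumer group and embed() helper are illustrative assumptions. Running several replicas of this worker under the same consumer group is one concrete way to obtain the horizontal scale-out described above.

```python
# Streaming-stage sketch: consume curated events, embed them on-the-fly and
# re-publish them as "embedded" events. Assumes Kafka via kafka-python and
# the OpenAI embeddings API; topics and group_id are illustrative.
import json
from kafka import KafkaConsumer, KafkaProducer
from openai import OpenAI

openai_client = OpenAI()

def embed(text: str) -> list[float]:
    """Same illustrative helper as in the earlier embedding sketch."""
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding

consumer = KafkaConsumer(
    "curated-chunks",
    bootstrap_servers="localhost:9092",
    group_id="embedding-workers",  # same group_id on every replica -> work is shared
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in consumer:  # events keep flowing continuously
    chunk = event.value
    producer.send("embedded-chunks", {
        "document_id": chunk["document_id"],
        "chunk_index": chunk["chunk_index"],
        "content": chunk["content"],
        "embedding": embed(chunk["content"]),  # embedding happens per event, on-the-fly
    })
```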
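Finally, a sketch of the last stage: consuming embedded events from the fabric and upserting them into a vector database. Qdrant is used here purely as an example vector store, and the collection name and payload layout are assumptions; any vector DB able to sustain the required write volume and concurrency would fit the same slot.

```python
# Vector-DB ingestion sketch: consume "embedded" events and upsert them into a
# vector database. Assumes Kafka via kafka-python and a local Qdrant instance
# with a pre-created 1536-dimension collection named "content-pipeline".
import json
import uuid
from kafka import KafkaConsumer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

qdrant = QdrantClient(url="http://localhost:6333")

consumer = KafkaConsumer(
    "embedded-chunks",
    bootstrap_servers="localhost:9092",
    group_id="vectordb-writers",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for event in consumer:
    chunk = event.value
    # Deterministic UUID so re-processing the same chunk overwrites, not duplicates.
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL,
                              f"{chunk['document_id']}-{chunk['chunk_index']}"))
    # A production writer would batch points to sustain the high ingestion rate.
    qdrant.upsert(
        collection_name="content-pipeline",
        points=[
            PointStruct(
                id=point_id,
                vector=chunk["embedding"],
                payload={
                    "document_id": chunk["document_id"],
                    "content": chunk["content"],
                },
            )
        ],
    )
```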

As you can see, the architecture required for streaming-based processing of source data has considerably tougher requirements than a typical non-streamed one.

Streaming-based AI Content Pipeline


The architecture presented above decouples the various content pipeline stages by means of a data fabric that essentially connects the stages together via document-driven events. For this to work, individual stages must be designed with horizontal scalability in mind and deployed on top of a scalable infrastructure that can grow with demand. This approach guarantees near-real-time Content Pipeline processing, since source data is made available in the vector DB - and hence, available as knowledge to the AI Model - as soon as it appears in the source data store.

Typically, AI use cases are not oriented to bringing end users near-real-time capabilities, and business applications created on top of AI may make use of sporadic information that doesn't really require implementing a near-real-time approach like the one above. But think for a moment about other types of business requirements, such as real-time decision-making, that could leverage this approach.

< this will continue >....



