Continuous Streaming Data Ingestion in AI Models
Guillermo Wrba
Author of "Designing and Building Solid Microservice Ecosystems", independent consultant and solutions architect, evangelist of new technologies, distributed computing, and microservices.
In this second article about AI Architecture, I'm going to cover the concept of continuous data streaming in the context of AI Model Inference and Processing.
As AI adoption grows rapidly, as cloud providers mature their AI platform offerings - Azure AI and AWS Bedrock being two examples - and as open source initiatives for experimenting with models, such as FloWise, gain traction, the term "AI" comes up everywhere, with new applications appearing every day.
AI Platforms heavily rely on data ingestion, data curation, and data embedding, as these are the "foundational" capabilities that ultimately put the source pieces of information in front of the AI Model, so that AI Models - specifically LLMs - can be trained and evaluated using the information provided. As models get tailored with specific information, they become "experts" in a certain field and can formulate responses to queries about concepts they have previously "learnt".
I'm going to refer to this three-stage data processing pipeline as the "Content Pipeline". A Content Pipeline encompasses all the tasks that ingest, then transform and label (curate) the source data, and finally convert the curated data into a format that is understandable by an AI model - a step known as "embedding". Embedding transforms curated data into a multi-dimensional vector representation of hundreds or thousands of dimensions (for example, 1536 dimensions for OpenAI's text-embedding-ada-002 model). This vector contains numeric values that represent the semantic "meaning" of the individual words - more precisely, tokens - captured from the curated data.
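To make the embedding step concrete, here is a minimal sketch using OpenAI's embeddings API (the openai-python v1 SDK); the sample text is illustrative, and any embeddings provider with a similar API would serve the same role in the pipeline:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunk(text: str) -> list[float]:
    """Turn one curated text chunk into its vector representation."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # produces 1536-dimensional vectors
        input=text,
    )
    return response.data[0].embedding

vector = embed_chunk("Streaming ingestion decouples producers from consumers.")
print(len(vector))  # -> 1536

The resulting vectors are what gets stored in the vector DB, so that semantically similar chunks of curated data end up close together in the vector space.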
Below is a graphical representation of what a real Content Pipeline looks like, oriented to non-stream (discrete) processing.
One important component within this Content Pipeline architecture is data ingestion. Data ingestion pulls from various sources, including streaming data sources: sources of data that can be queried continuously and, in fact, are intended to be consumed continuously, as a linear continuum of information.
Streamed data differs from non-streamed data sources in that the data is ingested in a continuous manner. While you can implement a content pipeline oriented to discrete processing - such as PDF documents that are uploaded from time to time - and pull data following a cadence-oriented approach, non-discrete or continuous streaming has quite different requirements in terms of how a content pipeline must be architected in order to get a fully functional solution that can stand the test of time. The sketch below contrasts the two ingestion styles:
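The following sketch illustrates the contrast, assuming Kafka as the streaming source via the kafka-python client; the topic name, broker address, and the process_document handler are illustrative assumptions, not part of any specific product:

import pathlib
from kafka import KafkaConsumer

def process_document(payload: bytes) -> None:
    """Placeholder for the downstream curation + embedding stages."""
    ...

# Discrete / cadence-oriented ingestion: a scheduled job scans a drop
# folder, processes whatever is there, and exits until the next run.
def ingest_batch(folder: str) -> None:
    for pdf in pathlib.Path(folder).glob("*.pdf"):
        process_document(pdf.read_bytes())

# Continuous / streamed ingestion: the consumer never "finishes" -- it
# blocks on the topic and hands each record to the pipeline the moment
# the record arrives.
def ingest_stream() -> None:
    consumer = KafkaConsumer(
        "source-documents",              # hypothetical topic name
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )
    for record in consumer:              # an effectively infinite loop
        process_document(record.value)

The batch variant can tolerate a simple scheduler and restarts; the streamed variant has to run permanently, keep up with the arrival rate, and survive failures without losing its position in the stream, which is exactly why its architectural requirements are tougher.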
As you can see, the architecture we expect for streamed processing of source data has considerably stricter requirements than a typical non-streamed one.
The above presented architecture decouples the various content pipeline stages by means of a data fabric that essentially connects the stages together via document-driven events. For this to work, individual stages must be designed with horizontal scalability in mind and deployed on top of a scalable infrastructure that can grow on demand. This approach guarantees near-real-time Content Pipeline processing, since source data is made available in the vector DB - and hence available as knowledge to the AI Model - as soon as it becomes available in the source data store.
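As a minimal sketch of one such decoupled stage, here is a "curation" worker wired to the data fabric, again assuming Kafka; the topic names, broker address, and event shape are illustrative assumptions. The key mechanism is real, though: every instance of the stage joins the same consumer group, so the fabric spreads partitions across instances and the stage scales horizontally just by adding workers:

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "ingested-documents",            # events emitted by the ingestion stage
    group_id="curation-stage",       # add instances to this group to scale out
    bootstrap_servers="localhost:9092",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

def curate(raw: bytes) -> dict:
    """Placeholder: clean, label, and chunk the raw document."""
    return {"text": raw.decode("utf-8", errors="replace"), "labels": []}

for event in consumer:
    curated = curate(event.value)
    # Publish for the embedding stage; it reacts as soon as this event
    # lands, which is what keeps the pipeline near-real-time end to end.
    producer.send("curated-documents", curated)

Because each stage only knows about its input and output topics, stages can be deployed, scaled, and redeployed independently, which is the decoupling the data fabric buys you.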
Typically, AI use cases are not oriented to bringing end users near-real-time capabilities, and business applications created on top of AI may rely on sporadic information that doesn't really require a near-real-time approach like the one above. But think for a moment about other types of business requirements, such as real-time decision-making, that may leverage this approach.
< this will continue >....