AI is getting messy - Let's grab a RAG
Anthony M. Gonzales, MBA
Human Capital | Life Sciences | Serial Operator | Accredited Investor | Board Member | Venture Partner | 1° Black Belt BJJ
Think of all the people you know: data scientists, medical professionals, engineers, CEOs, students, grandmas, employees, and everyone in between. They all want quick, accurate, and relevant information in response to increasingly complex inquiries, which forces new strategies and algorithms to be applied to LLMs. Very large amounts of data and documentation are generally required to produce answers to conversation-style questions.
This blog explores the technical challenges that arise from our need for insightful information, and how modern retrieval-augmented generation (RAG) frameworks can provide a solution in both narrow and broad contexts – think medical documentation versus Encyclopedia Britannica. Our goal, unsurprisingly, is to apply these models to reduce costs, speed up responses, and improve the accuracy of answers.
Problem and Consequences
LLMs have a property called ‘context window’ which determines the number of tokens they can process, including input (prompt) and output (completion) tokens (for more information on tokens, read our previous blog). Recent R&D is pushing context windows to be longer, and industry-leading LLMs like Claude from Anthropic can process up to 100k tokens (roughly 75,000 words).
Still, many business and research applications might benefit from processing even larger volumes of text with an LLM, with use cases such as information retrieval, summarization, question answering, etc.
For example, a company might want to index its entire internal knowledge base of documents, reports, customer conversations, and so on – and use an LLM to quickly access this information in a conversation-style format.
Even when a corpus of text fits within a context window, it is unreasonably slow to tokenize and transmit the entire set of documents to the LLM whenever one needs it to respond to a question based on the given text. It’s expensive too: at the current price of $11 per million input tokens, asking a question with Shakespeare’s works (approximately 885K words) as the primary source would cost about $13 per inquiry!
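As a rough illustration of that estimate (assuming the common approximation of about 0.75 words per token and the price quoted above; actual tokenizer output and pricing vary by model and provider):

```python
# Back-of-the-envelope cost of sending a large corpus as context on every query.
words = 885_000                      # approximate word count of Shakespeare's works
tokens_per_word = 4 / 3              # rough heuristic: ~0.75 words per token
price_per_million_tokens = 11.00     # USD per million input tokens (quoted above)

tokens = words * tokens_per_word
cost_per_query = tokens / 1_000_000 * price_per_million_tokens

print(f"~{tokens:,.0f} input tokens -> ~${cost_per_query:.2f} per question")
# ~1,180,000 input tokens -> ~$12.98 per question
```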
Solution
Our problem is addressed by a recently developed approach by Facebook, called retrieval-augmented generation (RAG). The idea is to use embedding vectors and a similarity search (both defined further below) to circumvent the need to pass all context (an entire knowledge base) to the LLM and instead only pass the parts relevant to answering a query.
In the following, we’ll provide a high-level view of the steps involved in executing retrieval-augmented generation, divided into two phases. This is followed by a deeper dive into the technology and models that play significant roles in the RAG solution. Lastly, we provide recommendations on how to adapt or modify elements of the solution in technical detail.
The described approach is visualized in the following high-level diagram:
High Level
1. Indexing flow: split the knowledge base into chunks, compute an embedding vector for each chunk with an embedding model, and store the vectors (with references to their source chunks) in a vector store.
2. Query flow: embed the user’s question with the same model, run a vector similarity search to retrieve the most relevant chunks, and pass those chunks to the LLM as context alongside the question.
In this approach, the vector similarity query functions as a pre-filtering step, so the LLM only needs to process the text preliminarily determined to be relevant to the question based on vector similarity. The results should be highly relevant to the query (as long as the applicable information has been indexed), quicker to process (using fewer tokens), and potentially less expensive to execute. A minimal code sketch of both flows follows.
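Here is a minimal sketch of both flows, assuming the open-source sentence-transformers package for embeddings and a plain numpy array as the “vector store”; a production system would use one of the vector libraries or databases discussed below and send the assembled prompt to an LLM of your choice.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small open-source embedding model

# --- Indexing flow: embed each document chunk and keep the vectors ---
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by email 24/7.",
    "The 2023 annual report shows revenue growth of 12%.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# --- Query flow: embed the question, retrieve the most similar chunks ---
question = "How long do customers have to return a product?"
query_vector = model.encode([question], normalize_embeddings=True)[0]

scores = chunk_vectors @ query_vector             # cosine similarity (vectors are normalized)
top_k = np.argsort(-scores)[:2]                   # indices of the 2 most relevant chunks
context = "\n".join(chunks[i] for i in top_k)

prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)                                     # this is what gets sent to the LLM
```

Only the retrieved chunks travel with the question, not the whole knowledge base, which is what keeps token counts and costs down.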
Below is a detailed diagram depicting the steps described above:
And now the same diagram, annotated with details of each step and tech used:
Technology and Models
Advisory for readers: this is getting deep, though not necessarily Mariana Trench deep - that’s coming up next.
Vector similarity search
The technologies that implement vector search (also called ANN – approximate nearest neighbor search) can be divided into vector libraries and vector databases.
Vector libraries typically provide bare bones vector search functionality and tend to store vectors in memory only. Their index is usually immutable, meaning it doesn’t support deletions and updates (e.g., replacing a vector associated with a document with another vector). The most popular and best-performing algorithm for an approximate search of nearest neighbors is called HNSW (Hierarchical Navigable Small World). Examples of well-known and mature vector libraries include hnswlib and faiss.
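As a small illustration of the library approach, here is a minimal sketch using hnswlib; the dataset and parameter values are arbitrary placeholders, and in a real system the vectors would come from an embedding model.

```python
import numpy as np
import hnswlib

dim, num_elements = 384, 10_000
vectors = np.float32(np.random.random((num_elements, dim)))   # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_elements))             # IDs 0..N-1

index.set_ef(50)                                              # query-time accuracy/speed knob
labels, distances = index.knn_query(vectors[:1], k=5)         # approximate nearest neighbors
print(labels, distances)                                      # IDs must be mapped back to documents yourself
```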
Vector databases, on the other hand, refer to vector stores with richer functionality and implement operations typically supported by traditional databases, such as mutations (updates/deletions). Vector databases also allow storing other data besides vectors, such as the objects a vector is associated with; conversely, vector libraries respond to search queries with object IDs, requiring secondary storage to retrieve the source object by object ID. Vector databases are either built from scratch or based on vector libraries and add more functionality on top.
Vector databases are preferred for building production systems that require horizontal scaling and reliability guarantees (backups, fault tolerance, monitoring/alerting). They also support replication (for increasing reliability) and sharding (to support larger datasets). Examples of popular/mature vector databases include:
There are also established databases and search engines that have vector search capabilities enabled via plugins:
You can compare these and more vector databases here: https://objectbox.io/vector-database/
All DBs in the list above are distributed - operating across multiple servers.
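To illustrate the difference in developer experience, below is a minimal sketch using the qdrant-client Python package, chosen purely as one example of a vector database; exact API details differ between databases and client versions, so treat this as a sketch rather than a reference.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")                 # embedded mode for local experiments

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

# Unlike a vector library, the database stores a payload next to each vector,
# so search results come back with the source text attached.
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.9, 0.1, 0.0], payload={"text": "Refund policy: 30 days."}),
        PointStruct(id=2, vector=[0.8, 0.1, 0.0, 0.1], payload={"text": "Support hours: 24/7."}),
    ],
)

hits = client.search(collection_name="docs", query_vector=[0.2, 0.8, 0.0, 0.0], limit=1)
print(hits[0].payload["text"], hits[0].score)
```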
Embedding models
Embedding models turn a piece of text into a numeric representation (a vector) that is generally 50 to 1,000+ numbers in length. Choosing a model is the most important part of ensuring the quality of results in a retrieval-augmented generation system. A high-quality embedding model has an efficient architecture with a sufficient number of parameters and is trained on a large and diverse corpus of text. For similarity search purposes, a ‘good’ model produces vectors that properly capture semantic meaning; in other words, texts that are meaningfully similar should be close to each other in the vector space under the chosen distance measure. The quality of embedding models is evaluated with benchmarks, for example the Massive Text Embedding Benchmark (MTEB). Most RAG systems use a model pre-trained on a large dataset instead of training one from scratch.
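As a quick sanity check of the “semantically similar texts are close” property, the following sketch (assuming the sentence-transformers package and a small pre-trained model) compares related and unrelated sentences:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")   # 384-dimensional embeddings

a = model.encode("The patient was prescribed antibiotics for the infection.")
b = model.encode("The doctor ordered an antibacterial treatment.")
c = model.encode("Quarterly revenue exceeded analyst expectations.")

print(util.cos_sim(a, b))   # relatively high: related medical statements
print(util.cos_sim(a, c))   # noticeably lower: unrelated finance statement
```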
Let’s look at the model landscape:
Proprietary models
Open-source models
Specialized models
Multilingual models
Tuning and Considerations in Technical Detail
Secondary advisory for readers: we’re delving deep into the details here - it’s pretty much the Mariana Trench.
Embedding dimensionality
When choosing an embedding model, it is important to consider the dimensionality of the vectors it produces, for the following reasons:
For example, the embeddings provided by OpenAI (text-embedding-ada-002 model) use 1,536 dimensions. This dimensionality is on the larger side, and using these vectors may incur considerable computational cost. Depending on the use case, a small model (such as all-MiniLM-L12-v2 from Sentence-Transformers, with a dimension of 384) may perform just as well accuracy-wise with a significant reduction in cost and latency.
It is important to recognize that each dimension in an embedding vector captures some nuance of meaning. When considering models for analyzing text from multiple domains, higher dimensionality may be preferable; applications in narrow domains (such as medical) that operate with smaller and/or standardized sets of terms might work well with fewer dimensions.
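A rough back-of-the-envelope comparison of the storage cost of those two dimensionalities, assuming float32 vectors and one million indexed chunks (numbers chosen only for illustration):

```python
# Raw vector storage scales linearly with embedding dimensionality.
num_chunks = 1_000_000
bytes_per_float = 4

for name, dim in [("all-MiniLM-L12-v2", 384), ("text-embedding-ada-002", 1536)]:
    gb = num_chunks * dim * bytes_per_float / 1024**3
    print(f"{name}: {dim} dims -> ~{gb:.2f} GB of raw vectors")
# all-MiniLM-L12-v2: 384 dims -> ~1.43 GB of raw vectors
# text-embedding-ada-002: 1536 dims -> ~5.72 GB of raw vectors
```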
Vector store tuning
Vector databases, and especially vector libraries, expose multiple parameters controlling how the underlying index (most commonly based on HNSW) is built and queried.
The primary purpose of these parameters is to balance quality against performance. Performance can be measured as query latency, the number of queries processed per second, or both. Search quality in approximate nearest neighbor search is measured with recall: the fraction of the returned documents that are true nearest neighbors of the given query vector.
The parameters below impact the search behavior:
The following parameters control index construction:
Search parameters
Both of these parameters can be configured so that many documents are considered when the LLM is instructed to answer the query given the context comprised of those documents. Generally, this is not detrimental to accuracy, since LLMs can be expected to “figure out” which parts of the context are irrelevant.
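As an illustration of how these knobs trade recall against speed, the following sketch measures the recall of an hnswlib index against exact brute-force search on random vectors; all parameter values here are illustrative only.

```python
import numpy as np
import hnswlib

dim, n, k = 128, 20_000, 10
rng = np.random.default_rng(0)
data = np.float32(rng.random((n, dim)))
queries = np.float32(rng.random((100, dim)))

# Exact nearest neighbors by brute force (squared L2 distances).
d2 = (queries**2).sum(1)[:, None] - 2 * queries @ data.T + (data**2).sum(1)[None, :]
exact = np.argsort(d2, axis=1)[:, :k]

# Approximate index; ef_construction and M are build-time parameters.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=100, M=16)
index.add_items(data)

# ef is the query-time parameter: higher ef -> better recall, slower queries.
for ef in (16, 50, 200):
    index.set_ef(ef)
    approx, _ = index.knn_query(queries, k=k)
    recall = np.mean([np.intersect1d(a, e).size / k for a, e in zip(approx, exact)])
    print(f"ef={ef}: recall@{k} = {recall:.3f}")
```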
Conclusion
Managing the speed and cost of responding to increasingly complex queries made against LLMs is a priority for most organizations delivering this capability. Users demand accuracy, most without recognizing the volume of material needed to provide it (why not ingest every known book ever published?) or the resources involved. We believe there are opportunities to meet most of our users’ and customers’ needs AND balance cost and latency by using the retrieval-augmented generation approach and modifying it as necessary to produce the results best suited to the audience making the inquiries. In our next posts, we’ll review parsing and experiment with parsers that efficiently interpret large quantities of complex data in a variety of formats and document structures - an additional strategy to reduce processing costs and accurately convey insights.
Written by: Dmitriy Tkachenko of Proxet - Connect with him here: linkedin.com/in/dmitriy-tkachenko