AI is getting messy - Let's grab a RAG
Anthony M. Gonzales, MBA
Human Capital | Life Sciences | Serial Operator | Accredited Investor | Board Member | Venture Partner | 1° Black Belt BJJ
Think of all the people you know: data scientists, medical professionals, engineers, CEOs, students, grandmas, employees, and everyone in between. They all want quick, accurate, and relevant information in response to increasingly complex inquiries, which forces new strategies and algorithms to be applied to LLMs. Very large amounts of data and documentation are generally required to produce answers to conversation-style questions.
This blog explores the technical challenges that arise from our need for insightful information, and how modern retrieval-augmented generation (RAG) frameworks can provide a solution in both narrow and broad contexts – think medical documentation versus Encyclopedia Britannica. Our goal, unsurprisingly, is to apply these models to reduce costs, speed up responses, and improve the accuracy of answers.
Problem and Consequences
LLMs have a property called ‘context window’ which determines the number of tokens they can process, including input (prompt) and output (completion) tokens (for more information on tokens, read our previous blog). Recent R&D is pushing context windows to be longer, and industry-leading LLMs like Claude from Anthropic can process up to 100k tokens (roughly 75,000 words).
Still, many business and research applications might benefit from processing even larger volumes of text with an LLM, with use cases such as information retrieval, summarization, question answering, etc.
For example, a company might want to index its entire internal knowledge base of documents, reports, customer conversations, and so on – and use an LLM to quickly access this information in a conversation-style format.
Even when a corpus of text fits within a context window, it is unreasonably slow to tokenize and transmit the entire set of documents to the LLM whenever one needs it to respond to a question based on the given text. It’s expensive too: at the current price of $11 per million input tokens, asking a question with Shakespeare’s works (approximately 885K words) as the primary source would cost about $13 per inquiry!
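As a rough illustration of that estimate (assuming the common approximation of about 0.75 words per token and the price quoted above; actual tokenizer output and pricing vary by model and provider):

```python
# Back-of-the-envelope cost of sending a large corpus as context on every query.
words = 885_000                      # approximate word count of Shakespeare's works
tokens_per_word = 4 / 3              # rough heuristic: ~0.75 words per token
price_per_million_tokens = 11.00     # USD per million input tokens (quoted above)

tokens = words * tokens_per_word
cost_per_query = tokens / 1_000_000 * price_per_million_tokens

print(f"~{tokens:,.0f} input tokens -> ~${cost_per_query:.2f} per question")
# ~1,180,000 input tokens -> ~$12.98 per question
```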
Solution
Our problem is addressed by a recently developed approach by Facebook, called retrieval-augmented generation (RAG). The idea is to use embedding vectors and a similarity search (both defined further below) to circumvent the need to pass all context (an entire knowledge base) to the LLM and instead only pass the parts relevant to answering a query.
In the following, we’ll provide a high-level view of the steps involved in executing retrieval-augmented generation, divided into two phases. This is followed by a deeper dive into the technology and models that play significant roles in the RAG solution. Lastly, we provide recommendations on how to adapt or modify elements of the solution in technical detail.
The described approach is visualized in the following high-level diagram:
High Level
1. Indexing flow: split the knowledge base into chunks, compute an embedding vector for each chunk with an embedding model, and store the vectors (with references to their source chunks) in a vector store.
2. Query flow: embed the user’s question with the same model, run a vector similarity search to retrieve the most relevant chunks, and pass those chunks to the LLM as context alongside the question.
In this approach, the vector similarity query functions as a pre-filtering step, so the LLM only needs to process the text preliminarily determined to be relevant to the question based on vector similarity. The results should be highly relevant to the query (as long as the applicable information has been indexed), quicker to process (using fewer tokens), and potentially less expensive to execute. A minimal code sketch of both flows follows.
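Here is a minimal sketch of both flows, assuming the open-source sentence-transformers package for embeddings and a plain numpy array as the “vector store”; a production system would use one of the vector libraries or databases discussed below and send the assembled prompt to an LLM of your choice.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small open-source embedding model

# --- Indexing flow: embed each document chunk and keep the vectors ---
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by email 24/7.",
    "The 2023 annual report shows revenue growth of 12%.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# --- Query flow: embed the question, retrieve the most similar chunks ---
question = "How long do customers have to return a product?"
query_vector = model.encode([question], normalize_embeddings=True)[0]

scores = chunk_vectors @ query_vector             # cosine similarity (vectors are normalized)
top_k = np.argsort(-scores)[:2]                   # indices of the 2 most relevant chunks
context = "\n".join(chunks[i] for i in top_k)

prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)                                     # this is what gets sent to the LLM
```

Only the retrieved chunks travel with the question, not the whole knowledge base, which is what keeps token counts and costs down.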
Below is a detailed diagram depicting the steps described above:
And now the same diagram, annotated with details of each step and tech used:
Technology and Models
Advisory for readers: this is getting deep, though not necessarily Mariana Trench deep - that’s coming up next.
Vector similarity search
The technologies that implement vector search (also called ANN – approximate nearest neighbor search) can be divided into vector libraries and vector databases.
Vector libraries typically provide bare bones vector search functionality and tend to store vectors in memory only. Their index is usually immutable, meaning it doesn’t support deletions and updates (e.g., replacing a vector associated with a document with another vector). The most popular and best-performing algorithm for an approximate search of nearest neighbors is called HNSW (Hierarchical Navigable Small World). Examples of well-known and mature vector libraries include hnswlib and faiss.
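As a small illustration of the library approach, here is a minimal sketch using hnswlib; the dataset and parameter values are arbitrary placeholders, and in a real system the vectors would come from an embedding model.

```python
import numpy as np
import hnswlib

dim, num_elements = 384, 10_000
vectors = np.float32(np.random.random((num_elements, dim)))   # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(vectors, np.arange(num_elements))             # IDs 0..N-1

index.set_ef(50)                                              # query-time accuracy/speed knob
labels, distances = index.knn_query(vectors[:1], k=5)         # approximate nearest neighbors
print(labels, distances)                                      # IDs must be mapped back to documents yourself
```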
Vector databases, on the other hand, refer to vector stores with richer functionality and implement operations typically supported by traditional databases, such as mutations (updates/deletions). Vector databases also allow storing other data besides vectors, such as the objects a vector is associated with; conversely, vector libraries respond to search queries with object IDs, requiring secondary storage to retrieve the source object by object ID. Vector databases are either built from scratch or based on vector libraries and add more functionality on top.
Vector databases are preferred for building production systems that require horizontal scaling and reliability guarantees (backups, fault tolerance, monitoring/alerting). They also support replication (for increasing reliability) and sharding (to support larger datasets). Examples of popular/mature vector databases include:
There are also established databases and search engines that have vector search capabilities enabled via plugins:
You can compare these and more vector databases here: https://objectbox.io/vector-database/
All DBs in the list above are distributed - operating across multiple servers.
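To illustrate the difference in developer experience, below is a minimal sketch using the qdrant-client Python package, chosen purely as one example of a vector database; exact API details differ between databases and client versions, so treat this as a sketch rather than a reference.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(":memory:")                 # embedded mode for local experiments

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

# Unlike a vector library, the database stores a payload next to each vector,
# so search results come back with the source text attached.
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.9, 0.1, 0.0], payload={"text": "Refund policy: 30 days."}),
        PointStruct(id=2, vector=[0.8, 0.1, 0.0, 0.1], payload={"text": "Support hours: 24/7."}),
    ],
)

hits = client.search(collection_name="docs", query_vector=[0.2, 0.8, 0.0, 0.0], limit=1)
print(hits[0].payload["text"], hits[0].score)
```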
Embedding models
Embedding models turn a piece of text into a numeric representation (a vector) that is generally 50 to 1,000+ numbers in length. Choosing a model is the most important part of ensuring the quality of results in a retrieval-augmented generation system. A high-quality embedding model has an efficient architecture with a sufficient number of parameters and is trained on a large and diverse corpus of text. For similarity search purposes, a ‘good’ model produces vectors that properly capture semantic meaning; in other words, texts that are meaningfully similar should be close to each other in the vector space under the chosen distance measure. The quality of embedding models is evaluated with benchmarks, for example the Massive Text Embedding Benchmark (MTEB). Most RAG systems use a model pre-trained on a large dataset instead of training one from scratch.
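As a quick sanity check of the “semantically similar texts are close” property, the following sketch (assuming the sentence-transformers package and a small pre-trained model) compares related and unrelated sentences:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L12-v2")   # 384-dimensional embeddings

a = model.encode("The patient was prescribed antibiotics for the infection.")
b = model.encode("The doctor ordered an antibacterial treatment.")
c = model.encode("Quarterly revenue exceeded analyst expectations.")

print(util.cos_sim(a, b))   # relatively high: related medical statements
print(util.cos_sim(a, c))   # noticeably lower: unrelated finance statement
```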
Let’s look at the model landscape:
Proprietary models
Open-source models
Specialized models
Multilingual models
Tuning and Considerations in Technical Detail
Secondary advisory for readers: we’re delving deep into the details here - it’s pretty much the Mariana Trench.
Embedding dimensionality
When choosing an embedding model, it is important to consider the dimensionality of the vectors it produces, for the following reasons:
For example, the embeddings provided by OpenAI (text-embedding-ada-002 model) use 1,536 dimensions. This dimensionality is on the larger side, and using these vectors may incur considerable computational cost. Depending on the use case, a small model (such as all-MiniLM-L12-v2 from Sentence-Transformers, with a dimension of 384) may perform just as well accuracy-wise with a significant reduction in cost and latency.
It is important to recognize that each dimension in an embedding vector captures some nuance of meaning. When considering models for analyzing text from multiple domains, higher dimensionality may be preferable; applications in narrow domains (such as medical) that operate with smaller and/or standardized sets of terms might work well with fewer dimensions.
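A rough back-of-the-envelope comparison of the storage cost of those two dimensionalities, assuming float32 vectors and one million indexed chunks (numbers chosen only for illustration):

```python
# Raw vector storage scales linearly with embedding dimensionality.
num_chunks = 1_000_000
bytes_per_float = 4

for name, dim in [("all-MiniLM-L12-v2", 384), ("text-embedding-ada-002", 1536)]:
    gb = num_chunks * dim * bytes_per_float / 1024**3
    print(f"{name}: {dim} dims -> ~{gb:.2f} GB of raw vectors")
# all-MiniLM-L12-v2: 384 dims -> ~1.43 GB of raw vectors
# text-embedding-ada-002: 1536 dims -> ~5.72 GB of raw vectors
```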
Vector store tuning
Vector databases, and especially vector libraries, expose multiple parameters controlling how the underlying index (most commonly based on HNSW) is built and queried.
The primary purpose of these parameters is to balance quality against performance. Performance can be measured as query latency, the number of queries processed per second, or both. Search quality in approximate nearest neighbor search is measured with recall: the fraction of the returned documents that are true nearest neighbors of the given query vector.
The parameters below impact the search behavior:
The following parameters control index construction:
Search parameters
Both of these parameters can be configured so that many documents are considered when the LLM is instructed to answer the query given the context comprised of those documents. Generally, this is not detrimental to accuracy, since LLMs can be expected to “figure out” which parts of the context are irrelevant.
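As an illustration of how these knobs trade recall against speed, the following sketch measures the recall of an hnswlib index against exact brute-force search on random vectors; all parameter values here are illustrative only.

```python
import numpy as np
import hnswlib

dim, n, k = 128, 20_000, 10
rng = np.random.default_rng(0)
data = np.float32(rng.random((n, dim)))
queries = np.float32(rng.random((100, dim)))

# Exact nearest neighbors by brute force (squared L2 distances).
d2 = (queries**2).sum(1)[:, None] - 2 * queries @ data.T + (data**2).sum(1)[None, :]
exact = np.argsort(d2, axis=1)[:, :k]

# Approximate index; ef_construction and M are build-time parameters.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=100, M=16)
index.add_items(data)

# ef is the query-time parameter: higher ef -> better recall, slower queries.
for ef in (16, 50, 200):
    index.set_ef(ef)
    approx, _ = index.knn_query(queries, k=k)
    recall = np.mean([np.intersect1d(a, e).size / k for a, e in zip(approx, exact)])
    print(f"ef={ef}: recall@{k} = {recall:.3f}")
```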
Conclusion
Managing the speed and cost of responding to increasingly complex queries made against LLMs is a priority for most organizations delivering this capability. Users demand accuracy, most without recognizing the volume of material needed to provide it (why not ingest every known book ever published?) or the resources involved. We believe there are opportunities to meet most of our users’ and customers’ needs AND balance cost and latency by using the retrieval-augmented generation approach and modifying it as necessary to produce the results best suited to the audience making the inquiries. In our next posts, we’ll review parsing and experiment with parsers that efficiently interpret large quantities of complex data in a variety of formats and document structures - an additional strategy to reduce processing costs and accurately convey insights.
Written by: Dmitriy Tkachenko of Proxet - Connect with him here: linkedin.com/in/dmitriy-tkachenko