High Fidelity RAG - Understanding Precision
Complicated AI Subjects in Simple Terms Series
The Basics
(I'll keep this brief, as I assume most of you are familiar with the concept): Retrieval Augmented Generation (RAG) is a process that uses stored data to help an AI answer domain-specific questions, whether that data is proprietary or targeted at a business case. The data is stored as vectors in a database to support similarity search, and the matching results are passed to the AI as "context" designed to answer the user's question.
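To make that flow concrete, here is a minimal sketch of the retrieve-then-answer loop in Python. It uses a toy in-memory store with random vectors purely to show the shape of the process; a real system would use an embeddings model and a vector database instead.

```python
import numpy as np

# Toy in-memory "vector database": (chunk_text, embedding) pairs.
# Random vectors stand in for real embeddings here, purely to show the loop.
DIM = 1536
rng = np.random.default_rng(0)
store = [(f"document chunk {i}", rng.normal(size=DIM)) for i in range(10)]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: the usual scoring function for vector search.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vector: np.ndarray, top_k: int = 3) -> list[str]:
    # Rank stored chunks by similarity to the query and keep the best few.
    ranked = sorted(store, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

# The retrieved chunks become the "context" block in the prompt sent to the LLM.
query_vector = rng.normal(size=DIM)  # stands in for embedding the user's question
context = "\n\n".join(retrieve(query_vector))
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: ..."
```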
Understanding Fidelity
So, there are a few pieces here that contribute to what I call RAG Data Fidelity:
1. The documents themselves
2. The embeddings model
3. The chunk size
To understand fidelity, you have to understand how this data gets stored in the vector database. To ingest documents into the database, you use an embeddings model, and that model has its own fidelity. When I built our internal Hyland Software AI bot, Romanzo, I was on the Ada V2 embeddings model from OpenAI. That wasn't naivety; it was the benchmark back in November of last year. The embeddings model encodes whatever text you pass it into a large vector of 1,536 32-bit floating-point numbers. That is the fidelity of the embeddings model: the precision with which it "encodes" the data it is sent. Cool, so now you understand fidelity, right? Nope. Next concept.
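For illustration, here is roughly what that call looks like with the current OpenAI Python SDK (the input string is just an example; you would pass your own chunk text):

```python
# Requires: pip install openai, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-ada-002",  # Ada V2, the model mentioned above
    input="A chunk of a support document goes here.",
)

vector = response.data[0].embedding
print(len(vector))   # 1536 dimensions for Ada V2
print(vector[:5])    # floating-point values, e.g. [-0.012, 0.034, ...]
```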
Side note: I am moving my entire 1.5 million row vector database to the latest text-embedding-3-large embeddings model over the coming weeks... I know a fidelity gain when I see one!
We still have items 1 and 3 from above to discuss: the documents and the chunk size. To embed your documents into these arrays of numbers, you have to break them up first. There are myriad "chunking strategies" that, I think, sometimes get in the way of an effective system. I am not going to dive into those; I'll focus on some core concepts instead. What I have done with chunking is use a predefined and naive (there's that word again) approach: I break my documents up into 400-"word" chunks with overlap. The overlap ensures that if a paragraph gets split at the 400-word boundary, I still capture its entire context in most cases.
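Here is a minimal sketch of that naive approach. The 50-word overlap is my own illustrative choice; the point is simply that consecutive chunks share their boundary words.

```python
def chunk_words(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking: split on whitespace and slide a
    chunk_size-word window forward by (chunk_size - overlap) words,
    so the tail of one chunk repeats at the head of the next."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: a 1,000-word document becomes three overlapping chunks.
doc = " ".join(f"w{i}" for i in range(1000))
print([len(c.split()) for c in chunk_words(doc)])  # [400, 400, 300]
```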
Now we are getting somewhere. We know how many numbers are in each vector, and we know how much data goes into each "embedded chunk". This is where I think fidelity shows up best. When you embed a chunk of your document, it is represented in a fixed-size vector. So roughly 400 words get represented in 1,536 numbers. Are you seeing it now? If I raise the chunk size, each chunk loses fidelity! Precision is key here. Some people chunk their data into varying lengths. That is debatable and sometimes a good approach, but it makes the fidelity of each individual chunk vary. I wouldn't want to shove the concept of a 1,000-word page into the same size vector space as 400 words; it will obviously lose precision.
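A rough back-of-the-envelope way to see this (not a formal measure, just the intuition): compare how many words each of the 1,536 dimensions has to account for at different chunk sizes.

```python
VECTOR_DIM = 1536

for words_per_chunk in (400, 1000):
    ratio = words_per_chunk / VECTOR_DIM
    print(f"{words_per_chunk} words -> {ratio:.2f} words per dimension")

# 400 words  -> 0.26 words per dimension
# 1000 words -> 0.65 words per dimension
# Same-size vector, more text: each dimension has to summarize more,
# so any individual detail is represented more coarsely.
```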
Say you chunk by page or by paragraph; those vary constantly in length. A long paragraph is represented less precisely because you HAVE to use the same vector size that your database column and your embeddings model allow. Phew, I finally mentioned the DB. Vector search in the database only works on vectors with a rigid, fixed number of dimensions; that is what makes similarity search possible. Not only does the embeddings model produce a fixed-length array of numbers, the database column has that limitation as well.
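To show what that rigidity looks like in practice, here is a sketch assuming a pgvector-style column (the table and column names are my own illustration, not from any specific system): the dimension is declared once, and anything that doesn't match gets rejected.

```python
EXPECTED_DIM = 1536

# pgvector-style DDL: the vector column's dimension is fixed at table creation,
# so every embedding stored in it must be exactly this length.
CREATE_TABLE_SQL = """
CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(1536)
);
"""

def validate_embedding(embedding: list[float]) -> list[float]:
    # The database enforces this at insert time; checking up front gives a
    # clearer error when a chunk was embedded with the wrong model.
    if len(embedding) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM} dimensions, got {len(embedding)}")
    return embedding
```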
To Summarize My Position
Obligatory Counter Point
A podcast about my research: https://soundcloud.com/gabriel-keith-870911066/high-fidelity-rag-unlocking-precision?si=48ef95ad8e5140f38ae903a1c2bc1a1b&utm_source=clipboard&utm_medium=text&utm_campaign=social_sharing