
10 Reasons Why Chunking for AI or RAG is Hard (with examples)

Raja Rao DV

CEO of a stealth startup Former VP of Growth @ Redis, Semgrep, Applitools; Ex Salesforce, VMware and Yahoo!

Published: August 18, 2024

“Chunking” refers to breaking down documents into bite-sized chunks so LLMs can retrieve precise and high-quality information. However, it's really hard.

But first, let's understand why we need chunking in the first place.

Please note that to make things easier, I've created screenshots of the text with highlights to illustrate the chunks and their issues for each section. If you are reading the screenshots, you can skip most of the text.

OK, let’s take an example to understand this. Let’s say you have a book with 3 chapters that you want to feed to AI and then ask questions about it. For now, let’s assume that each chapter has just one paragraph for simplicity.

Chapter 1: Dogs "Dogs are loyal and friendly pets. They come in various breeds, such as Labrador Retrievers, German Shepherds, and Beagles. Training is important for dogs to ensure they behave well. Regular health check-ups are necessary to keep them healthy."

Chapter 2: Cats "Cats are independent and curious animals. Common breeds include Siamese, Persian, and Maine Coon. Cats often prefer to be left alone but enjoy playing with toys. They are usually litter trained, and it's important to monitor their diet for health."

Chapter 3: Birds "Birds are colorful and social creatures. Popular pet birds include Parakeets, Canaries, and Cockatiels. They need a balanced diet of seeds and fruits. Birds also require a clean cage and regular interaction to stay happy and healthy."

Let’s say we fed the whole thing as-is to the AI, i.e., the whole book is just one chunk.

Improper Chunk: "Dogs are loyal and friendly pets. They come in various breeds, such as Labrador Retrievers, German Shepherds, and Beagles. Training is important for dogs to ensure they behave well. Regular health check-ups are necessary to keep them healthy. Cats are independent and curious animals. Common breeds include Siamese, Persian, and Maine Coon. Cats often prefer to be left alone but enjoy playing with toys. They are usually litter trained, and it's important to monitor their diet for health. Birds are colorful and social creatures. Popular pet birds include Parakeets, Canaries, and Cockatiels. They need a balanced diet of seeds and fruits. Birds also require a clean cage and regular interaction to stay happy and healthy."

Let’s ask our AI a question about dogs.

Question:

"Tell me something about dogs."

AI’s Potential Answer with Improper Chunking:

"Dogs are loyal and friendly pets. They come in various breeds, such as Labrador Retrievers, German Shepherds, and Beagles. Training is important for dogs to ensure they behave well. Cats are independent and curious animals. Popular pet birds include Parakeets, Canaries, and Cockatiels."

Problem:

Mixed Information: The AI provides information about dogs but also accidentally includes details about cats and birds, leading to a confusing answer.

But for our example, if we were to break it up into 3 chunks, one for each chapter, it'd have worked (at least for our example and our specific question).

AI returning correct answer to our question

Again, this only works for this specific question and assumes that the chapter only talks about one animal. For example, if the question was "Tell me the differences between cats and dogs", then this chapter-wise chunking may not have worked.

This is why chunking is hard. You really need to look into how your document is written, and anticipate various questions to ensure you use the right strategies.
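To make the chapter-wise idea concrete, here is a minimal Python sketch. The `retrieve` helper and its keyword-overlap scoring are assumptions made up for this illustration (real RAG systems usually rank chunks by embedding similarity), and the chapter text is shortened from the example above.

```python
import re

# One chunk per chapter, using shortened text from the example book.
chunks = {
    "Dogs": "Dogs are loyal and friendly pets. Training is important for dogs to ensure they behave well.",
    "Cats": "Cats are independent and curious animals. Cats often prefer to be left alone but enjoy playing with toys.",
    "Birds": "Birds are colorful and social creatures. They need a balanced diet of seeds and fruits.",
}

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, chunks):
    """Return the title of the chunk sharing the most words with the question.

    Hypothetical scoring for illustration only: plain word overlap.
    """
    q = tokenize(question)
    return max(chunks, key=lambda title: len(q & tokenize(chunks[title])))

best = retrieve("Tell me something about dogs.", chunks)
print(best, "->", chunks[best])
```

Because this sketch returns only the single best chunk, a comparison question about cats and dogs would still get just one chapter back, which is the limitation described above.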

Let’s look at the top 10 challenges with chunking.

1. Context Preservation

Context preservation is a significant challenge in chunking for AI because it involves breaking down large volumes of information into smaller, manageable parts, which can lead to the loss of overall context.

When information is segmented into smaller chunks, the connections between these parts may become obscured, leading to misunderstandings or misinterpretations.

Let’s take a paragraph of text and see how improper chunking can result in an incorrect response.

Original Paragraph:

"Greenhouse gasses, such as carbon dioxide and methane, trap heat in the Earth's atmosphere, leading to a warming effect known as global warming. This warming effect is responsible for melting polar ice caps, rising sea levels, and more extreme weather patterns around the world."

Suppose we use fixed-size chunking, and the paragraph is split into two chunks like this:

Improper Chunking Example:

Chunk 1: "Greenhouse gasses, such as carbon dioxide and methane, trap heat in the Earth's atmosphere, leading to a warming effect known as global warming."

Chunk 2: "This warming effect is responsible for melting polar ice caps, rising sea levels, and more extreme weather patterns around the world."

Impact of Improper Chunking

Now, imagine you ask the AI the following question:

AI Question: "How do greenhouse gasses contribute to extreme weather patterns?"

What the AI Might Do:

If the AI only looks at Chunk 1: The AI might explain that greenhouse gasses trap heat but won’t connect this to the extreme weather patterns because that information is in Chunk 2.
If the AI only looks at Chunk 2: The AI might talk about extreme weather patterns and rising sea levels but might not explain that these effects are due to greenhouse gasses.

Result: The AI’s response could be incomplete or confusing because it’s missing the connection between the greenhouse gasses and their impact on the climate.

A common approach is fixed-size chunking, which breaks text into uniformly sized pieces based on a predefined number of characters, words, or tokens. It's simple to implement and computationally efficient, but it can cut across important semantic boundaries because the split points are set by an arbitrary count rather than by meaning.
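Here is a minimal sketch of fixed-size chunking by word count, applied to the greenhouse paragraph. The function name `fixed_size_chunks` and the size of 15 words are arbitrary choices for illustration.

```python
def fixed_size_chunks(text, size):
    """Split text into chunks of at most `size` words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

text = (
    "Greenhouse gasses, such as carbon dioxide and methane, trap heat in the "
    "Earth's atmosphere, leading to a warming effect known as global warming. "
    "This warming effect is responsible for melting polar ice caps, rising sea "
    "levels, and more extreme weather patterns around the world."
)

fixed = fixed_size_chunks(text, 15)
for chunk in fixed:
    print(repr(chunk))
```

With 15 words per chunk, the first chunk stops in the middle of the first sentence, so no chunk holds a complete thought about cause and effect.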

By contrast, unstructured.io, one of the leaders in this space, performs chunking using logical, contextual boundaries, a technique known as Smart Chunking. This allows more relevant segments of data to be retrieved and passed as context to the LLM. Check out this blog post to understand the various strategies Unstructured uses to arrive at Smart Chunking.

2. Handling Diverse Document Structures

Handling diverse document structures is a problem in chunking for AI because different types of documents require different chunking strategies to maintain their semantic integrity and usability.

Original Document (contains both the following text and the table)

Text: "Our marketing team implemented three key strategies this year to increase customer engagement and sales. The first strategy focused on social media campaigns, the second on email marketing, and the third on influencer partnerships. Below is a table showing the success rates of each strategy in terms of conversion rates."

This table is part of the original document

Improper Chunking Example

Let’s assume fixed-size chunking splits the document into the following two chunks:

Chunk 1 (Text Only): "Our marketing team implemented three key strategies this year to increase customer engagement and sales. The first strategy focused on social media campaigns, the second on email marketing, and the third on influencer partnerships."

Chunk 2 (Table Only): This is the table's data

Let’s ask the AI a question: "Which marketing strategy was most successful?"

AI’s Potential Answer with Improper Chunking:

If the AI retrieves only Chunk 1: "The first strategy focused on social media campaigns, the second on email marketing, and the third on influencer partnerships."
If the AI retrieves only Chunk 2: "The highest conversion rate is 20%."

As an example, this Unstructured blogpost notes, “The effectiveness of RAG architectures are closely tied to how well models can retrieve information relevant to a prompt, that is stored in an external database. As RAG has become more widely adopted, developers have typically treated documents as a stream of text, and not accounted for the nuanced relationships between different types of elements such as titles, tables, and body text. We have found, however, that the performance of these architectures on information retrieval and Q&A generation with more sophisticated document preprocessing.”
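One way to picture element-aware chunking is to keep a table in the same chunk as the text that introduces it. The sketch below assumes the document has already been parsed into (type, content) pairs. That element list, the `chunk_keeping_tables_with_intro` helper, and the table placeholder are all hypothetical, and this is not Unstructured's actual API.

```python
# Hypothetical parsed document: a list of (element_type, content) pairs.
# The table content is a placeholder; the real table is not reproduced here.
elements = [
    ("text", "Our marketing team implemented three key strategies this year "
             "to increase customer engagement and sales."),
    ("text", "Below is a table showing the success rates of each strategy "
             "in terms of conversion rates."),
    ("table", "<conversion-rate table>"),
]

def chunk_keeping_tables_with_intro(elements):
    """Attach each table to the text element directly before it."""
    chunks = []
    for kind, content in elements:
        if kind == "table" and chunks:
            # Merge the table into the preceding chunk so they are retrieved together.
            chunks[-1] = chunks[-1] + "\n" + content
        else:
            chunks.append(content)
    return chunks

element_chunks = chunk_keeping_tables_with_intro(elements)
for chunk in element_chunks:
    print(repr(chunk))
```

The table now travels with the sentence that says what its numbers mean, so a retrieved chunk can answer the "most successful" question without guessing.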

3. Balancing Chunk Sizes

Balancing chunk sizes in chunking for AI is a complex problem due to several factors that need to be considered to optimize the performance and accuracy of AI applications.

Imagine a legal document discussing the terms and conditions of a contract. Here’s an excerpt that spans multiple paragraphs.

Original Legal Document Excerpt:

Paragraph 1: "The parties agree that the seller will deliver the goods to the buyer within 30 days of the purchase date. The seller shall bear all costs associated with the delivery. In the event of any delays, the buyer reserves the right to cancel the contract."

Paragraph 2: "Furthermore, the buyer is entitled to a full refund if the goods are not delivered in the condition agreed upon in this contract. The seller must notify the buyer of any potential issues with the delivery at least 5 days in advance."

Paragraph 3: "Any disputes arising from this contract shall be resolved through arbitration, and both parties agree to abide by the decision of the arbitrator. The arbitration process shall take place in the jurisdiction where the buyer is located."

Let's walk through a specific example to illustrate the challenge of balancing chunk sizes in the context of a legal document.

Improper Chunking Example

Scenario 1: Chunking Too Small

Let’s say we chunk the document into very small pieces, such as by sentence:

Chunk 1: "The parties agree that the seller will deliver the goods to the buyer within 30 days of the purchase date."
Chunk 2: "The seller shall bear all costs associated with the delivery."
Chunk 3: "In the event of any delays, the buyer reserves the right to cancel the contract."
Chunk 4: "Furthermore, the buyer is entitled to a full refund if the goods are not delivered in the condition agreed upon in this contract."

Impact of Chunking Too Small

Lack of Context: If the AI retrieves only Chunk 3 ("In the event of any delays, the buyer reserves the right to cancel the contract."), it might miss the conditions under which this cancellation can occur, leading to an incomplete understanding.

Fragmented Information: The AI may struggle to connect the small pieces of information, especially if it retrieves non-consecutive chunks. For example, retrieving Chunk 4 (about the refund) without Chunk 1 (delivery timing) loses the relationship between delivery issues and the right to a refund.

Scenario 2: Chunking Too Large

Now, let’s chunk the document into one large piece:

Chunk 1: "The parties agree that the seller will deliver the goods to the buyer within 30 days of the purchase date. The seller shall bear all costs associated with the delivery. In the event of any delays, the buyer reserves the right to cancel the contract. Furthermore, the buyer is entitled to a full refund if the goods are not delivered in the condition agreed upon in this contract. The seller must notify the buyer of any potential issues with the delivery at least 5 days in advance. Any disputes arising from this contract shall be resolved through arbitration, and both parties agree to abide by the decision of the arbitrator. The arbitration process shall take place in the jurisdiction where the buyer is located."

Impact of Chunking Too Large

Information Overload: The AI might struggle to process this entire chunk due to its length. If the AI has memory or token limits, it may truncate the chunk, potentially losing critical information at the end.
Reduced Precision: If the AI retrieves this large chunk in response to a specific query (e.g., "What are the arbitration terms?"), it may also retrieve irrelevant information about delivery and refunds, reducing the precision of the response.
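A middle ground between the two scenarios is to pack whole sentences into chunks up to a word budget. The sketch below is one simple way to do that. The `pack_sentences` helper, the regex-based sentence splitter, and the budget values are assumptions for illustration, not a recommendation for legal text.

```python
import re

def pack_sentences(text, max_words):
    """Greedily group whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` still becomes its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

paragraph_1 = (
    "The parties agree that the seller will deliver the goods to the buyer "
    "within 30 days of the purchase date. The seller shall bear all costs "
    "associated with the delivery. In the event of any delays, the buyer "
    "reserves the right to cancel the contract."
)

for budget in (1, 30, 100):
    print(budget, len(pack_sentences(paragraph_1, budget)))
```

A budget of 1 reproduces sentence-level chunking, a very large budget reproduces the single big chunk, and values in between trade context against precision.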

4. Semantic Coherence

Semantic coherence is a problem in chunking for AI because it involves ensuring that each chunk of text maintains a meaningful and contextually relevant segment of the original document.

Original News Article Excerpt:

"A major storm hit the city yesterday, causing widespread damage. The storm knocked down power lines, leaving thousands without electricity. In response, the local government has declared a state of emergency and is working to restore power as quickly as possible."

Improper Chunking Example:

Suppose we use fixed-size chunking that splits the article into two chunks based on word count:

Chunk 1: "A major storm hit the city yesterday, causing widespread damage. The storm knocked down power lines, leaving thousands without electricity."

Chunk 2: "In response, the local government has declared a state of emergency and is working to restore power as quickly as possible."

Question:

"What actions did the local government take in response to the storm?"

Potential AI Response with Improper Chunking:

If the AI only retrieves Chunk 1: The AI might say, "The storm knocked down power lines, leaving thousands without electricity," but it won’t mention what the government did in response because that information is in Chunk 2.

If the AI only retrieves Chunk 2: The AI might say, "The local government declared a state of emergency and is working to restore power," but it might not clearly connect this action to the storm because the storm is only described in Chunk 1.

Read this blog post to learn more.

5. Overlapping of Information

Overlapping of information is a problem in chunking for AI because it can lead to inefficiencies and inaccuracies in data processing and retrieval. Overlapping chunks may contain repeated information, which can lead to redundancy. This redundancy can increase storage requirements and computational costs, as the same information is processed multiple times. Overlapping information can create confusion in maintaining the context of the data. While some overlap might be necessary to preserve context across chunks, finding the right balance is crucial.

Let's consider an example involving overlapping chunks of information:

Original Text:

"A car engine works by converting fuel into energy. The fuel is ignited in the engine's cylinders, causing small explosions that push the pistons. This movement of the pistons turns the crankshaft, which ultimately powers the car's wheels."

Overlapping Chunking Example:

To maintain context, you might create two overlapping chunks:

Chunk 1: "A car engine works by converting fuel into energy. The fuel is ignited in the engine's cylinders, causing small explosions that push the pistons."

Chunk 2: "The fuel is ignited in the engine's cylinders, causing small explosions that push the pistons. This movement of the pistons turns the crankshaft, which ultimately powers the car's wheels."

Duplication Issue

Duplicated Sentence: "The fuel is ignited in the engine's cylinders, causing small explosions that push the pistons." appears at the end of Chunk 1 and the beginning of Chunk 2.
Impact: This duplication helps maintain context between chunks, ensuring that the AI understands how the ignition of fuel leads to the movement of the pistons and eventually powers the car. However, it also means that the same information is stored twice, which increases the data size and processing time.

Why It’s a Problem

Increased Data Size: The duplicated sentence means you're storing more data than necessary, which can be inefficient, especially in large documents.
Slower Processing: The AI may take longer to process and retrieve information because it has to handle more data, even though much of it is redundant.
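The trade-off can be measured directly. Below is a minimal sketch of sliding-window chunking over sentences that also reports how much extra text gets stored. The `overlapping_chunks` helper and its `window` and `overlap` parameters are names chosen for this illustration.

```python
import re

def overlapping_chunks(sentences, window, overlap):
    """Slide a window of `window` sentences, repeating `overlap` sentences between neighbors."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks

text = (
    "A car engine works by converting fuel into energy. "
    "The fuel is ignited in the engine's cylinders, causing small explosions that push the pistons. "
    "This movement of the pistons turns the crankshaft, which ultimately powers the car's wheels."
)
sentences = re.split(r"(?<=\.)\s+", text)
overlapped = overlapping_chunks(sentences, window=2, overlap=1)

stored = sum(len(c) for c in overlapped)
overhead = stored / len(text) - 1
print(len(overlapped), f"chunks, {overhead:.0%} extra characters stored")
```

With a window of two sentences and an overlap of one, this reproduces Chunk 1 and Chunk 2 from the example, and the extra characters are exactly the duplicated middle sentence.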

6. Boundary Detection

Boundary detection is a problem in chunking for AI due to several challenges associated with accurately identifying where to split text while maintaining semantic coherence and relevance.

For example:

Original Recipe Instructions: "To make the cake, first mix the flour, sugar, and eggs in a large bowl until smooth. Then, pour the mixture into a baking pan and bake at 350°F for 30 minutes."

Improper Chunking Example:

Imagine the instructions are split into two chunks based on a fixed word count:

Chunk 1: "To make the cake, first mix the flour, sugar, and eggs in a large bowl until smooth."

Chunk 2: "Then, pour the mixture into a baking pan and bake at 350°F for 30 minutes."

Question to AI:

"How do I bake the cake?"

AI’s Potential Answer with Improper Chunking:

If the AI retrieves only Chunk 1: "To bake the cake, first mix the flour, sugar, and eggs in a large bowl until smooth."

If the AI retrieves only Chunk 2: "Pour the mixture into a baking pan and bake at 350°F for 30 minutes."
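Even finding sentence boundaries is less trivial than it looks. The sketch below uses a naive regex splitter. It handles the recipe as written, but the second input, a made-up variant with an abbreviation that is not in the original recipe, shows how easily a boundary lands in the wrong place.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' followed by whitespace. Abbreviations will fool it."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

recipe = (
    "To make the cake, first mix the flour, sugar, and eggs in a large bowl until smooth. "
    "Then, pour the mixture into a baking pan and bake at 350°F for 30 minutes."
)
recipe_parts = naive_sentence_split(recipe)
print(recipe_parts)

# Hypothetical variant with an abbreviation: the splitter cuts after "approx."
tricky = "Bake at 350°F for approx. 30 minutes. Let the cake cool before serving."
tricky_parts = naive_sentence_split(tricky)
print(tricky_parts)
```

In the second case the baking time is separated from the temperature, so a chunker built on this splitter could hand the AI "30 minutes." with nothing around it.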

7. Handling Non-Textual Data

Handling non-textual data is a problem in chunking for AI due to several challenges related to the diverse nature and structure of such data. Non-textual data includes images, audio, video, and other multimedia formats, each with its own structure and characteristics. Chunking strategies that work for text may not be directly applicable to these formats, requiring specialized approaches to effectively segment and process them.

For example:

Text: "A recent study investigated the impact of different diets on weight loss. Participants followed either a low-carb diet or a low-fat diet for six months. The table below shows the average weight loss for each group."

Table:

Improper Chunking Example:

Imagine the text and the table are chunked separately:

Chunk 1 (Text Only): "A recent study investigated the impact of different diets on weight loss. Participants followed either a low-carb diet or a low-fat diet for six months."

Chunk 2 (Table Only): Table data

Question to AI:

"How effective was the low-carb diet in the study?"

AI’s Potential Answer with Improper Chunking:

If the AI retrieves only Chunk 1: "The study investigated the impact of different diets on weight loss, including a low-carb diet." (But it doesn’t provide the specific results.)

If the AI retrieves only Chunk 2: "The average weight loss for the low-carb diet was 10 kg." (But it doesn’t explain that this data is from a study comparing different diets.)
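One possible mitigation is to turn each table row into a sentence that carries the table's context with it. The sketch below is an assumption-heavy illustration. The `table_rows_to_chunks` helper, the caption wording, and the column names are invented here, and only the low-carb value quoted above is included because the rest of the table is not reproduced.

```python
def table_rows_to_chunks(caption, header, rows):
    """Turn each table row into a self-describing sentence prefixed with the caption."""
    chunks = []
    for row in rows:
        facts = ", ".join(f"{name}: {value}" for name, value in zip(header, row))
        chunks.append(f"{caption} {facts}.")
    return chunks

caption = "Study of diets and weight loss over six months."   # assumed caption
header = ["Diet", "Average weight loss"]                       # assumed column names
rows = [["Low-carb", "10 kg"]]  # only the value quoted in the article; other rows omitted

row_chunks = table_rows_to_chunks(caption, header, rows)
for chunk in row_chunks:
    print(chunk)
```

Each resulting chunk can be retrieved on its own and still says which study and which diet the number belongs to.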

8. Impact on Retrieval Performance

Suppose you’re searching for information about a company’s financial performance in a large annual report. If the report was chunked poorly, you might get a chunk that mentions revenue but not the context or explanation of why it changed. This leads to incomplete answers and makes the search less effective. Here's an example of how this might look:

Improper Chunking Example:

Chunk 1: "In 2023, the company's revenue increased by 15% compared to the previous year."

Chunk 2: "The strategic initiatives implemented in the second quarter played a significant role in driving this growth. These initiatives included expanding into new markets, launching innovative products, and enhancing customer engagement through digital platforms."

Question: "How did Company X perform during financial year 2023?"

AI’s Potential Answer with Improper Chunking:

In this example, if you only have access to Chunk 1, you would know that the revenue increased by 15% in 2023, but you wouldn't understand why this change occurred. The context and explanation for the revenue increase are provided in Chunk 2, which discusses strategic initiatives like market expansion and product launches. Without both chunks, your understanding of the company's financial performance is incomplete.
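A common workaround, which this post does not cover in detail, is to return the neighbors of a retrieved chunk along with the hit itself. The sketch below assumes chunks are stored in document order. The `with_neighbors` helper and its `radius` parameter are hypothetical names.

```python
def with_neighbors(chunks, hit_index, radius=1):
    """Return the retrieved chunk together with up to `radius` chunks on each side."""
    start = max(0, hit_index - radius)
    end = min(len(chunks), hit_index + radius + 1)
    return " ".join(chunks[start:end])

report_chunks = [
    "In 2023, the company's revenue increased by 15% compared to the previous year.",
    "The strategic initiatives implemented in the second quarter played a "
    "significant role in driving this growth.",
]

# Suppose retrieval matched only the revenue chunk (index 0).
context = with_neighbors(report_chunks, hit_index=0)
print(context)
```

The cost is a longer context per query, so this trades some precision and token budget for completeness.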

9. Balancing Granularity and Generalization

Balancing granularity and generalization in chunking for AI is a problem due to the need to optimize the trade-off between detailed segmentation and maintaining broader context. Achieving the right level of granularity is crucial to preserving the semantic integrity of the data. If chunks are too granular, they may lose important contextual information, making it difficult for AI systems to understand the broader meaning. Conversely, if chunks are too generalized, they may include irrelevant information, reducing the precision of the AI's output.

For example, imagine an article that explains how to set up Wi-Fi. Let's see how the two chunking choices affect the answer:

Option 1: Detailed Task-Based Chunking

Option 2: Section-Based Chunking

Example Question:

"How do I connect my smartphone to a Wi-Fi network?"

AI's Potential Answer (Using Detailed Task-Based Chunking):

"1. Open the Settings app on your smartphone. 2. Tap on 'Wi-Fi' from the list of options. 3. Ensure Wi-Fi is turned on by toggling the switch. 4. Select your desired Wi-Fi network from the list. 5. Enter the network password if prompted. 6. Tap 'Connect.'"

AI's Potential Answer (Using Section-Based Chunking):

"To connect to a Wi-Fi network, go to the 'Settings' section of your smartphone. Under 'Wi-Fi Settings,' make sure Wi-Fi is turned on. Select the desired network from the list of available networks, enter the password if required, and tap 'Connect.' You can also adjust other network-related settings like VPN and proxy settings in this section. For Bluetooth, display, or sound settings, navigate to the respective sections."

Key Difference:

The task-based chunking answer provides a clear, step-by-step process focused solely on connecting to Wi-Fi, making it easier for the user to follow and accomplish the specific task.
The section-based chunking answer includes broader context and related settings, which could overwhelm a user looking for a simple Wi-Fi connection.

10. Increased Processing Costs

Inefficient chunking can lead to increased computational costs, as more chunks may need to be processed to cover the same amount of information. On the other hand, advanced chunking strategies, such as those based on semantic content or topic modeling, can also be computationally expensive and slow.

Simple Chunking (Sentence-Based): Inexpensive to implement, with a low processing cost. It's often used for unstructured documents and is suitable for a wide range of text-based documents. The processing cost of sentence-based chunking is effectively zero.

Semantic / Smart Chunking: Semantic chunking methods, such as semantic percentile-based and double-pass merging, incur higher costs due to their complexity and computational requirements. The cost can be orders of magnitude higher than for simple chunking, especially when using sophisticated models like GPT-4.
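To see where the extra cost comes from, here is a toy sketch of threshold-based semantic chunking. It is a simplified stand-in, not the percentile-based or double-pass methods named above. `toy_embed` is a bag-of-words substitute for a real embedding model, and the threshold of 0.1 is an arbitrary assumption. The point is the call counter: every sentence needs one model call before any chunk exists, while sentence splitting needs none.

```python
import math
import re
from collections import Counter

embed_calls = 0

def toy_embed(sentence):
    """Stand-in for an embedding model: a bag-of-words vector. Counts each call."""
    global embed_calls
    embed_calls += 1
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold):
    """Start a new chunk whenever adjacent sentences fall below `threshold` similarity."""
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])
        else:
            groups[-1].append(sentence)
    return [" ".join(g) for g in groups]

sentences = [
    "Dogs are loyal and friendly pets.",
    "Training is important for dogs to ensure they behave well.",
    "Birds are colorful and social creatures.",
    "Birds also require a clean cage and regular interaction.",
]
semantic = semantic_chunks(sentences, threshold=0.1)
print(len(semantic), "chunks,", embed_calls, "embedding calls")
```

With a hosted embedding model in place of `toy_embed`, each of those calls costs time and money, and the count grows with the number of sentences in the corpus.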

Unstructured’s Smart Chunking takes into account the semantic structure and content of the documents, offering several strategies which differ in how they guarantee the purity of content within chunks.

As their blogpost concludes, “Chunking is one of the essential preprocessing steps in any RAG system. The choices you make when you set it up, will influence the retrieval quality, and as a consequence, the overall performance of the system.”

Hopefully, you found this post useful. I encourage you to read and follow unstructured.io - one of the leaders in this space.

#UnstructuredData #DataManagement #RAGModel #DataRetrieval #AIandData #DataInnovation #NaturalLanguageProcessing #AIAutomation #KnowledgeManagement #DataScience #AIResearch #DataDrivenAI #TextMining #SemanticSearch #AIModels #DataTransformation #InformationRetrieval #MachineLearning #DataStrategy #AIBlog

Christopher Maddock Drew Messersmith Andrew Zane Brian S. Raymond
