10 Reasons Why Chunking for AI or RAG is Hard (with examples)
Raja Rao DV
CEO of a stealth startup. Former VP of Growth @ Redis, Semgrep, Applitools; Ex Salesforce, VMware and Yahoo!
“Chunking” refers to breaking down documents into bite-sized chunks so LLMs can retrieve precise and high-quality information. However, it's really hard.
But first, let's understand why we even need chunking in the first place.
Please note that to make things easier, I've created screenshots of the text with highlights to illustrate the chunks and their issues for each section. If you are reading the screenshots, you can skip most of the text.
OK, let’s take an example to understand this. Let’s say you have a book with 3 chapters that you want to feed to AI and then ask questions about it. For now, let’s assume that each chapter has just one paragraph for simplicity.
Chapter 1: Dogs "Dogs are loyal and friendly pets. They come in various breeds, such as Labrador Retrievers, German Shepherds, and Beagles. Training is important for dogs to ensure they behave well. Regular health check-ups are necessary to keep them healthy."
Chapter 2: Cats "Cats are independent and curious animals. Common breeds include Siamese, Persian, and Maine Coon. Cats often prefer to be left alone but enjoy playing with toys. They are usually litter trained, and it's important to monitor their diet for health."
Chapter 3: Birds "Birds are colorful and social creatures. Popular pet birds include Parakeets, Canaries, and Cockatiels. They need a balanced diet of seeds and fruits. Birds also require a clean cage and regular interaction to stay happy and healthy."
Let’s say we fed the whole thing as-is to the AI, i.e., the whole book is just one chunk.
Improper Chunk: "Dogs are loyal and friendly pets. They come in various breeds, such as Labrador Retrievers, German Shepherds, and Beagles. Training is important for dogs to ensure they behave well. Regular health check-ups are necessary to keep them healthy. Cats are independent and curious animals. Common breeds include Siamese, Persian, and Maine Coon. Cats often prefer to be left alone but enjoy playing with toys. They are usually litter trained, and it's important to monitor their diet for health. Birds are colorful and social creatures. Popular pet birds include Parakeets, Canaries, and Cockatiels. They need a balanced diet of seeds and fruits. Birds also require a clean cage and regular interaction to stay happy and healthy."
Let’s ask a question to our AI “tell me something about dogs”
Question:
"Tell me something about dogs."
AI’s Potential Answer with Improper Chunking:
"Dogs are loyal and friendly pets. They come in various breeds, such as Labrador Retrievers, German Shepherds, and Beagles. Training is important for dogs to ensure they behave well. Cats are independent and curious animals. Popular pet birds include Parakeets, Canaries, and Cockatiels."
Problem:
But for our example, if we were to break it up into 3 chunks, one for each chapter, it'd have worked (at least for our example and our specific question).
Again, this only works for this specific question and assumes that the chapter only talks about one animal. For example, if the question was "Tell me the differences between cats and dogs", then this chapter-wise chunking may not have worked.
This is why chunking is hard. You really need to look into how your document is written, and anticipate various questions to ensure you use the right strategies.
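To make the chapter-wise split concrete, here is a minimal sketch in Python. It assumes every chapter starts with a "Chapter N:" heading, and it uses a shortened version of the book above. The function name chunk_by_chapter is made up for this illustration and does not come from any library.

```python
import re

# Shortened version of the three-chapter book from the example above.
book = (
    'Chapter 1: Dogs "Dogs are loyal and friendly pets." '
    'Chapter 2: Cats "Cats are independent and curious animals." '
    'Chapter 3: Birds "Birds are colorful and social creatures."'
)


def chunk_by_chapter(text: str) -> list[str]:
    """Return one chunk per 'Chapter N:' heading (hypothetical helper)."""
    # The zero-width lookahead splits *before* each heading,
    # so every heading stays attached to its own chapter text.
    parts = re.split(r"(?=Chapter \d+:)", text)
    return [part.strip() for part in parts if part.strip()]


chunks = chunk_by_chapter(book)

# A question about dogs now only needs the chunk that mentions dogs.
dog_chunks = [c for c in chunks if "dogs" in c.lower()]
for chunk in dog_chunks:
    print(chunk)
```

This only works because the book is neatly organized by animal. A question that spans chapters, like the cats versus dogs comparison, would still need more than one chunk.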
Let’s look at the top 10 challenges with Chunking.
1. Context Preservation
Context preservation is a significant challenge in chunking for AI because it involves breaking down large volumes of information into smaller, manageable parts, which can lead to the loss of overall context.
When information is segmented into smaller chunks, the connections between these parts may become obscured, leading to misunderstandings or misinterpretations.
Let’s take an example of a paragraph of text and see how improper chunking can result in an incorrect response.
Original Paragraph:
"Greenhouse gasses, such as carbon dioxide and methane, trap heat in the Earth's atmosphere, leading to a warming effect known as global warming. This warming effect is responsible for melting polar ice caps, rising sea levels, and more extreme weather patterns around the world."
Suppose we use fixed-size chunking, and the paragraph is split into two chunks like this:
Improper Chunking Example:
Chunk 1: "Greenhouse gasses, such as carbon dioxide and methane, trap heat in the Earth's atmosphere, leading to a warming effect known as global warming."
Chunk 2: "This warming effect is responsible for melting polar ice caps, rising sea levels, and more extreme weather patterns around the world."
Impact of Improper Chunking
Now, imagine you ask the AI the following question:
AI Question: "How do greenhouse gasses contribute to extreme weather patterns?"
What the AI Might Do:
Result: The AI’s response could be incomplete or confusing because it’s missing the connection between the greenhouse gasses and their impact on the climate.
A common approach is fixed-size chunking, which breaks text into uniformly sized pieces based on a predefined number of characters, words, or tokens. It's simple to implement and computationally efficient, but it can cut across important semantic boundaries because the split points are set by an arbitrary count rather than by meaning.
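Here is a minimal sketch of character-based fixed-size chunking, applied to the greenhouse paragraph above. The function name fixed_size_chunks and the size of 100 characters are assumptions picked for illustration.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut text into pieces of at most `size` characters, ignoring meaning."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


paragraph = (
    "Greenhouse gasses, such as carbon dioxide and methane, trap heat in "
    "the Earth's atmosphere, leading to a warming effect known as global "
    "warming. This warming effect is responsible for melting polar ice "
    "caps, rising sea levels, and more extreme weather patterns around "
    "the world."
)

chunks = fixed_size_chunks(paragraph, size=100)

# repr() makes it easy to see where each chunk was cut.
for number, chunk in enumerate(chunks, start=1):
    print(number, repr(chunk))
```

With this size, the first cut lands in the middle of the first sentence, so neither of the first two chunks holds a complete thought on its own.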
As an example:
unstructured.io, one of the leaders in this space, performs chunking using logical, contextual boundaries, a technique known as Smart Chunking. This allows more relevant segments of data to be retrieved and passed as context to the LLM. Check out this blog post to understand the various strategies Unstructured uses to arrive at Smart Chunking.
2. Handling Diverse Document Structures
Handling diverse document structures is a problem in chunking for AI because different types of documents require different chunking strategies to maintain their semantic integrity and usability.
Original Document (contains both the following text and the table)
Text: "Our marketing team implemented three key strategies this year to increase customer engagement and sales. The first strategy focused on social media campaigns, the second on email marketing, and the third on influencer partnerships. Below is a table showing the success rates of each strategy in terms of conversion rates."
Improper Chunking Example
Let’s assume fixed-size chunking splits the document into the following two chunks:
Chunk 1 (Text Only): "Our marketing team implemented three key strategies this year to increase customer engagement and sales. The first strategy focused on social media campaigns, the second on email marketing, and the third on influencer partnerships."
Chunk 2 (Table Only): This is the table's data
Let’s ask the AI a question: "Which marketing strategy was most successful?"
AI’s Potential Answer with Improper Chunking:
As an example, this Unstructured blogpost notes, “The effectiveness of RAG architectures are closely tied to how well models can retrieve information relevant to a prompt, that is stored in an external database. As RAG has become more widely adopted, developers have typically treated documents as a stream of text, and not accounted for the nuanced relationships between different types of elements such as titles, tables, and body text. We have found, however, that the performance of these architectures on information retrieval and Q&A generation with more sophisticated document preprocessing.”
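One simple way to respect document elements is to keep a table in the same chunk as the text that introduces it. The sketch below assumes the document has already been parsed into typed elements. The Element class and the function name are hypothetical and are not the Unstructured API. The table content is a placeholder because the original table is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A parsed document element (hypothetical type for this sketch)."""
    kind: str     # "text" or "table"
    content: str


def chunk_keeping_tables_with_context(elements: list[Element]) -> list[str]:
    """Attach each table to the chunk of the text that precedes it."""
    chunks: list[str] = []
    for element in elements:
        if element.kind == "table" and chunks:
            # Merge the table into the previous chunk instead of isolating it.
            chunks[-1] = chunks[-1] + "\n" + element.content
        else:
            chunks.append(element.content)
    return chunks


document = [
    Element("text", "Our marketing team implemented three key strategies this "
                    "year: social media campaigns, email marketing, and "
                    "influencer partnerships. Below is a table showing the "
                    "conversion rate of each strategy."),
    # Placeholder: the real table rows are not part of this example.
    Element("table", "[table: strategy vs. conversion rate]"),
]

chunks = chunk_keeping_tables_with_context(document)
print(len(chunks))
```

Now a question about the most successful strategy retrieves the strategy names and the table together, instead of one without the other.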
3. Balancing Chunk Sizes
Balancing chunk sizes in chunking for AI is a complex problem due to several factors that need to be considered to optimize the performance and accuracy of AI applications.
Imagine a legal document discussing the terms and conditions of a contract. Here’s an excerpt that spans multiple paragraphs.
Original Legal Document Excerpt:
Paragraph 1: "The parties agree that the seller will deliver the goods to the buyer within 30 days of the purchase date. The seller shall bear all costs associated with the delivery. In the event of any delays, the buyer reserves the right to cancel the contract."
Paragraph 2: "Furthermore, the buyer is entitled to a full refund if the goods are not delivered in the condition agreed upon in this contract. The seller must notify the buyer of any potential issues with the delivery at least 5 days in advance."
Paragraph 3: "Any disputes arising from this contract shall be resolved through arbitration, and both parties agree to abide by the decision of the arbitrator. The arbitration process shall take place in the jurisdiction where the buyer is located."
Let's walk through a specific example to illustrate the challenge of balancing chunk sizes in the context of a legal document.
Improper Chunking Example
Scenario 1: Chunking Too Small
Let’s say we chunk the document into very small pieces, such as by sentence:
Impact of Chunking Too Small
Lack of Context: If the AI retrieves only Chunk 3 ("In the event of any delays, the buyer reserves the right to cancel the contract."), it might miss the conditions under which this cancellation can occur, leading to an incomplete understanding.
Fragmented Information: The AI may struggle to connect the small pieces of information, especially if it retrieves non-consecutive chunks. For example, retrieving Chunk 4 (about the refund) without Chunk 1 (delivery timing) loses the relationship between delivery issues and the right to a refund.
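To see what the sentence-level chunks look like, here is a rough sketch that splits Paragraphs 1 and 2 of the excerpt by sentence. The splitter is deliberately naive. It assumes a sentence ends with ".", "!" or "?" followed by whitespace, which will break on abbreviations such as "Inc." in real contracts.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


excerpt = (
    "The parties agree that the seller will deliver the goods to the buyer "
    "within 30 days of the purchase date. The seller shall bear all costs "
    "associated with the delivery. In the event of any delays, the buyer "
    "reserves the right to cancel the contract. Furthermore, the buyer is "
    "entitled to a full refund if the goods are not delivered in the "
    "condition agreed upon in this contract. The seller must notify the "
    "buyer of any potential issues with the delivery at least 5 days in "
    "advance."
)

chunks = split_sentences(excerpt)
for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk}")
```

Chunk 3 (the cancellation right) and Chunk 4 (the refund) each read fine in isolation, but neither carries the 30-day delivery term from Chunk 1 that they depend on.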
Scenario 2: Chunking Too Large
Now, let’s chunk the document into one large piece:
Chunk 1: "The parties agree that the seller will deliver the goods to the buyer within 30 days of the purchase date. The seller shall bear all costs associated with the delivery. In the event of any delays, the buyer reserves the right to cancel the contract. Furthermore, the buyer is entitled to a full refund if the goods are not delivered in the condition agreed upon in this contract. The seller must notify the buyer of any potential issues with the delivery at least 5 days in advance. Any disputes arising from this contract shall be resolved through arbitration, and both parties agree to abide by the decision of the arbitrator. The arbitration process shall take place in the jurisdiction where the buyer is located."
Impact of Chunking Too Large
4. Semantic Coherence
Semantic coherence is a problem in chunking for AI because it involves ensuring that each chunk of text maintains a meaningful and contextually relevant segment of the original document.
Original News Article Excerpt:
"A major storm hit the city yesterday, causing widespread damage. The storm knocked down power lines, leaving thousands without electricity. In response, the local government has declared a state of emergency and is working to restore power as quickly as possible."
Improper Chunking Example:
Suppose we use fixed-size chunking that splits the article into two chunks based on word count:
Chunk 1: "A major storm hit the city yesterday, causing widespread damage. The storm knocked down power lines, leaving thousands without electricity."
Chunk 2: "In response, the local government has declared a state of emergency and is working to restore power as quickly as possible."
Question:
"What actions did the local government take in response to the storm?"
Potential AI Response with Improper Chunking:
If the AI only retrieves Chunk 1: The AI might say, "The storm knocked down power lines, leaving thousands without electricity," but it won’t mention what the government did in response because that information is in Chunk 2.
If the AI only retrieves Chunk 2: The AI might say, "The local government declared a state of emergency and is working to restore power," but it might not clearly connect this action to the storm because the storm is only described in Chunk 1.
Read this blog post to learn more.
5. Overlapping of Information
Overlapping of information is a problem in chunking for AI because it can lead to inefficiencies and inaccuracies in data processing and retrieval. Overlapping chunks contain repeated information, and this redundancy can increase storage requirements and computational costs because the same text is processed multiple times. Overlapping information can also create confusion about which context a repeated passage belongs to. While some overlap might be necessary to preserve context across chunks, finding the right balance is crucial.
Let's consider an example involving overlapping chunks of information:
Original Text:
"A car engine works by converting fuel into energy. The fuel is ignited in the engine's cylinders, causing small explosions that push the pistons. This movement of the pistons turns the crankshaft, which ultimately powers the car's wheels."
Overlapping Chunking Example:
To maintain context, you might create two overlapping chunks:
Chunk 1: "A car engine works by converting fuel into energy. The fuel is ignited in the engine's cylinders, causing small explosions that push the pistons."
Chunk 2: "The fuel is ignited in the engine's cylinders, causing small explosions that push the pistons. This movement of the pistons turns the crankshaft, which ultimately powers the car's wheels."
Duplication Issue
Why It’s a Problem
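The duplication is easy to measure. The sketch below builds the two overlapping chunks from the car engine text with a sliding window of two sentences that overlaps by one, then compares how much text is stored with how much text there was originally. The function name sliding_window and its parameters are assumptions made for this illustration.

```python
def sliding_window(sentences: list[str], window: int, overlap: int) -> list[str]:
    """Group sentences into chunks of `window`, sharing `overlap` sentences."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last sentence is covered, stop here
    return chunks


sentences = [
    "A car engine works by converting fuel into energy.",
    "The fuel is ignited in the engine's cylinders, causing small "
    "explosions that push the pistons.",
    "This movement of the pistons turns the crankshaft, which ultimately "
    "powers the car's wheels.",
]

chunks = sliding_window(sentences, window=2, overlap=1)

original_chars = len(" ".join(sentences))
stored_chars = sum(len(chunk) for chunk in chunks)
print(f"original: {original_chars} chars, stored: {stored_chars} chars")
```

The middle sentence is stored, embedded, and possibly retrieved twice. On a three-sentence text that is harmless, but the same ratio applied to a large corpus adds up in storage and processing.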
6. Boundary Detection
Boundary detection is a problem in chunking for AI due to several challenges associated with accurately identifying where to split text while maintaining semantic coherence and relevance.
For example:
Original Recipe Instructions: "To make the cake, first mix the flour, sugar, and eggs in a large bowl until smooth. Then, pour the mixture into a baking pan and bake at 350°F for 30 minutes."
Improper Chunking Example:
Imagine the instructions are split into two chunks based on a fixed word count:
Chunk 1: "To make the cake, first mix the flour, sugar, and eggs in a large bowl until smooth."
Chunk 2: "Then, pour the mixture into a baking pan and bake at 350°F for 30 minutes."
Question to AI:
"How do I bake the cake?"
AI’s Potential Answer with Improper Chunking:
If the AI retrieves only Chunk 1: "To bake the cake, first mix the flour, sugar, and eggs in a large bowl until smooth."
If the AI retrieves only Chunk 2: "Pour the mixture into a baking pan and bake at 350°F for 30 minutes."
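One way to soften this is to pack whole sentences into a chunk until a word budget is reached, so a short procedure stays together when it fits. This is a minimal sketch under a naive assumption about where sentences end. The function name pack_sentences and the budgets used are made up for illustration.

```python
import re


def pack_sentences(text: str, max_words: int) -> list[str]:
    """Fill chunks with whole sentences, never exceeding max_words if avoidable."""
    # Naive boundary rule: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            # The next sentence does not fit: close the chunk at a boundary.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


recipe = (
    "To make the cake, first mix the flour, sugar, and eggs in a large bowl "
    "until smooth. Then, pour the mixture into a baking pan and bake at "
    "350°F for 30 minutes."
)

print(pack_sentences(recipe, max_words=40))  # both steps fit in one chunk
print(pack_sentences(recipe, max_words=20))  # the steps are separated
```

With a budget of 40 words, the mixing and baking steps stay in one chunk and the question gets a complete answer. With a budget of 20, the recipe is split again, which shows that boundary detection and chunk size have to be tuned together.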
7. Handling Non-Textual Data
Handling non-textual data is a problem in chunking for AI due to several challenges related to the diverse nature and structure of such data. Non-textual data includes images, audio, video, and other multimedia formats, each with its own structure and characteristics. Chunking strategies that work for text may not be directly applicable to these formats, requiring specialized approaches to effectively segment and process them.
For example:
Text: "A recent study investigated the impact of different diets on weight loss. Participants followed either a low-carb diet or a low-fat diet for six months. The table below shows the average weight loss for each group."
Table:
Improper Chunking Example:
Imagine the text and the table are chunked separately:
Chunk 1 (Text Only): "A recent study investigated the impact of different diets on weight loss. Participants followed either a low-carb diet or a low-fat diet for six months."
Chunk 2 (Table Only): Table data
Question to AI:
"How effective was the low-carb diet in the study?"
AI’s Potential Answer with Improper Chunking:
If the AI retrieves only Chunk 1: "The study investigated the impact of different diets on weight loss, including a low-carb diet." (But it doesn’t provide the specific results.)
If the AI retrieves only Chunk 2: "The average weight loss for the low-carb diet was 10 kg." (But it doesn’t explain that this data is from a study comparing different diets.)
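A common workaround is to turn each table row into a small text chunk that carries a caption with it, so the numbers never travel without their context. The sketch below is an assumption about how that could look. The caption is paraphrased from the text above, and only the low-carb row is included, using the 10 kg figure from the example answer, because the original table is not reproduced here.

```python
def rows_to_chunks(caption: str, rows: list[dict[str, str]]) -> list[str]:
    """Serialize each table row as text, prefixed with the table's caption."""
    chunks = []
    for row in rows:
        facts = ", ".join(f"{column}: {value}" for column, value in row.items())
        chunks.append(f"{caption} {facts}")
    return chunks


# Caption paraphrased from the surrounding text (hypothetical wording).
caption = (
    "Study comparing a low-carb diet and a low-fat diet over six months; "
    "average weight loss per group."
)

# Only the row mentioned in the example answer; other rows are omitted.
rows = [{"diet": "low-carb", "average weight loss": "10 kg"}]

chunks = rows_to_chunks(caption, rows)
print(chunks[0])
```

Each chunk now says both what was measured and where the number comes from, so retrieving it alone still produces a grounded answer.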
8. Impact on Retrieval Performance
Suppose you’re searching for information about a company’s financial performance in a large annual report. If the report was chunked poorly, you might get a chunk that mentions revenue but not the context or explanation of why it changed. This leads to incomplete answers and makes the search less effective.
Here's an example of how the fragmented chunks might look:
Improper Chunking Example:
Chunk 1: "In 2023, the company's revenue increased by 15% compared to the previous year."
Chunk 2: "The strategic initiatives implemented in the second quarter played a significant role in driving this growth. These initiatives included expanding into new markets, launching innovative products, and enhancing customer engagement through digital platforms."
Question: "How did Company X perform during the financial year 2023?"
AI’s Potential Answer with Improper Chunking:
In this example, if you only have access to Chunk 1, you would know that the revenue increased by 15% in 2023, but you wouldn't understand why this change occurred. The context and explanation for the revenue increase are provided in Chunk 2, which discusses strategic initiatives like market expansion and product launches. Without both chunks, your understanding of the company's financial performance is incomplete.
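Here is a small sketch of why Chunk 2 gets missed and one way to recover it. It scores chunks by how many words they share with the question, which is only a crude stand-in for the embedding similarity a real RAG system would use. It then optionally pulls in the neighbors of the best chunk. The function names and parameters are assumptions for this illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, used as a crude similarity signal."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 1,
             neighbors: int = 0) -> list[str]:
    """Return the top-k chunks by word overlap, plus adjacent chunks."""
    question_words = tokens(question)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: len(question_words & tokens(chunks[i])),
        reverse=True,
    )
    keep = set()
    for index in ranked[:k]:
        # Also keep up to `neighbors` chunks on each side of a hit.
        for j in range(index - neighbors, index + neighbors + 1):
            if 0 <= j < len(chunks):
                keep.add(j)
    return [chunks[i] for i in sorted(keep)]


chunks = [
    "In 2023, the company's revenue increased by 15% compared to the "
    "previous year.",
    "The strategic initiatives implemented in the second quarter played a "
    "significant role in driving this growth. These initiatives included "
    "expanding into new markets, launching innovative products, and "
    "enhancing customer engagement through digital platforms.",
]
question = "How did Company X perform during the financial year 2023?"

print(retrieve(question, chunks, k=1))               # the revenue figure only
print(retrieve(question, chunks, k=1, neighbors=1))  # figure plus explanation
```

Chunk 2 never mentions the year or the company, so on its own it looks unrelated to the question. Pulling in neighbors fixes this case, at the price of sending more text to the LLM.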
9. Balancing Granularity and Generalization
Balancing granularity and generalization in chunking for AI is a problem due to the need to optimize the trade-off between detailed segmentation and maintaining broader context. Achieving the right level of granularity is crucial to preserving the semantic integrity of the data. If chunks are too granular, they may lose important contextual information, making it difficult for AI systems to understand the broader meaning. Conversely, if chunks are too generalized, they may include irrelevant information, reducing the precision of the AI's output.
For example, imagine an article that explains how to set up Wi-Fi. Let's see how the two choices affect the answer:
Option 1: Detailed Task-Based Chunking
Option 2: Section-Based Chunking
Example Question:
"How do I connect my smartphone to a Wi-Fi network?"
AI's Potential Answer (Using Detailed Task-Based Chunking):
"1. Open the Settings app on your smartphone. 2. Tap on 'Wi-Fi' from the list of options. 3. Ensure Wi-Fi is turned on by toggling the switch. 4. Select your desired Wi-Fi network from the list. 5. Enter the network password if prompted. 6. Tap 'Connect.'"
AI's Potential Answer (Using Section-Based Chunking):
"To connect to a Wi-Fi network, go to the 'Settings' section of your smartphone. Under 'Wi-Fi Settings,' make sure Wi-Fi is turned on. Select the desired network from the list of available networks, enter the password if required, and tap 'Connect.' You can also adjust other network-related settings like VPN and proxy settings in this section. For Bluetooth, display, or sound settings, navigate to the respective sections."
Key Difference:
10. Increased Processing Costs
Inefficient chunking can lead to increased computational costs, as more chunks may need to be processed to cover the same amount of information. On the other hand, advanced chunking strategies, such as those based on semantic content or topic modeling, can also be computationally expensive and slow.
Simple Chunking (Sentence-Based): Inexpensive to implement, with a low processing cost. It's often used for unstructured documents and is suitable for a wide range of text-based documents. The processing cost of sentence-based chunking is effectively zero.
Semantic / Smart Chunking: Semantic chunking methods, such as semantic percentile-based and double-pass merging, incur higher costs due to their complexity and computational requirements. The cost difference between simple and advanced chunking methods can be substantial, potentially orders of magnitude, especially when using sophisticated models like GPT-4.
Unstructured’s Smart Chunking takes into account the semantic structure and content of the documents, offering several strategies which differ in how they guarantee the purity of content within chunks.
As their blogpost concludes, “Chunking is one of the essential preprocessing steps in any RAG system. The choices you make when you set it up, will influence the retrieval quality, and as a consequence, the overall performance of the system.”
Hopefully, you found this post useful. I encourage you to read and follow unstructured.io - one of the leaders in this space.
#UnstructuredData #DataManagement #RAGModel #DataRetrieval #AIandData #DataInnovation #NaturalLanguageProcessing #AIAutomation #KnowledgeManagement #DataScience #AIResearch #DataDrivenAI #TextMining #SemanticSearch #AIModels #DataTransformation #InformationRetrieval #MachineLearning #DataStrategy #AIBlog