Prompt Engineering - Chunking Strategies
LLMs' text-to-text format allows us to tackle a broad spectrum of tasks. This potential first became apparent with GPT-3, which demonstrated that sufficiently large language models could use few-shot learning to solve numerous tasks with impressive accuracy. However, as research on LLMs advanced, we began to explore more sophisticated prompting techniques beyond zero- and few-shot learning.
The Need for Advanced Prompting Techniques
Instruction-following LLMs such as InstructGPT and ChatGPT led us to investigate the extent of tasks that LLMs could handle. The goal was to move beyond elementary problems and employ LLMs for more complex tasks. For the LLMs to be practically useful, they needed to be capable of following complex instructions and performing multi-step reasoning to answer intricate questions accurately. Unfortunately, basic prompting techniques could not solve such complex problems. Hence, the need for more sophisticated methods like chunking emerged.
The Power of Chunking
In the context of building LLM-related applications, chunking refers to the process of dividing large text segments into smaller, more manageable parts. This strategy is crucial for optimizing the relevance of the content we retrieve from a vector database once that content has been embedded.
In semantic search, we index a collection of documents, each containing valuable information on a specific topic. By using an effective chunking strategy, we can ensure our search results accurately capture the essence of the user's query. If our chunks are too small or too large, it may lead to imprecise search results or overlooked opportunities to surface relevant content.
Short and Long Content Embedding
When embedding our content, we can expect different behaviors based on whether the content is short (like sentences) or long (like paragraphs or entire documents).
Embedding a sentence produces a vector focused on that sentence's precise meaning, so comparisons against other sentence embeddings naturally happen at that level. However, the embedding might miss out on broader contextual information found in a paragraph or document.
On the other hand, when a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text. However, larger input text sizes may introduce noise or diminish the significance of individual sentences or phrases, making it more challenging to find precise matches when querying the index.
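To illustrate the difference, here is a minimal sketch that embeds the same content at sentence and paragraph granularity and compares each against a query. The sentence-transformers library and the all-MiniLM-L6-v2 model are illustrative assumptions; any sentence-embedding model makes the same point.

from sentence_transformers import SentenceTransformer, util

# Illustrative model choice, not a recommendation.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What did the fox do?"
sentence = "The quick brown fox jumped over the lazy dog."
paragraph = (
    "The quick brown fox jumped over the lazy dog. "
    "It then trotted into the forest as the sun set behind the hills."
)

query_vec, sentence_vec, paragraph_vec = model.encode([query, sentence, paragraph])

# The sentence embedding tends to match the query more tightly, while the
# paragraph embedding captures broader context at some cost in precision.
print("query vs sentence: ", util.cos_sim(query_vec, sentence_vec).item())
print("query vs paragraph:", util.cos_sim(query_vec, paragraph_vec).item())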
Chunking Considerations
Determining the optimal chunking strategy depends on factors that vary with the use case: the nature and structure of the content being indexed, the embedding model you plan to use, and the expected length and complexity of user queries all influence the choice.
Different Chunking Methods
There are several chunking methods, each suitable for different situations. Let's explore each of them:
Phrase-Based Chunking:
What is it?
Phrase-based chunking involves dividing text into phrases or multi-word units (MWUs) based on linguistic patterns, context, and semantic relationships. This approach recognizes that words often co-occur in specific ways to convey meaning, making phrases a fundamental unit of language.
How does it work?
1. Tokenization: Break down the input text into individual tokens (words or subwords).
2. Pattern recognition: Analyze token sequences using linguistic patterns, such as:
* Syntactic structures: Identify phrase boundaries based on grammatical relationships between words.
* Semantic roles: Recognize phrases that play specific semantic roles in a sentence (e.g., subject, object, modifier).
3. Contextual analysis: Consider the surrounding context to disambiguate ambiguous tokens and refine phrase boundaries.
Examples
1. "The quick brown fox" is a single phrase with a clear meaning.
2. The phrase "in the morning" plays a specific semantic role (temporal specification) in a sentence.
3. The phrase "New York City" refers to a specific location, rather than being two separate entities.
Benefits
1. Captures complex relationships: Phrase-based chunking can identify nuanced relationships between words that might be missed by simpler tokenization approaches.
2. Improves information retrieval: By recognizing phrases as meaningful units, you can enhance search results and retrieve more relevant documents.
3. Enhances text analysis: This approach enables better sentiment analysis, entity recognition, and topic modeling.
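To make the steps above concrete, here is a minimal sketch of phrase-based chunking using spaCy's noun-chunk detection as a stand-in for multi-word-unit recognition. The en_core_web_sm model and the use of noun chunks are illustrative assumptions, not the only way to implement this.

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def phrase_chunks(text: str) -> list[str]:
    # Tokenize and parse the text, then return the multi-word noun phrases spaCy detects.
    doc = nlp(text)
    return [chunk.text for chunk in doc.noun_chunks]

print(phrase_chunks("The quick brown fox visited New York City in the morning."))
# e.g. ['The quick brown fox', 'New York City', 'the morning']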
Dependency-Based Chunking:
What is it?
Dependency-based chunking involves analyzing sentence structure by identifying grammatical dependencies between words and grouping them into meaningful units.
How does it work?
1. Tokenization: Break down the input text into individual tokens.
2. Dependency parsing: Analyze token sequences using dependency relationships, such as:
* Subject-Verb-Object (SVO) structures
* Modifiers and their modified words
3. Chunking: Group tokens based on their dependencies to form meaningful units.
Examples
1. "The dog chased the cat" has a clear SVO structure: [Subject] The dog [ Verb ] chased [ Object ] the cat.
2. In the sentence "John gave Mary a book", we can identify dependencies between words:
* gave (verb) → John (subject)
* gave (verb) → Mary (indirect object)
* gave (verb) → book (direct object), with the article "a" depending on "book"
Benefits
1. Captures complex grammatical relationships: Dependency-based chunking recognizes intricate sentence structures, enabling better text analysis and understanding.
2. Improves machine translation: By analyzing dependencies between words, you can enhance machine translation systems to produce more accurate translations.
3. Enhances question answering: This approach enables better identification of answerable questions by recognizing the relationships between words in a query.
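As a rough illustration, the sketch below uses spaCy's dependency parser to inspect the relationships a dependency-based chunker would group on. The exact relation labels depend on the parser and model, which are assumptions here.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John gave Mary a book.")

# Each token points to the word it depends on (its head) with a relation label;
# a dependency-based chunker groups tokens that share a head into one unit.
for token in doc:
    print(f"{token.text:>5} --{token.dep_}--> {token.head.text}")

# Typical output: John --nsubj--> gave, Mary --dative--> gave,
# book --dobj--> gave, a --det--> book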
Fixed-size Chunking
This is the most common and straightforward approach to chunking. We decide the number of tokens in our chunk and, optionally, whether there should be any overlap between them. Here's an example of performing fixed-size chunking with LangChain:
text = "..." # your text
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
separator = "\n\n",
chunk_size = 256,
chunk_overlap = 20
)
docs = text_splitter.create_documents([text])
Content-Aware Chunking
These methods take advantage of the nature of the content we're chunking and apply more sophisticated chunking to it. Here are some examples:
Sentence Splitting
Many models are optimized for embedding sentence-level content. We can use sentence chunking, and there are several approaches and tools available to do this, including:
text = "..." # your text
docs = text.split(".")
text = "..." # your text
from langchain.text_splitter import NLTKTextSplitter
text_splitter = NLTKTextSplitter()
docs = text_splitter.split_text(text)
text = "..." # your text
from langchain.text_splitter import SpacyTextSplitter
text_splitter = SpaCyTextSplitter()
docs = text_splitter.split_text(text)
Recursive Chunking
Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner. If the initial attempt at splitting the text doesn't produce chunks of the desired size or structure, the method recursively calls itself on the resulting chunks with a different separator or criterion until the desired chunk size or structure is achieved.
text = "..." # your text
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size = 256,
chunk_overlap = 20
)
docs = text_splitter.create_documents([text])
Specialized Chunking
Markdown and LaTeX are two examples of structured and formatted content you might run into. In these cases, you can use specialized chunking methods to preserve the original structure of the content during the chunking process.
from langchain.text_splitter import MarkdownTextSplitter
markdown_text = "..."
markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
docs = markdown_splitter.create_documents([markdown_text])
from langchain.text_splitter import LatexTextSplitter
latex_text = "..."
latex_splitter = LatexTextSplitter(chunk_size=100, chunk_overlap=0)
docs = latex_splitter.create_documents([latex_text])
Determining the Optimal Chunk Size
Determining an optimal chunk size for your use case can be challenging. In practice, it usually comes down to experimenting: preprocess your data, try a range of chunk sizes, and evaluate retrieval quality against representative queries before settling on one.
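As a starting point, the sketch below compares a few candidate chunk sizes on a sample of your corpus before committing to one. The candidate sizes and overlap are illustrative assumptions.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "..."  # a representative sample of your corpus

for chunk_size in (128, 256, 512, 1024):
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=20)
    docs = splitter.create_documents([text])
    # Inspect the number of chunks and how coherent each one reads, then
    # evaluate retrieval quality against representative queries.
    print(chunk_size, len(docs))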
Conclusion
Chunking your content is simple in most cases - but it could pose some challenges when you start venturing off the beaten path. There's no one-size-fits-all solution to chunking, so what works for one use case may not work for another. Understanding the different chunking methods and considerations can help you craft a more effective strategy for your specific needs, ensuring that your language model performs at its best.
P.S.: Some more chunking strategies to ponder over
Token Pruning: Remove unnecessary filler words, punctuation, or repetitive phrases from the context to save tokens.
Adaptive Token Limit: Dynamically adjust the token limit for the model based on the complexity of the incoming query, allowing for more tokens for more complex queries.
Summary Injection: Use summarized versions of earlier conversation context to reintroduce important details without using too many tokens.
Priority-based Inclusion: For long conversations, include only the most crucial parts of the conversation in the current chunk, perhaps based on keyword importance.
Response Trimming: After generating a response, trim unnecessary tokens before sending it back to the user to save space for future interactions.
Parallel Chunking: For queries that can be broken down into smaller independent sub-queries, process multiple chunks in parallel and then aggregate the results.
Conditional Chunking: Implement logic to determine when chunking is necessary; e.g., simple queries might not need chunking, saving computational resources.
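As one example, here is a minimal sketch of conditional chunking: the text is only split when it exceeds a token budget. The 256-token budget, the tiktoken encoding, and the character-based splitter settings are illustrative assumptions.

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

enc = tiktoken.get_encoding("cl100k_base")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

def maybe_chunk(text: str, token_budget: int = 256) -> list[str]:
    # Short inputs skip chunking entirely, saving computational resources.
    if len(enc.encode(text)) <= token_budget:
        return [text]
    # Note: chunk_size above is measured in characters, not tokens.
    return splitter.split_text(text)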
A note on model choice: when chunking is part of your strategy, GPT-4 8K is generally the more cost-effective option compared to GPT-4 32K, since its per-token price is lower and chunking keeps each prompt within the smaller context window. GPT-4 32K is better suited to cases where a large amount of context must fit into a single prompt and cannot easily be chunked.