The Effects of Data Noise on the Efficiency of Vector Search Algorithms
Data cleaning is the task every data scientist loves to hate but can't live without. Like a daily shower, it's not glamorous, but it's essential if you don't want to stink up the joint.
This blog post explores the impact of noise on vector search performance, using Python to measure the difference in cosine similarity between query embeddings and embeddings of noisy versus clean data.
The Hypothesis: Clean Data, Better Vectors
Before we dive into the nitty-gritty, let's establish the premise: noise in text data negatively impacts the performance of vector search. In simpler terms: garbage in, garbage out.
But if we clean up that garbage, we might find some treasure. I ran a series of experiments to test this hypothesis, comparing noisy vs. clean text data performance in a vector search scenario.
Experiment Setup
For this experiment, I used a series of clean and noisy text samples. The noise came in various flavors: HTML tags, URLs, PII (personally identifiable information), excessive punctuation, and the dreaded mixed casing.
Each sample was paired with a relevant query, and I generated the embeddings using a pre-trained model (`all-MiniLM-L6-v2`) from the sentence-transformers library.
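As a sketch of that step, embedding generation with sentence-transformers looks roughly like this. The sample chunk strings are made up for illustration; the query is one of those used in the experiment:

```python
from sentence_transformers import SentenceTransformer

# Load the same pre-trained model used in the experiment
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How noise impacts retrieval models"  # one of the experiment's queries
noisy_chunk = "<div>Noise REALLY hurts retrieval!!! See https://example.com</div>"  # illustrative
clean_chunk = "noise really hurts retrieval."  # illustrative

# encode() returns one 384-dimensional vector per input string
query_vec, noisy_vec, clean_vec = model.encode([query, noisy_chunk, clean_chunk])
print(query_vec.shape)  # (384,)
```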
The Queries and Embedding Dataset
Here's a breakdown of the queries and the embedding dataset used for the analysis. Only sample chunks are shown here; the complete data is available alongside the code on GitHub.
LinkedIn does not support tables, so we'll have to make do with screenshots instead.
List of sample chunks we cleaned before generating embeddings:
List of queries against noisy and clean data:
The noisy samples contain various forms of noise, while the clean samples have been stripped of unnecessary content to enhance clarity and relevance.
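For reference, here's a minimal cleaning pass covering those noise types. It's an illustrative sketch, not production-grade code; real pipelines typically use sturdier HTML parsing and PII detection:

```python
import re

def clean_text(text: str) -> str:
    """Strip the noise types used in this experiment from a text chunk."""
    text = re.sub(r"<[^>]+>", " ", text)                     # HTML tags
    text = re.sub(r"https?://\S+", " ", text)                # URLs
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "", text)  # email-style PII
    text = re.sub(r"([!?.]){2,}", r"\1", text)               # excessive punctuation
    text = re.sub(r"\s+", " ", text).strip()                 # collapse whitespace
    return text.lower()                                      # normalize mixed casing

print(clean_text("<p>Contact Jane at jane@example.com!!! More: https://example.com</p>"))
# contact jane at ! more:
```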
The Code Breakdown
This code helped generate the insights we're about to explore.
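Since the script isn't embedded here, in outline the comparison looks like this. It's a simplified sketch of the approach, not the full script from the repo, and the paired sample text is hypothetical:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical stand-in for the experiment's (query, noisy, clean) triples
pairs = [
    {
        "query": "How noise impacts retrieval models",
        "noisy": "<p>Noise hurts retrieval!!! Contact: jane@example.com</p>",
        "clean": "noise hurts retrieval.",
    },
]

for pair in pairs:
    # Embed the query and both versions of the text
    query_vec = model.encode(pair["query"], convert_to_tensor=True)
    noisy_vec = model.encode(pair["noisy"], convert_to_tensor=True)
    clean_vec = model.encode(pair["clean"], convert_to_tensor=True)

    # Cosine similarity of the query against each version
    sim_noisy = util.cos_sim(query_vec, noisy_vec).item()
    sim_clean = util.cos_sim(query_vec, clean_vec).item()

    print(f"noisy={sim_noisy:.4f}  clean={sim_clean:.4f}  "
          f"delta={sim_clean - sim_noisy:+.4f}")
```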
The complete sample code can be found on GitHub. Feel free to copy, paste, and tinker with it in your next project.
Output: Chart of Similarity Differences Between Embeddings of Noisy vs. Clean Data
Below is a snapshot of the outcome table that summarizes the results of this experiment:
This table shows how similarity scores shifted between noisy and clean data, highlighting areas where cleaning had a significant impact and where it didn't.
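If you want to reproduce a table like this, the per-query scores from the loop above can be collected into a DataFrame. Here's a sketch with pandas; the numbers are placeholders, not the experiment's actual results:

```python
import pandas as pd

# Placeholder scores; in practice, collect these from the comparison loop
rows = [
    {"query": "How noise impacts retrieval models", "sim_noisy": 0.52, "sim_clean": 0.58},
    {"query": "Improving machine learning model accuracy", "sim_noisy": 0.61, "sim_clean": 0.57},
    {"query": "Standardizing data for text preprocessing", "sim_noisy": 0.55, "sim_clean": 0.55},
]

df = pd.DataFrame(rows)
df["delta"] = (df["sim_clean"] - df["sim_noisy"]).round(4)
# Bucket each query by how much cleaning moved the score
df["impact"] = pd.cut(df["delta"], bins=[-1.0, -0.02, 0.02, 1.0],
                      labels=["decreased", "minimal", "increased"])
print(df.to_string(index=False))
```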
The Results: Where the Rubber Meets the Road
In the experiments, we saw varied effects of data cleaning on vector search performance. Some cleaned text became significantly more relevant to its query, while in other cases relevance decreased.
In addition, there were scenarios where cleaning made little difference in similarity but still provided other benefits, such as reducing token usage. Here’s a breakdown of what we found:
1. Significant Similarity Increases
In cases where the similarity score increased substantially (e.g., +0.06), cleaning the text made it significantly more relevant to the query. This indicates that noise was detracting from the model's ability to focus on the core content.
Example: Query - "How noise impacts retrieval models"
2. Significant Similarity Decreases
In some cases, cleaning reduced the text's relevance to the query. This suggests that certain noisy elements, like specific details (e.g., contact information or numerical data), may have contributed to the relevance in ways that weren't obvious initially.
Example: Query - "Improving machine learning model accuracy"
3. Minimal Differences in Similarity Scores
In some instances, cleaning didn't drastically change the similarity score. That might seem insignificant at first, but it still offers substantial benefits, especially in enterprise use cases: fewer tokens processed by the embedding model and LLMs mean reduced costs and improved efficiency.
Even a tiny reduction in token usage can make a huge difference when scaled across millions of transactions.
Example: Query - "Standardizing data for text preprocessing"
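To put numbers on those savings, you can compare token counts with the embedding model's own tokenizer. A quick sketch (the sample strings are hypothetical):

```python
from transformers import AutoTokenizer

# Tokenizer that backs the all-MiniLM-L6-v2 embedding model
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

noisy = "<p>Standardize YOUR data!!! More at https://example.com/preprocess</p>"  # hypothetical
clean = "standardize your data."

n_noisy = len(tokenizer.encode(noisy))
n_clean = len(tokenizer.encode(clean))
print(f"noisy: {n_noisy} tokens, clean: {n_clean} tokens, saved: {n_noisy - n_clean}")
```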
Benefits of Cleaning Text for Enterprise Applications
Even when cleaning doesn't directly improve relevance, it offers substantial benefits for enterprise applications:
- Lower costs: fewer tokens processed by embedding models and LLMs.
- Compliance: stripping PII helps meet privacy and regulatory requirements.
- Efficiency at scale: even small per-request savings compound across millions of transactions.
Recommendations: When to Clean and When to Skip
When to Clean:
- The text carries noise that adds no meaning, such as HTML tags, URLs, excessive punctuation, or inconsistent casing.
- Token costs matter at your scale, since cleaner text means fewer tokens per request.
- PII must be removed for privacy or regulatory reasons.
When to Skip Cleaning:
- The noisy-looking elements, such as contact details or numerical data, may actually carry relevance for your queries, as the similarity decreases above showed.
Cleaning can significantly impact your projects, whether your goal is to improve relevance, reduce token usage, or comply with regulations. But, like any tool, knowing when and how to use it is essential.
Hopefully, this post gave you valuable insights into how to approach data cleaning for your RAG projects. If you found this useful or have any questions, please let me know in the comments below!