The Effects of Data Noise on the Efficiency of Vector Search Algorithms
Data cleaning for vector search

Data cleaning is the task every data scientist loves to hate but can’t live without. Like a reliable shower, it’s not glamorous, but it’s essential if you don’t want to stink up the joint.

This blog post explores the impact of noise on vector search performance. It uses Python code to calculate the cosine similarity difference between vectors of noisy vs. clean data.

The Hypothesis: Clean Data, Better Vectors

Before we dive into the nitty-gritty, let’s establish the premise: Noise in text data negatively impacts the performance of vector search. In simpler terms, garbage in equals garbage out.

But if we clean up that garbage, we might find some treasure. I ran a series of experiments to test this hypothesis, comparing noisy vs. clean text data performance in a vector search scenario.

Experiment Setup

For this experiment, I used a series of clean and noisy text samples. The noise came in various flavors: HTML tags, URLs, PII (personally identifiable information), excessive punctuation, and the dreaded mixed casing.
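To make the noise flavors concrete, here is a minimal cleaning sketch using only Python's standard library. The function name and the regex patterns are my own illustrative assumptions, not the exact cleaning code from the experiment; a production pipeline would likely use a proper HTML parser and more robust PII detection.

```python
import re

def clean_text(text: str) -> str:
    """Illustrative cleaner: strips HTML tags, URLs, email-style PII,
    runs of punctuation, and mixed casing. Assumptions, not the
    experiment's exact code."""
    text = re.sub(r"<[^>]+>", " ", text)                      # HTML tags
    text = re.sub(r"https?://\S+", " ", text)                 # URLs
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", " ", text)  # email-like PII
    text = re.sub(r"([!?.]){2,}", r"\1", text)                # excessive punctuation
    text = re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
    return text.lower()                                       # normalize casing

noisy = '<p>Welcome to our site!!! Visit <a href="https://example.com">this link</a></p>'
print(clean_text(noisy))  # → welcome to our site! visit this link
```

Each regex handles one noise flavor from the list above, which makes it easy to toggle individual cleaning steps on and off when measuring their effect.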

Each sample was paired with a relevant query, and I generated the embeddings using a pre-trained model (`all-MiniLM-L6-v2`) from the sentence-transformers library.

The Queries and Embedding Dataset

Here's a breakdown of the queries and the embedding dataset used for analysis. I have only provided sample chunks here rather than the complete dataset.

LinkedIn does not support tables, so we'll have to make do with screenshots instead.

List of sample chunks we cleaned before generating embeddings:

Sample data with noise


List of queries against noisy and clean data:

Search queries used in the experiment

The noisy samples contain various forms of noise, while the clean samples have been stripped of unnecessary content to enhance clarity and relevance.

The Code Breakdown

This code helped generate the insights we're about to explore.

The complete sample code can be found on GitHub. Feel free to copy, paste, and tinker with it in your next project.

Python code to calculate the difference in Cosine similarity between noisy and clean data
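The complete script is on GitHub; as a stand-in for the screenshot, here is a minimal sketch of the core calculation. The `cosine_similarity` helper and the tiny three-dimensional vectors are illustrative assumptions: in the real experiment, the embeddings come from `all-MiniLM-L6-v2` via sentence-transformers (`model.encode(...)`) and are 384-dimensional.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Tiny made-up vectors standing in for real 384-dim MiniLM embeddings.
query_emb = [0.1, 0.3, 0.5]
noisy_emb = [0.2, 0.1, 0.4]
clean_emb = [0.1, 0.3, 0.6]

noisy_sim = cosine_similarity(query_emb, noisy_emb)
clean_sim = cosine_similarity(query_emb, clean_emb)
print(f"noisy={noisy_sim:.3f} clean={clean_sim:.3f} diff={clean_sim - noisy_sim:+.3f}")
```

The experiment repeats this comparison for every query/chunk pair, and the sign of the difference tells you whether cleaning helped or hurt relevance for that pair.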


Output: Chart of Similarity Difference Between Embeddings of Noisy Vs. Clean Data

Below is a snapshot of the outcome table that summarizes the results of this experiment:

This table shows how similarity scores shifted between noisy and clean data, highlighting areas where cleaning had a significant impact and where it didn't.

Output from running the code.

The Results: Where the Rubber Meets the Road

In the experiments, we saw varied effects of data cleaning on vector search performance. Some clean text became significantly more relevant to the query, while in other cases, relevance decreased.

In addition, there were scenarios where cleaning made little difference in similarity but still provided other benefits, such as reducing token usage. Here’s a breakdown of what we found:

1. Significant Similarity Increases

In cases where the similarity score increased substantially (e.g., +0.06), cleaning the text made it significantly more relevant to the query. This indicates that noise was detracting from the model's ability to focus on the core content.

Example: Query - "How noise impacts retrieval models"

  • Noisy Similarity: 0.509
  • Clean Similarity: 0.639
  • Difference: +0.130

Matched Text (Noisy): <p>Welcome to our site!!! Visit <a href="https://example.com">this link</a> for more information...

In this case, cleaning the text—removing HTML tags, URLs, and excessive punctuation—allowed the model to focus on the critical content of data quality and retrieval models, significantly improving relevance.

2. Significant Similarity Decreases

In some cases, cleaning reduced the text's relevance to the query. This suggests that certain noisy elements, like specific details (e.g., contact information or numerical data), may have contributed to the relevance in ways that weren't obvious initially.

Example: Query - "Improving machine learning model accuracy"

  • Noisy Similarity: 0.499
  • Clean Similarity: 0.415
  • Difference: -0.084

Matched Text (Noisy): Machine learning is transforming industries. Machine learning is transforming industries...

Here, the repetition and detail in the noisy text contributed to a higher similarity with the query. Cleaning removed some of these details, making the text less relevant.

3. Minimal Differences in Similarity Scores

In some instances, cleaning didn’t drastically change the similarity score. This might seem insignificant at first, but it offers substantial benefits, especially in enterprise use cases: fewer tokens processed by the embedding model and LLMs mean reduced costs and improved efficiency.

Even a tiny reduction in token usage can make a huge difference when scaled across millions of transactions.
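A quick back-of-the-envelope sketch of that scaling effect. Whitespace word counting is only a rough proxy here (real billing counts model-tokenizer tokens, e.g., via tiktoken), and the sample strings are illustrative, not from the experiment's dataset.

```python
def approx_tokens(text: str) -> int:
    # Rough proxy: whitespace-separated words. Real systems count
    # tokens from the model's tokenizer, not words.
    return len(text.split())

noisy = '<p>Welcome to our site!!! Visit <a href="https://example.com">this link</a> for more information</p>'
clean = "welcome to our site! visit this link for more information"

saved = approx_tokens(noisy) - approx_tokens(clean)
print(f"saved ~{saved} tokens per chunk")
print(f"~{saved * 1_000_000:,} tokens saved per million chunks")
```

Even a single-token saving per chunk becomes a million tokens at scale, which is the point of cleaning even when similarity scores barely move.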

Example: Query - "Standardizing data for text preprocessing"

  • Noisy Similarity: 0.486
  • Clean Similarity: 0.484
  • Difference: -0.003

Matched Text (Noisy): Text CLEANING!!! is CRITICAL for IMPROVING machine learning MODEL accuracy!!!...

In this scenario, cleaning didn't affect the relevance much. Still, it did reduce the token count by removing excessive punctuation and unnecessary emphasis, which is a win for optimizing token usage in large-scale applications.

Benefits of Cleaning Text for Enterprise Applications

Even when cleaning doesn’t directly improve relevance, it offers substantial benefits for enterprise applications:

  1. Cost Reduction: Reducing token usage in embeddings and LLMs can lead to significant cost savings, particularly when processing large amounts of data.
  2. Efficiency Gains: Cleaning helps streamline the data, making it faster to process and reducing the computational load on your infrastructure.
  3. Compliance and Security: Removing PII and other sensitive information is not just a best practice; it's often a regulatory requirement in industries like healthcare and finance.

Recommendations: When to Clean and When to Skip

When to Clean:

  • High-Stakes Applications: For critical data (e.g., legal, financial), meticulous cleaning ensures accuracy and compliance.
  • Noisy Data Sources: Web scraping, user-generated content, and other noisy sources require aggressive cleaning to enhance relevance.
  • Large-Scale Systems: Even minor improvements in token usage can lead to significant cost savings when scaled.

When to Skip Cleaning:

  • Mostly Clean Data: Additional cleaning may not provide significant benefits if your data is already clean.
  • Context-Sensitive Noise: In cases where noise (e.g., specific details or repetition) adds relevance, cleaning may inadvertently reduce accuracy or might not be worth your time.


Cleaning can significantly impact your projects, whether you want to improve relevance, reduce token usage, or comply with regulations. But, like any tool, knowing when and how to use it is essential.

Hopefully, this post gave you valuable insights into how to approach data cleaning for your RAG projects. If you found this useful or have any questions, please let me know in the comments below!
