Shrink Your Embeddings: Slashing Costs with MRL and BQL
Let's face it: vector embeddings are fantastic for many tasks, but if you've ever worked with large-scale vector search, you know the pain of watching your storage and compute costs skyrocket. But what if I told you there's a way to put your embeddings on a diet without sacrificing too much of their performance?
It's almost 2025, and I'm still seeing AI developers use chunky, full-precision float embeddings like it's 2020. It's like paying for an expensive golden hammer when a simple hammer would give you the same results: you're burning compute and storage for minimal accuracy gain.
The Fat Float Embedding Dilemma
Vector search is powering everything from recommendation systems to semantic search. And with that comes a tsunami of data and ballooning costs. So why aren't more developers jumping on the embedding compression bandwagon? My guess? They either haven't learned these techniques yet or think they're too complex to implement. And if you're still on the fence, let me show you what you're missing out on.
The Power of Compact Embedding Representations
Two techniques made waves in the world of embeddings and vector search in 2024:
- Matryoshka Representation Learning (MRL)
- Binary Quantization Learning (BQL)
These aren't just fancy acronyms – they're the path to slimmer, more efficient embedding vector representations. Let's break them down.
Matryoshka Representation Learning (MRL)
Think of MRL like those Russian nesting dolls (yes, that's where the name comes from). Instead of one fixed-size embedding, you get a hierarchy. Want the best possible accuracy? Use all the dimensions. Need something lighter for a small percentage drop in accuracy? Just grab the first 100 or so.
Key benefits:
- One model, many sizes: truncate the same embedding to whatever dimension count fits your accuracy budget
- Big storage and compute savings for a small percentage drop in accuracy
- No separate models to train or maintain for each embedding size
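As a minimal sketch (using numpy, with a random vector standing in for a real MRL-trained embedding), truncation is just slicing off the leading dimensions and re-normalizing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a 1024-dim embedding from an MRL-trained model
full = rng.standard_normal(1024).astype(np.float32)

def truncate(v: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize for cosine similarity."""
    head = v[:dims]
    return head / np.linalg.norm(head)

small = truncate(full, 128)
print(small.shape)  # (128,)
```

Note that this only works well when the model was trained with an MRL objective; slicing an ordinary embedding this way loses much more accuracy.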
Binary Quantization Learning (BQL)
If MRL is about selectively downsizing dimensions, BQL is about compressing the value in each dimension. It turns your float vectors into binary vectors of just 0s and 1s.
Key benefits:
- 32x compression: one bit per dimension instead of a 32-bit float
- Cheap similarity: Hamming distance over packed bits instead of float dot products
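Here's a minimal numpy sketch of the storage side of binary quantization (the sign-threshold and bit-packing below are illustrative, not the exact training-time scheme):

```python
import numpy as np

def binarize(v: np.ndarray) -> np.ndarray:
    """Threshold each dimension at 0 and pack 8 bits per byte."""
    bits = (v > 0).astype(np.uint8)
    return np.packbits(bits).view(np.int8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two packed binary vectors."""
    return int(np.unpackbits((a ^ b).view(np.uint8)).sum())

rng = np.random.default_rng(1)
x, y = rng.standard_normal(1024), rng.standard_normal(1024)
bx, by = binarize(x), binarize(y)
print(bx.nbytes)  # 128 bytes, down from 4096 for float32
```

The Hamming distance reduces to XOR plus popcount, which is why binary vectors are so cheap to compare.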
Combining MRL and BQL
Combine these two techniques, and you've got a powerhouse of efficiency. They're fully compatible: first MRL truncates the number of dimensions, then BQL reduces the precision of each remaining dimension. MRL gives you flexibility in the number of dimensions, while BQL gives you flexibility in the precision per dimension.
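Putting the two together is a short sketch: truncate first (MRL), then binarize (BQL). Compressing 1024 float dims to 512 binary dims yields 64 bytes per vector, which matches the tensor<int8>(x[64]) Vespa field shown later in this post:

```python
import numpy as np

def compress(v: np.ndarray, dims: int = 512) -> np.ndarray:
    head = v[:dims]                     # MRL: keep the leading dimensions
    bits = (head > 0).astype(np.uint8)  # BQL: one bit per remaining dimension
    return np.packbits(bits).view(np.int8)

v = np.random.default_rng(2).standard_normal(1024).astype(np.float32)
print(compress(v).nbytes)  # 64 bytes per vector
```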
Storage savings: A concrete example
Let's put some numbers to this. Imagine you're storing 1 billion 1024-dimensional vectors:
- Full float32 vectors: 1024 dims × 4 bytes = 4,096 bytes each, roughly 4 TB in total
- MRL down to 512 dims, then binarized: 512 bits = 64 bytes each, just 64 GB in total
That's right – you're looking at cost savings of up to 64x.
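The arithmetic behind that number, assuming 1024 float32 dimensions compressed to 512 binary dimensions:

```python
n = 1_000_000_000                 # one billion vectors
float32_bytes = 1024 * 4          # 4096 bytes per full-precision vector
binary_bytes = 512 // 8           # 64 bytes after MRL (512 dims) + BQL (1 bit/dim)

print(n * float32_bytes / 1e12)       # 4.096 TB
print(n * binary_bytes / 1e9)         # 64.0 GB
print(float32_bytes // binary_bytes)  # 64x smaller
```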
Search performance gains
These compact representations aren't just easier on your storage footprint; they speed up similarity search, too. Hamming distance over packed bits is far cheaper to compute than float dot products, and smaller vectors make much better use of memory bandwidth and cache.
In other words, you're saving money while cutting latency by up to 20x, or getting the same job done with 20x less CPU.
Implementing in Vespa
Now, if you're using Vespa (and if you're not, you might want to consider it), you're in luck. Vespa supports both MRL and BQL in its native Hugging Face embedder. Here's a quick taste of what that looks like:
schema doc {
    document doc {
        field text type string {..}
    }
    field mrl_bq_embedding type tensor<int8>(x[64]) {
        indexing: input text | embed mxbai | attribute | index
        attribute {
            distance-metric: hamming
        }
    }
}
This little snippet creates a 512-bit binary embedding packed into 64 int8 values, combining MRL (truncation to 512 dimensions) and BQL (one bit per dimension). Read more in this long blog post on MRL and BQL.
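For completeness, a query against this field might look something like the sketch below. This is an illustration, not a copy-paste recipe: the exact request shape depends on your Vespa version, and it assumes a rank profile that declares a query input q of type tensor<int8>(x[64]) and reuses the mxbai embedder from the schema above:

```
{
  "yql": "select * from doc where {targetHits: 100}nearestNeighbor(mrl_bq_embedding, q)",
  "input.query(q)": "embed(mxbai, @text)",
  "text": "what is matryoshka representation learning?"
}
```

The nearestNeighbor operator, combined with the hamming distance metric on the field, performs the binary similarity search.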
Real-world impact
So, what does all this mean in the real world? By shrinking your fat float embeddings, you can slash storage and compute costs, cut search latency, and fit far more vectors on the same hardware.
You're not just saving money and improving performance; you're opening up the possibility of embedding much more of your data.
Conclusion
Look, I get it. Changing your embedding strategy might seem like a hassle. But if you ask me, the cost reduction is worth it. By leveraging techniques like MRL and BQL, you're not just trimming the fat; you're unlocking new use cases.
So go ahead and shrink those embeddings. Your systems (and your budget) will thank you. And hey, you might just be the one to show your team how to save a boatload of cash next year.