Cross-encoder for vector search re-ranking
In RAG applications, cross-encoders are used for re-ranking vector search results. How?
First, one uses vector search to get the top few hundred most relevant chunks. Then, the chunks are re-ranked with a cross-encoder. Why two steps?
Cross-encoders give high-quality ranking but are slow. Vector search gives OK ranking but is fast. It is fast because vector similarity can be computed using an index, much like an SQL database creates an index for efficient lookups. Why not index the cross-encoder?
A cross-encoder takes two strings as input, the query and a chunk, and gives a relevance score as output. Since the score depends on the query, which isn't known ahead of time, there is nothing per chunk to precompute, so an index cannot be built. OK, but how do you use it?
Run the cross-encoder on your query against all candidate chunks, then sort by the score. If you only have a few chunks and need high-quality ranking, you might not need vector search at all, just cross-encoding.
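To make the two-step pipeline concrete, here is a minimal sketch of retrieve-then-re-rank in a single script. It is only a sketch: the bi-encoder model (sentence-transformers/all-MiniLM-L6-v2), the top_k value, and the in-memory semantic_search call are assumptions standing in for a real vector database and index.
# file: rerank_pipeline.py
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Assumed models: a bi-encoder for the vector search step and the same
# cross-encoder used in the example further down for the re-ranking step.
bi_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)

def rank(query, chunks, top_k=100):
    # Step 1: vector search. Chunk embeddings depend only on the chunks, so in a
    # real system they are computed once and stored in an index; only the query
    # is embedded at request time.
    chunk_embeddings = bi_encoder.encode(chunks, convert_to_tensor=True)
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=top_k)[0]
    candidates = [hit['corpus_id'] for hit in hits]

    # Step 2: re-rank the candidates with the cross-encoder, which must see the
    # query and each chunk together and therefore cannot be precomputed.
    scores = cross_encoder.predict([(query, chunks[i]) for i in candidates])
    return sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
The expensive cross-encoder only sees the top_k candidates from the vector search, which is what keeps the overall latency manageable.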
Cross-encoder example code
Here is example code for running a cross-encoder to score your chunks: a small dev server using a pre-trained cross-encoder from Hugging Face, https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2:
# file: server.py
from flask import Flask, request, jsonify
from sentence_transformers import CrossEncoder

# Pre-trained cross-encoder; inputs longer than 512 tokens are truncated.
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)
app = Flask(__name__)

@app.route("/score", methods=['POST'])
def score():
    query = request.json['query']
    chunks = request.json['chunks']
    # Score each (query, chunk) pair, then sort chunk indices by score, best first.
    scores = model.predict([(query, chunk) for chunk in chunks]).tolist()
    index_with_score = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return jsonify(index_with_score)

app.run(port=8000)
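Assuming a standard Python environment, installing the two dependencies is enough to run it:
$ pip install flask sentence-transformers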
Start the server:
$ python server.py
Use it to rank the chunks for a query:
$ curl http://localhost:8000/score \
-H 'content-type: application/json' \
-d '{
"query": "Who was the first person to walk on the moon?",
"chunks": [
"The moon landing was watched by millions of people around the world on television. This historical event marked the success of the Apollo program and was a significant victory in the space race during the Cold War.",
"Apollo 11 was the first manned mission to land on the moon. The spacecraft carried three astronauts: Neil Armstrong, Buzz Aldrin, and Michael Collins. Neil Armstrong and Buzz Aldrin both walked on the moon’s surface, whereas Michael Collins orbited above.",
"Space exploration has evolved significantly since the early missions. Today, organizations not only from the United States and Russia but also from Europe, China, and India are actively participating in exploring outer space, with goals to land on the moon and Mars."
]
}'
Outputs:
[
[1, 8.101385116577148],
[0, -5.674482345581055],
[2, -8.991270065307617]
]
This is the correct ranking: the answer to the query appears only in the second chunk (index 1), which receives by far the highest score.
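Each entry is a pair of chunk index and score, so a client maps the indices back to its own chunk list. For illustration, a hypothetical Python client for the dev server above (the requests-based call and the truncated chunk texts are only for this example):
# file: client.py
import requests

query = "Who was the first person to walk on the moon?"
chunks = [
    "The moon landing was watched by millions of people around the world on television. ...",
    "Apollo 11 was the first manned mission to land on the moon. ...",
    "Space exploration has evolved significantly since the early missions. ...",
]

# Ask the dev server to rank the chunks for this query.
response = requests.post("http://localhost:8000/score",
                         json={"query": query, "chunks": chunks})
ranked = response.json()                 # [[chunk_index, score], ...], best first
best_index, best_score = ranked[0]
print(best_score, chunks[best_index])    # the chunk to put first in the LLM prompt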
Comments
Research Engineer @ Twin | École Polytechnique | Toastmasters
Very informative and straight to the point, thank you very much!
VP of Engineering at Sana
Great question. I'd say the LLM text generation is by far the slowest subsystem in a RAG pipeline. In the retrieval part, one should distinguish indexing from actual retrieval. Indexing happens asynchronously in the background and can be slower. For real-time retrieval, the vector search is a potential bottleneck, as is any other context augmentation such as traversing a knowledge graph. We aim to perform retrieval and compile all the context for the LLM prompt in less than 100 ms. We reach this low latency by running a high-performance vector search database and self-hosting the models we use (incl. the vector embedding model and the cross-encoder) with NVIDIA Triton Inference Server. This gives both high performance and low network latency since it then runs in the same cloud region as our other backend services.
AI Tech Lead | Driving Transformative GenAI Strategies & Industrial-Scale Solutions | Innovating with Multi-Agentic Flows & AI Agents | Enhancing Developer Experience
Re-rankers are great! It's such a simple method that can really improve recall performance.
PhD Student in Artificial Intelligence at Kungliga Tekniska högskolan
Very interesting. Is there a theoretical limit for how fast you can make the retrieval? Is it useful to change Python to Rust (similar to how tokenisation is done in HF transformers) to speed it up? Do you have some recommendations for good articles on RAG? In terms of usability, have you noticed how long is too long for retrieval? For a website, roughly 1 second would be okay; is it similar for LLM-based applications? Maybe it's quite different if you're using it in a work setting, where people are more used to waiting for things.