Cross-encoder for vector search re-ranking
In RAG applications, cross-encoders are used for re-ranking vector search results. How?
First, one uses vector search to get the top few hundred most relevant chunks. Then, the chunks are re-ranked with a cross-encoder. Why two steps?
Cross-encoders give high-quality ranking but are slow. Vector search gives OK ranking but is fast. It is fast because vector similarity can be computed using an index, much like an SQL database creates an index for efficient lookups. Why not index the cross-encoder?
A cross-encoder takes two strings as input, the query and a chunk, and gives a relevance score as output. Since the score depends on the query, which isn't known ahead of time, there is nothing per chunk to precompute, so an index cannot be built. OK, but how do you use it?
Run the cross-encoder on your query against all candidate chunks, then sort by the score. If you only have a few chunks and need high-quality ranking, you might not need vector search at all, just cross-encoding.
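To make the two-step pipeline concrete, here is a minimal sketch of retrieve-then-re-rank in a single script. It is only a sketch: the bi-encoder model (sentence-transformers/all-MiniLM-L6-v2), the top_k value, and the in-memory semantic_search call are assumptions standing in for a real vector database and index.
# file: rerank_pipeline.py
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Assumed models: a bi-encoder for the vector search step and the same
# cross-encoder used in the example further down for the re-ranking step.
bi_encoder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)

def rank(query, chunks, top_k=100):
    # Step 1: vector search. Chunk embeddings depend only on the chunks, so in a
    # real system they are computed once and stored in an index; only the query
    # is embedded at request time.
    chunk_embeddings = bi_encoder.encode(chunks, convert_to_tensor=True)
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=top_k)[0]
    candidates = [hit['corpus_id'] for hit in hits]

    # Step 2: re-rank the candidates with the cross-encoder, which must see the
    # query and each chunk together and therefore cannot be precomputed.
    scores = cross_encoder.predict([(query, chunks[i]) for i in candidates])
    return sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)
The expensive cross-encoder only sees the top_k candidates from the vector search, which is what keeps the overall latency manageable.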
Cross-encoder example code
Here is example code for running a cross-encoder to score your chunks: a small dev server using a pre-trained cross-encoder from Hugging Face, https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-12-v2:
# file: server.py
from flask import Flask, request, jsonify
from sentence_transformers import CrossEncoder

# Pre-trained cross-encoder; inputs longer than 512 tokens are truncated.
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2', max_length=512)
app = Flask(__name__)

@app.route("/score", methods=['POST'])
def score():
    query = request.json['query']
    chunks = request.json['chunks']
    # Score each (query, chunk) pair, then sort chunk indices by score, best first.
    scores = model.predict([(query, chunk) for chunk in chunks]).tolist()
    index_with_score = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return jsonify(index_with_score)

app.run(port=8000)
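Assuming a standard Python environment, installing the two dependencies is enough to run it:
$ pip install flask sentence-transformers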
Start the server:
$ python server.py
Use it to rank the chunks for a query:
$ curl http://localhost:8000/score \
-H 'content-type: application/json' \
-d '{
"query": "Who was the first person to walk on the moon?",
"chunks": [
"The moon landing was watched by millions of people around the world on television. This historical event marked the success of the Apollo program and was a significant victory in the space race during the Cold War.",
"Apollo 11 was the first manned mission to land on the moon. The spacecraft carried three astronauts: Neil Armstrong, Buzz Aldrin, and Michael Collins. Neil Armstrong and Buzz Aldrin both walked on the moon’s surface, whereas Michael Collins orbited above.",
"Space exploration has evolved significantly since the early missions. Today, organizations not only from the United States and Russia but also from Europe, China, and India are actively participating in exploring outer space, with goals to land on the moon and Mars."
]
}'
Outputs:
[
[1, 8.101385116577148],
[0, -5.674482345581055],
[2, -8.991270065307617]
]
This is the correct ranking: the answer to the query appears only in the second chunk (index 1), which receives by far the highest score.
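Each entry is a pair of chunk index and score, so a client maps the indices back to its own chunk list. For illustration, a hypothetical Python client for the dev server above (the requests-based call and the truncated chunk texts are only for this example):
# file: client.py
import requests

query = "Who was the first person to walk on the moon?"
chunks = [
    "The moon landing was watched by millions of people around the world on television. ...",
    "Apollo 11 was the first manned mission to land on the moon. ...",
    "Space exploration has evolved significantly since the early missions. ...",
]

# Ask the dev server to rank the chunks for this query.
response = requests.post("http://localhost:8000/score",
                         json={"query": query, "chunks": chunks})
ranked = response.json()                 # [[chunk_index, score], ...], best first
best_index, best_score = ranked[0]
print(best_score, chunks[best_index])    # the chunk to put first in the LLM prompt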
Comments
Research Engineer @ Twin | École Polytechnique | Toastmasters
Very informative and straight to the point, thank you very much!
VP of Engineering at Sana
Great question. I'd say the LLM text generation is by far the slowest subsystem in a RAG pipeline. In the retrieval part, one should distinguish indexing from actual retrieval. Indexing happens asynchronously in the background and can be slower. For real-time retrieval, the vector search is a potential bottleneck, as is any other context augmentation such as traversing a knowledge graph. We aim to perform retrieval and compile all the context for the LLM prompt in less than 100 ms. We reach this low latency by running a high-performance vector search database and self-hosting the models we use (incl. the vector embedding model and the cross-encoder) with NVIDIA Triton Inference Server. This gives both high performance and low network latency since it then runs in the same cloud region as our other backend services.
AI Tech Lead | Driving Transformative GenAI Strategies & Industrial-Scale Solutions | Innovating with Multi-Agentic Flows & AI Agents | Enhancing Developer Experience
Re-rankers are great! It's such a simple method that can really improve recall performance.
PhD Student in Artificial Intelligence at Kungliga Tekniska högskolan
Very interesting. Is there a theoretical limit for how fast you can make the retrieval? Is it useful to change Python to Rust (similar to how tokenisation is done in HF transformers) to speed it up? Do you have some recommendations for good articles on RAG? In terms of usability, have you noticed how long is too long for retrieval? For a website, roughly 1 second would be okay; is it similar for LLM-based applications? Maybe it's quite different if you're using it in a work setting, where people are more used to waiting for things.