Testing dense vector search at scale - Part 1: ANN Search
Vector search is here to stay. While it isn't done riding the hype wave yet, I'm sure that once more hot air has escaped, enough useful substance remains to make a lasting impact on how we build search solutions. As a team responsible for search in production, we'd better be prepared to support vector search at scale for a variety of scenarios. Public information and our own early experiments make clear that designing vector search systems is more than ever a matter of balancing quality, speed and cost. That is why we started this extensive exercise to learn what influences vector search performance, what affects relevance and how expensive it is to improve both. This article focuses on the performance component of this wider investigation.
While the market appears to be swamped with native vector search engines, it raises the question: how well does the engine we are already familiar with handle vector search, and is that good enough for our use cases? We are already using Solr for keyword search, and as of version 9 it supports ANN search through Lucene's HNSW index implementation. The prospect of being able to combine vector search with its excellent keyword search is attractive.
To load/stress test a vector search engine effectively, you must first understand some basics. The laws of nature behave differently for Approximate Nearest Neighbor (ANN) search than for retrieval from an inverted index. The latter is an exact lookup in a posting list (either the query text matches a document or it does not), whereas an ANN search attempts to discover the documents most closely related to the input vector. The Hierarchical Navigable Small World (HNSW) index is a graph of documents in which related documents are connected to each other. A search in this graph is essentially a graph traversal, hopping across nodes and collecting result candidates that are most similar to the query vector. Documents must be vectorized before they can be inserted into this graph, and queries must be vectorized before they can be searched. Both vectorizing text (also known as "embedding inference") and ANN search were tested independently. This article covers ANN search.
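To make the rest of this article easier to follow, here is a minimal sketch of what such an ANN query looks like against Solr 9's knn query parser. The host, collection name, field name and vector values below are illustrative assumptions, not our actual setup:

```python
import requests

# Minimal sketch of an ANN query against Solr 9's knn query parser.
# Host, collection and field names are illustrative assumptions.
SOLR_URL = "http://localhost:8983/solr/articles/select"

# In practice this would be the embedding of the query text,
# with as many dimensions as the indexed vectors.
query_vector = [0.12, -0.03, 0.54]

params = {
    # {!knn} traverses the HNSW graph of the given vector field and
    # returns the topK approximate nearest neighbours.
    "q": "{!knn f=article_vector topK=10}" + str(query_vector),
    "fl": "id,title,score",
}

response = requests.get(SOLR_URL, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["id"], doc["score"])
```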
Test conditions
Content set:
Sentence transformer models:
System under test (SUT):
Default vector field config:
JMeter test settings:
Segments
The first lesson learned, and an important one, was that Lucene's segmented index has a significant influence on the performance of ANN search. These segments exist for good reasons: they enable efficient incremental updates, and their immutability reduces the risk of index corruption. However, each segment holds its own HNSW graph, so a search needs as many graph traversals as there are segments, and as far as I can see they are not searched in parallel. With Solr's default configuration, the many tiny segments make ANN search performance rather underwhelming. A forced optimize merges all small segments into a single large segment, which speeds up ANN search by up to 10x in some circumstances, a difference too large to ignore. We opted to run all performance tests after a forced optimization, recognizing that this has drawbacks for content sets that require frequent updates.
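For reference, such a forced merge can be triggered through Solr's update handler; a minimal sketch (host and collection name are assumptions):

```python
import requests

# Force-merge all segments of a collection into a single segment.
# This is the "forced optimize" referred to above; host and collection
# name are illustrative assumptions.
requests.get(
    "http://localhost:8983/solr/articles/update",
    params={"optimize": "true", "maxSegments": 1, "waitSearcher": "true"},
)
```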
Lessons learned: ANN search in Solr necessitates a meticulous segment merging strategy.
Note: Elastic's recent efficiency gains in multi-graph vector search are encouraging, and hopefully Solr will follow suit.
Vector length
To test the impact of the number of dimensions per vector, we indexed the same content set five times, once without vectors and four times with vectors generated from each of the following sentence transformer models:
As one would expect, the larger the vector size the larger the index size:
Note: the avg index speed includes the throughput time of our custom import pipeline in front of the search engine.
Number of documents
ANN search with our default load test settings produces latencies close to 10 ms in most cases, even when we let the index grow to 10M documents, which is not bad at all. One index was optimized to three segments instead of one, and it performs three times slower than the fully optimized indices.
Simultaneous collections
Let's try to search multiple large collections simultaneously. We tried several combinations of collections of different sizes. This works quite well, as long as all collections being searched fit entirely into the available memory. In one situation, the combined size of the collections was 110 GB, which exceeded the available memory of 96 GB, resulting in significant latencies of more than 10 seconds.
Lessons learned: for ANN search to be effective, the entire index must fit in memory, and when it does, searching large numbers of vectors across multiple collections under load is doable with minimal latencies.
K value
The k parameter is specified at query time and indicates how many nearest neighbors ANN search should return. However, be aware that k not only determines the (maximum) number of results, but it also influences relevance!
This is something not always understood when coming from a keyword search world: k ≠ rows. The larger k is, the more likely the true nearest neighbors will appear at the top of the results list. Remember that ANN search is attempting to find the closest documents, and the chances of success grow with every additional document it compares against the query.
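A sketch to illustrate the difference: topK drives the graph traversal (and therefore relevance and latency), while rows only limits how many of those hits end up in the response page. Names and values are illustrative assumptions:

```python
import requests

query_vector = [0.12, -0.03, 0.54]  # illustrative query embedding

params = {
    # Search wide: collect the 100 approximate nearest neighbours...
    "q": "{!knn f=article_vector topK=100}" + str(query_vector),
    # ...but only return the first 10 of them in this page.
    "rows": 10,
    "fl": "id,score",
}
requests.get("http://localhost:8983/solr/articles/select", params=params)
```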
The k value is expected to have a direct link with latency, simply because the engine needs to find more results. This is confirmed by the tests:
Lessons learned: while a higher k value improves the precision of top results, latencies increase too, so this requires tuning for the sweet spot.
Similarity function
We are getting more into the weeds now. Solr supports three ways of comparing vectors with each other:
Which similarity function works best depends on the use case, on how the embedding model was trained and on the vector size: Euclidean distance is not a suitable metric in high-dimensional spaces (see the curse of dimensionality). You cannot always use the raw output of an embedding inference with every similarity function; sometimes conversion is needed and sometimes conversion isn't even possible. Understand that a vector has two components: direction and magnitude. Direction is the "preference" / "style" / "sentiment" / "latent variable" of the vector, while its magnitude is how strongly it points in that direction.
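For context, in Solr the similarity function is declared on the dense vector field type at index time; a minimal sketch via the Schema API (field type name and dimension are assumptions):

```python
import requests

# Define a dense vector field type; similarityFunction accepts
# "euclidean", "dot_product" or "cosine". Names and the dimension
# are illustrative assumptions.
field_type = {
    "add-field-type": {
        "name": "knn_vector_768",
        "class": "solr.DenseVectorField",
        "vectorDimension": 768,
        "similarityFunction": "cosine",
    }
}
requests.post("http://localhost:8983/solr/articles/schema", json=field_type)
```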
For our speed & throughput testing, we are (for now) ignoring the relevancy aspect and using every similarity function with each model. Given the complexity of the calculation, we would expect dot_product to performs best, followed by euclidean, and cosine to perform worst. Per similarity function the tests are repeated for increasing k values to evaluate if k factors in some way.
Lessons learned: as predicted, dot_product is the fastest similarity function in all circumstances. However, there is an important difference between the euclidean and cosine functions: whereas cosine is the slower method for lower k values, euclidean clearly performs worse for higher k values. At this time, I don't have an explanation for this finding, but please share if you do.
Vector encoding
Lucene's HNSW supports two methods of encoding vector elements: FLOAT32 and BYTE (as int8). With the default FLOAT32 type, 4 bytes are allotted for each dimension of a vector, whereas a BYTE takes only 1 byte. Converting a FLOAT32 vector into a BYTE variant is possible via a process called scalar quantization (maintaining the vector size, but mapping each FLOAT32 value to an integer between -128 and 127). Quantization is primarily used to reduce the memory footprint and accelerate the search process in high-dimensional vector spaces. The trade-off for these speed & cost gains is a reduction in search quality. How large this precision loss is has yet to be assessed; according to other articles it is related to the number of dimensions.
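In Solr this choice is also made on the field type; a sketch assuming the vectorEncoding attribute of DenseVectorField (names are illustrative):

```python
import requests

# Same field type as before, but storing each dimension as a single byte.
# vectorEncoding accepts "FLOAT32" (default) or "BYTE"; names are assumptions.
byte_field_type = {
    "add-field-type": {
        "name": "knn_vector_768_byte",
        "class": "solr.DenseVectorField",
        "vectorDimension": 768,
        "similarityFunction": "cosine",
        "vectorEncoding": "BYTE",
    }
}
requests.post("http://localhost:8983/solr/articles/schema", json=byte_field_type)
```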
All of the models used in this test produce FLOAT32 vectors, so how do we convert these into BYTE vectors? The answer depends on how we want to compare vectors with each other.
If the similarity function is euclidean, the relative distances must survive the conversion (as much as possible). To scale FLOAT32 vectors to integers between -128 and 127, you must first know the maximum and minimum values across all vectors in the index. This means that all documents must be vectorized before we can quantize them relative to one another.
If, on the other hand, cosine is used, vectors must keep their angles (as much as possible). Scalar quantizing while maintaining the angle can be done vector by vector and doesn't require prior knowledge of all vectors in the data set. For our test we chose a technique that preserves the vector's angle.
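A minimal sketch of one such angle-preserving conversion, assuming a single fixed scale factor applied to every element (the factor itself is an illustrative assumption, not the exact value used in our tests):

```python
# Angle-preserving scalar quantization: every element of a vector is
# multiplied by the same fixed factor, so the direction of the vector
# is unchanged. The factor below is an illustrative assumption.
SCALE = 400.0

def quantize(vector: list[float]) -> list[int] | None:
    quantized = [round(v * SCALE) for v in vector]
    # Vectors with elements outside the int8 range are rejected.
    if any(q < -128 or q > 127 for q in quantized):
        return None
    return quantized

# Example with a (tiny) illustrative embedding
print(quantize([0.12, -0.03, 0.05]))  # -> [48, -12, 20]
```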
This is a brute-force way of quantizing, and for 0.001% of these conversions we got invalid vectors (with values lower than -128 or higher than 127); for this test that loss was regarded as acceptable.
The impact quantization has on index size is clearly noticeable, especially for indices of larger vectors.
We do not observe a 4x reduction in index size because each node in the HNSW graph contains more than just its vector values (such as data about its connections to other nodes), and our content contains other metadata besides the embeddings. However, the memory savings are still significant.
Also, on the query side we see performance gains, which become even more obvious when raising the k value.
Another way to increase the stress is to add more simultaneous users querying the SUT. As expected, more users querying in parallel cause latencies to go up, but this happens less quickly for BYTE encoding than for FLOAT32 encoding.
Lessons learned: encoding vector values as BYTE instead of FLOAT32 is an excellent way to lower the index size, lower the query latencies and allow a heavier query load.
Filtering
Filtering makes it evident that ANN search differs from inverted index search. Filtering in an HNSW index can significantly degrade performance. As the engine scans each layer of the HNSW graph, it generates a list of potential candidates. This task is harder when nodes (document candidates) are filtered out of the graph, so more traversal is needed to build a candidate list of nearest neighbors. The narrower the scope of a filter, the more computation an ANN search needs (and the higher the latencies).
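As an illustration, a filtered ANN query in Solr combines the knn query parser with a regular filter query; in recent Solr 9 releases the fq acts as a pre-filter on the graph traversal. Names and values below are assumptions:

```python
import requests

query_vector = [0.12, -0.03, 0.54]  # illustrative query embedding

params = {
    "q": "{!knn f=article_vector topK=10}" + str(query_vector),
    # The filter restricts which documents (graph nodes) are eligible
    # as nearest-neighbour candidates.
    "fq": "category:(news OR blog)",
    "fl": "id,score",
}
requests.get("http://localhost:8983/solr/articles/select", params=params)
```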
Lucene has, however, implemented an optimization for very narrow filters, see LUCENE-10382. With this logic the engine falls back to brute-force (exact) kNN search when the filter ratio falls below a specific threshold. Let us find out how this looks in practice.
Indeed, we see a clear trend: wide filters have little impact on latencies, while latencies increase the narrower a filter gets. This can even grow to problematic levels, as we can see in the test with the second model of 1024 dimensions. When the filter narrows down to a few percent of the content set, we witness large spikes in latencies. When the filter is even narrower, the search performs better again, most likely because of LUCENE-10382.
But wait, what's that? A test with a filter that scopes to 36,021 results shows slow responses, while another test with a different but equally narrow filter (36,003 results) performs about 7x faster. How is that possible? We don't know for sure, but here is a hypothesis:
In these filtered test cases we applied filters on various metadata. The test with slow response times had a filter with only one term, while the next test had an equally narrow filter with multiple OR-ed terms. This was not a one-time coincidence; other tests also made clear that 1-term filters perform worse than multi-term filters with an equal number of results. It appears that 1-term filters make the HNSW graph less navigable, perhaps because the documents in scope are more related to each other (appearing closer together in the graph) than multi-term filtered results. In other words: the blind spot in the graph could be bigger with a 1-term filter.
Lessons learned: broad filters have a small negative impact on latency, while narrower filters increase latencies more noticeably. Very narrow filters are quick again (because graph traversal is bypassed). Furthermore, the HNSW traversal speed is determined not only by the narrowness of a filter, but also by how the documents that remain after filtering are spread: concentrated in one region of the graph or distributed throughout it.
What could we do about slow responses for filtered queries? A lower k value certainly helps, though this lowers relevancy and is not always practical for other reasons. Perhaps the next set of tests could provide a better solution.
hnswMaxConnections
hnswMaxConnections (also known as Mmax, or just M) is an index-time setting. When building an HNSW graph, it controls how many of the nearest neighbor candidates are connected to each newly inserted node. The expectation is that increasing hnswMaxConnections links each node to more neighbors and makes the graph denser, which may speed up search times (for filtered queries) at the cost of longer index build times and higher memory usage.
We varied hnswMaxConnections between 16 (default), 32, 64 and 128. The effect on index size is, remarkably, zero. One would expect denser graphs to be bigger in volume, but we found no evidence for that assumption. Does Lucene reserve a fixed number of bytes for the connections per node, or is the hnswMaxConnections parameter ignored?
On the query side, we also don't notice a clear correlation between hnswMaxConnections and latencies. It may have helped to reduce the latencies for the problematic filter on the index with vectors from model large B. But other than that, larger hnswMaxConnections values are in general no guarantee of lower response times.
It is worth noting that every time an HNSW graph is built from scratch from the same content set and parameters, a unique structure is still created. There is randomness involved in determining which layer each node appears in. Maybe this randomness explains the outcome of these tests more than the effect of hnswMaxConnections does.
Lessons learned: a higher hnswMaxConnections is not a universal performance booster but it seems to help in some filter scenarios. Though, we are not 100% sure whether this parameter works as intended.
hnswBeamWidth
Another hyperparameter is hnswBeamWidth (also known as ef_construction), which defines the size of the candidate list of nodes a new node can connect to during the index building process. A higher hnswBeamWidth value allows the algorithm to consider more candidates when building the HNSW graph, potentially improving the index quality. However, it also slows down the index building process, as more candidates mean more distance calculations.
With this theory in mind, we do not expect hnswBeamWidth to impact the index sizes or query latencies. It is expected to be a trade-off between index speed and relevance.
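Both hnswBeamWidth and hnswMaxConnections from the previous section are set on the dense vector field type; a sketch via the Schema API (names and values are illustrative assumptions, not recommendations):

```python
import requests

# Dense vector field type with explicit HNSW construction parameters.
# Values and names are illustrative assumptions.
field_type = {
    "add-field-type": {
        "name": "knn_vector_768_tuned",
        "class": "solr.DenseVectorField",
        "vectorDimension": 768,
        "similarityFunction": "cosine",
        "knnAlgorithm": "hnsw",
        "hnswMaxConnections": 64,  # M: max connections per node
        "hnswBeamWidth": 200,      # ef_construction: candidate list size
    }
}
requests.post("http://localhost:8983/solr/articles/schema", json=field_type)
```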
We varied hnswBeamWidth between 50, 100 (default) and 200. The index sizes are indeed unaffected by hnswBeamWidth, although the index speed does vary. These variations, however, do not entirely meet expectations. Why does indexing take longer with hnswBeamWidth=50 than with hnswBeamWidth=100?
At query side there are no surprises:
Lessons learned: as expected, hnswBeamWidth has no impact on query latency.
Not yet tested
Still on the TODO list are performance tests for:
Conclusion
This article zooms in on the test results for the many aspects that influence the index size and query speed of ANN search in a Lucene HNSW graph. The tests were carried out under production-like conditions and provided useful insights for balancing within the cost-speed-quality triangle.
Note that ANN search is only half of the query performance story. The other half involves transforming (query) text into vectors, also known as embedding inference. This is something I'll cover in a future article called Testing dense vector search at scale - Part 2: Embedding inference.