To shard or not to shard your vector database
Whether to shard your vector database depends on the throughput and latency your application needs. Throughput is the rate at which data is processed, whereas latency is the time it takes to index a piece of data or serve a single query.
You can build a similarity index using a single machine or using multiple machines (i.e., sharding). Index building needs CPU, memory and storage. If you want a single gigantic index for all your data, get a large machine with plenty of CPU, memory and storage, and build one index. This is called vertical scaling. The alternative is to partition your data into N shards and let each of N machines process one shard of your data. This is horizontal scaling.
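As a rough illustration of horizontal scaling, the sketch below (plain Python/NumPy, with made-up sizes and a modulo standing in for a real random hash) splits a set of embeddings into N roughly equal shards, one per machine:

```python
# Rough illustration of doc-sharding: hash every document id to one of
# N shards so each shard holds roughly 1/N of the vectors. The sizes and
# the modulo "hash" below are made up for illustration.
import numpy as np

num_shards = 4
rng = np.random.default_rng(0)
doc_ids = np.arange(100_000)                              # hypothetical document ids
vectors = rng.random((100_000, 128), dtype=np.float32)    # hypothetical embeddings

shard_of = doc_ids % num_shards                           # stand-in for a random hash
shards = [vectors[shard_of == s] for s in range(num_shards)]
print([len(s) for s in shards])                           # roughly equal partitions
```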
In the following sections, I am specifically referring to the FAISS-IVF implementation of a similarity index.
Building a similarity search index
FAISS-IVF index building has two phases: a coarse phase that generates the cluster boundaries (centroids), and a fine-grained phase that inserts each vector into its nearest cluster. The CPU-hours needed to generate the coarse index are very small compared to the CPU-hours needed to index every document into its relevant cluster. The total CPU needed to build the fine-grained index depends on the number of documents in the dataset; it does not depend on the number of clusters being created, nor on the number of shards in the database. Since the fine-grained phase dominates the total compute needed for index building, sharding has a very small impact on the total CPU-seconds needed to build a similarity index.
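A minimal FAISS-IVF build sketch makes the two phases visible; the dimension, cluster count (nlist) and dataset size below are illustrative values, not recommendations:

```python
# Minimal FAISS-IVF build sketch; dimension, nlist and dataset size are made up.
import faiss
import numpy as np

d, nlist = 128, 1024
rng = np.random.default_rng(0)
xb = rng.random((200_000, d), dtype=np.float32)   # the full corpus

# Coarse phase: k-means over a sample of the data learns the nlist centroids
# (cluster boundaries). This is cheap relative to the fine-grained phase.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb[:50_000])

# Fine-grained phase: every vector is assigned to its nearest centroid and
# appended to that cluster's inverted list. The cost grows with the number of
# documents, so splitting the corpus across N shards splits this work N ways
# without changing the total CPU spent.
index.add(xb)
print(index.ntotal)   # 200000 vectors indexed
```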
Querying a similarity search index
What about the CPU needed to serve a search query in a sharded vector database versus a non-sharded one? Most search systems are doc-sharded (with a random hash), which means that a query fans out to all the shards of the database, and a top-level aggregator assembles the results from all the shards and returns the final result to the caller. Let's ignore the communication overhead between the aggregator and the shards, and instead focus on the CPU a query consumes in a sharded system versus a non-sharded system. Generally, we assume that the number of clusters created in a non-sharded system is the same as the number of clusters created in EACH of the shards in a sharded system (because the system is doc-sharded).
In a non-sharded system, a FAISS similarity search first finds the centroids closest to the query embedding and then scans the vectors in those clusters. In a sharded system, the same query triggers N similarity searches in parallel, one similarity search in each shard. This essentially means that the total CPU-seconds needed to serve one specific query in a sharded system is N times larger than the CPU-seconds needed in a non-sharded system. Is it fair to say that vector search is more CPU-efficient if your database is non-sharded, but that you get lower query latency if it is sharded?
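The fan-out and aggregation can be sketched as follows. The parameters are illustrative, and the "shards" here are just separate in-process indexes rather than separate machines, which is enough to show the per-shard searches and the merge step:

```python
# Fan-out query sketch: one IVF index per shard, with the same nlist and nprobe
# in each shard as in the non-sharded case. All sizes are illustrative.
import heapq
import faiss
import numpy as np

d, nlist, nprobe, k, num_shards = 128, 1024, 8, 10, 4
rng = np.random.default_rng(0)
corpus = rng.random((200_000, d), dtype=np.float32)
shard_data = np.array_split(corpus, num_shards)

shard_indexes = []
for part in shard_data:
    quantizer = faiss.IndexFlatL2(d)
    ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
    ivf.train(part)                 # each shard learns its own centroids
    ivf.add(part)
    ivf.nprobe = nprobe
    shard_indexes.append(ivf)

query = rng.random((1, d), dtype=np.float32)

# The query fans out to every shard; each shard runs a full similarity search
# and returns its own top-k. The aggregator merges the N partial lists into a
# single global top-k.
candidates = []
for shard_id, ivf in enumerate(shard_indexes):
    D, I = ivf.search(query, k)
    for dist, local_id in zip(D[0], I[0]):
        if local_id != -1:
            candidates.append((float(dist), shard_id, int(local_id)))

top_k = heapq.nsmallest(k, candidates)   # smallest L2 distance wins
print(top_k)
```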
Recall in a similarity search index
What about the accuracy of the similarity search? In a sharded system, the cluster boundaries in each shard can differ because each shard is built independently of the others. This can affect recall when compared to a non-sharded system. Does anybody have practical experience to validate or refute this hypothesis?
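I have not measured this at scale, but here is a sketch of how one could test it: build ground truth with an exact index, then compare recall@k for a single global IVF index against N independently trained per-shard indexes whose results are merged. All parameters are illustrative, and random vectors will not behave like real embeddings, so treat the output as a smoke test only:

```python
# Recall experiment sketch: exact search provides ground truth; the same queries
# are then answered by (a) one global IVF index and (b) N per-shard IVF indexes
# trained independently and merged. Sizes and parameters are illustrative.
import faiss
import numpy as np

d, nlist, nprobe, k, num_shards = 64, 256, 8, 10, 4
rng = np.random.default_rng(0)
xb = rng.random((100_000, d), dtype=np.float32)
xq = rng.random((100, d), dtype=np.float32)

# Ground truth from an exact (brute-force) index.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

def recall(found, truth):
    # Fraction of true top-k neighbors recovered, averaged over all queries.
    hits = sum(len(set(f) & set(t)) for f, t in zip(found, truth))
    return hits / truth.size

# (a) Non-sharded: one set of cluster boundaries over all the data.
quantizer = faiss.IndexFlatL2(d)
global_ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
global_ivf.train(xb)
global_ivf.add(xb)
global_ivf.nprobe = nprobe
_, global_ids = global_ivf.search(xq, k)

# (b) Sharded: each shard trains its own centroids on 1/N of the data, and the
# per-shard top-k lists are merged by distance.
parts = np.array_split(np.arange(len(xb)), num_shards)
merged = [[] for _ in range(len(xq))]
for ids in parts:
    shard_quantizer = faiss.IndexFlatL2(d)
    ivf = faiss.IndexIVFFlat(shard_quantizer, d, nlist)
    ivf.train(xb[ids])
    ivf.add(xb[ids])
    ivf.nprobe = nprobe
    D, I = ivf.search(xq, k)
    for q in range(len(xq)):
        merged[q].extend(
            (float(D[q][j]), int(ids[I[q][j]])) for j in range(k) if I[q][j] != -1
        )
sharded_ids = [[doc for _, doc in sorted(cand)[:k]] for cand in merged]

print("recall@10 non-sharded:", recall(global_ids, gt))
print("recall@10 sharded    :", recall(sharded_ids, gt))
```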
Get answers to these questions and more
If you want to hear how engineers at Netflix, DoorDash, Uber are approaching the question of sharding a vector database, register for Index. It’s the free community conference for engineers building search, analytics and AI applications at scale. Join virtually or in-person on May 16th at the Computer History Museum in Mountain View, CA: https://rockset.com/index-conf/
Data and Distributed Systems Engineer, Tech Leader
7 个月"This essentially means that the total cpu-seconds needed to serve one specific query in a sharded system is N times larger than the cpu-seconds needed in a non-sharded system." I am not following this assertion Dhruba. In a sharded system won't the CPU consumption on each shard be 1/N because the list of documents to be compared is 1/N on each shard?