Building A Search Engine from Scratch - Part 2
In the previous post, we discussed how to build our toy search engine: https://shorturl.at/u1Sq0
In this post, we will focus on how to scale it up even further. To help visualize the problem, let's assume we have 5 million posts we want to search over, where each post has a post ID and a content section. A natural idea is to partition the data into smaller chunks.
Assume we decide to have 5 partitions, each with 2 replicas. Each partition will get its own directory in HDFS.
The ingestion pipeline:
An offline batch job (Spark/MapReduce) processes the original posts and builds an index for each shard under the following HDFS directory:
hdfs://{search_tenant}/inverted_index/baseindex/{Index_version}/partitions/{partition_id}
Sharding Strategy:
We will use post_id % 5 to determine which shard each post belongs to. This means:
We will create an inverted index for each shard with a list of terms and their corresponding post IDs.
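As a concrete sketch, here is what the modulo routing and per-shard index build might look like in plain Python. The real job would run as Spark/MapReduce over HDFS; the whitespace tokenizer and the tiny `posts` dict below are simplifications for illustration only.

```python
from collections import defaultdict

NUM_SHARDS = 5

def shard_of(post_id: int) -> int:
    """Route a post to a shard with the modulo rule from the text."""
    return post_id % NUM_SHARDS

def build_shard_indexes(posts: dict[int, str]) -> list[dict[str, list[int]]]:
    """Build one inverted index per shard: term -> sorted list of post ids."""
    indexes = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for post_id, content in posts.items():
        for term in content.lower().split():  # naive whitespace tokenizer
            indexes[shard_of(post_id)][term].add(post_id)
    return [{t: sorted(ids) for t, ids in idx.items()} for idx in indexes]

# A few sample posts, echoing the example output below.
posts = {
    0: "full-text search with elasticsearch",
    5: "serverless functions for event-driven architectures",
    1: "ai-driven search results and recommendations",
}
indexes = build_shard_indexes(posts)
print(indexes[0]["search"])  # -> [0]  (posts 0 and 5 both land in shard 0)
```

In production each shard's index would be written out as part files under its own HDFS partition directory instead of kept in memory.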
Shard 0 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard0/)
Contains terms from posts with post_id % 5 == 0 (Post 0, 5, 10, 15, ...).
Example output (part-00000 in HDFS):
full-text -> [0]
search -> [0, 10, 15, 20, 25]
with -> [0, 15, 30]
elasticsearch -> [0, 10, 25]
serverless -> [5]
functions -> [5]
for -> [5, 10, 20, 25, 40]
event-driven -> [5]
architectures -> [5]
is -> [10, 25]
a -> [10, 25]
great -> [10, 25]
tool -> [10, 25]
building -> [15, 40]
intelligent -> [15]
systems -> [15]
ai -> [15]
natural -> [20]
language -> [20]
processing -> [20, 30]
and -> [20, 35]
analytics -> [20]
introduction -> [30]
to -> [30]
big -> [30]
data -> [30, 45]
hadoop -> [30]
cloud -> [35]
services -> [35]
cloud-native -> [35]
development -> [35]
best -> [40]
practices -> [40]
rest -> [40]
apis -> [40]
lakes -> [45]
vs -> [45]
warehouses -> [45]
key -> [45]
differences -> [45]
Shard 1 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard1/)
Contains terms from posts with post_id % 5 == 1 (Post 1, 6, 11, 16, ...).
Example output (part-00001 in HDFS):
ai-driven -> [1]
search -> [1, 16, 26, 46]
results -> [1]
and -> [1, 21]
recommendations -> [1]
serverless -> [6]
functions -> [6]
for -> [6, 11, 26, 31]
event-driven -> [6]
architectures -> [6]
content -> [11]
delivery -> [11]
networks -> [11]
cdns -> [11]
faster -> [11, 31]
data -> [11, 26, 41]
access -> [11]
engine -> [16]
optimization -> [16]
seo -> [16]
techniques -> [16]
privacy -> [21]
security -> [21]
in -> [21]
modern -> [21]
systems -> [21]
elasticsearch -> [26]
is -> [26]
a -> [26]
great -> [26]
tool -> [26]
optimizing -> [31]
databases -> [31]
query -> [31]
response -> [31]
times -> [31]
scaling -> [36]
web -> [36]
applications -> [36, 46]
to -> [36]
millions -> [36]
of -> [36, 41]
users -> [36]
science -> [41]
the -> [41]
future -> [41]
technology -> [41]
building -> [46]
real-time -> [46]
Shard 2 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard2/)
Contains terms from posts with post_id % 5 == 2 (Post 2, 7, 12, 17, ...).
Example output (part-00002 in HDFS):
building -> [2, 37, 47]
scalable -> [2, 37]
systems -> [2, 22]
with -> [2, 37]
distributed -> [2]
databases -> [2, 7, 37]
optimizing -> [7]
for -> [7, 32]
faster -> [7, 32]
query -> [7]
response -> [7]
times -> [7]
scaling -> [12]
web -> [12]
applications -> [12, 47]
to -> [12]
millions -> [12]
of -> [12]
users -> [12]
understanding -> [17, 42]
search -> [17, 42, 47]
relevance -> [17, 42]
and -> [17, 22, 27, 42]
ranking -> [17, 42]
data -> [22, 32, 37]
privacy -> [22]
security -> [22]
in -> [22]
modern -> [22]
graph -> [27]
their -> [27]
use -> [27]
cases -> [27]
content -> [32]
delivery -> [32]
networks -> [32]
cdns -> [32]
access -> [32]
architectures -> [37]
nosql -> [37]
real-time -> [47]
Shard 3 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard3/)
Contains terms from posts with post_id % 5 == 3 (Post 3, 8, 13, 18, ...).
Example output (part-00003 in HDFS):
learn -> [3]
elasticsearch -> [3]
with -> [3, 18]
this -> [3]
guide -> [3]
artificial -> [8]
intelligence -> [8]
and -> [8, 13, 23, 28, 48]
machine -> [8, 38]
learning -> [8, 38]
go -> [8]
hand -> [8]
in -> [8]
cloud -> [13]
services -> [13]
cloud-native -> [13]
development -> [13]
building -> [18]
scalable -> [18]
data -> [18, 28, 33, 43]
architectures -> [18]
nosql -> [18]
databases -> [18]
ai-driven -> [23]
search -> [23, 48]
results -> [23]
recommendations -> [23]
big -> [28]
technologies -> [28]
their -> [28, 48]
impact -> [28]
on -> [28]
analytics -> [28]
science -> [33]
is -> [33]
the -> [33]
future -> [33]
of -> [33]
technology -> [33]
introduction -> [38]
to -> [38]
algorithms -> [38, 48]
content -> [43]
delivery -> [43]
networks -> [43]
cdns -> [43]
for -> [43]
faster -> [43]
access -> [43]
real-world -> [48]
applications -> [48]
Shard 4 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard4/)
Contains terms from posts with post_id % 5 == 4 (Post 4, 9, 14, 19, ...).
Example output (part-00004 in HDFS):
data -> [4, 14, 39]
privacy -> [4]
and -> [4]
security -> [4]
in -> [4]
modern -> [4]
systems -> [4, 24, 44]
scaling -> [9]
web -> [9]
applications -> [9]
to -> [9]
millions -> [9]
of -> [9]
users -> [9]
real-time -> [14]
processing -> [14]
with -> [14, 19, 24, 34, 44]
apache -> [14]
spark -> [14]
optimizing -> [19]
database -> [19]
queries -> [19]
indexing -> [19]
strategies -> [19]
building -> [24, 34, 44]
intelligent -> [24, 44]
search -> [24, 29, 44, 49]
ai -> [24, 44]
user -> [29, 49]
behavior -> [29, 49]
analysis -> [29, 49]
for -> [29, 39, 49]
better -> [29, 49]
experiences -> [29, 49]
apis -> [34]
that -> [34]
scale -> [34]
your -> [34]
application -> [34]
engineering -> [39]
pipelines -> [39]
analytics -> [39]
platforms -> [39]
Load data into Searchers:
After the ingestion pipeline finishes, we should have a base index for each shard in its corresponding HDFS directory. Next, we need to spin up our fleet of Searcher machines.
When a Searcher machine starts, it will:
1) Try to register itself as a host in the Searcher cluster on ZooKeeper. Remember, in our setup we have 5 shards and each shard has 2 replicas
2) Load its shard's data into memory from HDFS
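The two startup steps could be sketched like this. Note the assumptions: an in-memory `registry` dict stands in for ZooKeeper's ephemeral nodes (a real searcher would use a ZooKeeper client such as kazoo), and `load_shard` returns a tiny hard-coded index instead of reading part files from HDFS.

```python
NUM_SHARDS = 5

# Stand-in for ZooKeeper: shard id -> list of registered replica hosts.
registry: dict[int, list[str]] = {s: [] for s in range(NUM_SHARDS)}

def register(host: str, shard_id: int) -> None:
    """Step 1: announce this host as a replica for its shard."""
    registry[shard_id].append(host)

def load_shard(shard_id: int) -> dict[str, list[int]]:
    """Step 2: load the shard's base index into memory.

    A real Searcher would read the part files from this shard's HDFS
    directory; here we return a tiny hypothetical index instead.
    """
    sample = {0: {"search": [0, 10], "serverless": [5]}}
    return sample.get(shard_id, {})

# Spin up the fleet: 5 shards x 2 replicas, as in our setup.
for shard_id in range(NUM_SHARDS):
    for replica in range(2):
        register(f"searcher-{shard_id}-{replica}", shard_id)

index0 = load_shard(0)
print(len(registry[0]))  # -> 2 replicas registered for shard 0
```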
Serve Live Traffic:
We will build another component called the Broker that takes live search traffic. Depending on whether the shard key (the post ID in our case) is provided, we have 2 routing choices:
If no shard key is provided, the Broker has no choice but to fan out the search request to all Searchers, since it does not know which shard the query belongs to. Each Searcher executes a dict.get() and returns a list of post IDs,
and the Broker then merges them into one final doc ID list
If the post ID is provided, we can reuse the hash function to figure out the shard. The Broker then asks ZooKeeper for the list of hosts in that shard; since we have 2 replicas here, it just picks a random host and forwards the search traffic there. The Searcher on that host executes the same dict.get() and returns the doc IDs
This may sound like an odd case because we happen to use the post ID as the doc key. In a real-world system, you would typically query on a field in your inverted index, and the doc ID would just be a UUID auto-generated by the system
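Both routing choices can be sketched together. The `shard_indexes` and `replicas` tables below are made-up stand-ins for the Searchers' in-memory indexes and the host lists the Broker would fetch from ZooKeeper.

```python
import random

NUM_SHARDS = 5

# Hypothetical per-shard inverted indexes (term -> post ids).
shard_indexes = {
    0: {"search": [0, 10, 15], "serverless": [5]},
    1: {"search": [1, 16], "seo": [16]},
    2: {"search": [17, 47]},
    3: {"search": [23, 48]},
    4: {"search": [24, 29]},
}
# Hypothetical replica lists the Broker would get from ZooKeeper.
replicas = {s: [f"searcher-{s}-0", f"searcher-{s}-1"] for s in range(NUM_SHARDS)}

def search_on(shard_id, term):
    """What a single Searcher does: a dict lookup on its in-memory index."""
    return shard_indexes[shard_id].get(term, [])

def broker_search(term, post_id=None):
    if post_id is None:
        # Choice 1: no shard key -> fan out to every shard and merge.
        merged = []
        for shard_id in range(NUM_SHARDS):
            merged.extend(search_on(shard_id, term))
        return sorted(merged)
    # Choice 2: shard key given -> reuse the hash, pick a random replica.
    shard_id = post_id % NUM_SHARDS
    host = random.choice(replicas[shard_id])  # forward the request here
    return search_on(shard_id, term)

print(broker_search("search"))      # -> [0, 1, 10, 15, 16, 17, 23, 24, 29, 47, 48]
print(broker_search("search", 16))  # -> [1, 16]  (only shard 1 is queried)
```

Picking a random replica is the simplest load-balancing policy; a real Broker might use round-robin or health-aware routing instead.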
In our previous post, we also discussed that the search result is not very user friendly since it's just a raw doc ID list. What we can do is add an extra step in the Broker called materialization, where we populate the actual content we want to include in the SERP. This may involve calling external data stores to fetch information that isn't in the inverted index.
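A minimal sketch of the materialization step, assuming a hypothetical `post_store` lookup in place of the external storage the Broker would actually call:

```python
# Stand-in for external storage (e.g. a key-value store keyed by doc id).
post_store = {
    0: {"title": "Full-text search with Elasticsearch", "author": "alice"},
    10: {"title": "Elasticsearch is a great tool", "author": "bob"},
}

def materialize(doc_ids):
    """Turn a raw doc id list into SERP-ready entries by fetching the
    fields we want to display; ids with no stored content are skipped."""
    results = []
    for doc_id in doc_ids:
        post = post_store.get(doc_id)
        if post is not None:
            results.append({"id": doc_id, **post})
    return results

serp = materialize([0, 10, 99])  # 99 has no stored content and is dropped
print(len(serp))  # -> 2
```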
Last but not least:
Wow, we have made huge progress so far! We partitioned our inverted index so it can handle 5 million posts, added a Broker layer to orchestrate the Searchers, and materialized our search results to be more user friendly.
But there are still many improvements we can make to our search engine. For example, the current system relies on an offline job to build the index, which means the latest posts on LinkedIn will not be reflected immediately: you won't be able to search for them until we rebuild the base index. We also haven't talked about ranking, which plays an important role in the search ecosystem. Stay tuned, and I may write another post on these topics!
If you have read this far, I hope you enjoyed my post. Please share any feedback or suggestions you may have!
Cheers
Yijie