Building A Search Engine from Scratch - Part 2
In the previous post, we discussed how to build our toy search engine: https://shorturl.at/u1Sq0
In this post, we will focus on how to scale it up even further. To help visualize the problem, let's assume we have 5 million posts we want to search over, where each post has a post ID and a content section. A natural idea is to partition the data into smaller chunks.
Assume we decide to have 5 partitions, each with 2 replicas. Each partition will get its own directory in HDFS.
The ingestion pipeline:
An offline batch job (Spark/MapReduce) processes the original posts and builds an index for each shard under the following HDFS directory:
hdfs://{search_tenant}/inverted_index/baseindex/{Index_version}/partitions/{partition_id}
Sharding Strategy:
We will use post_id % 5 to determine which shard each post belongs to. This means:
We will create an inverted index for each shard with a list of terms and their corresponding post IDs.
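As a concrete sketch, here is what the modulo routing and per-shard index build might look like in plain Python. The real job would run as Spark/MapReduce over HDFS; the whitespace tokenizer and the tiny `posts` dict below are simplifications for illustration only.

```python
from collections import defaultdict

NUM_SHARDS = 5

def shard_of(post_id: int) -> int:
    """Route a post to a shard with the modulo rule from the text."""
    return post_id % NUM_SHARDS

def build_shard_indexes(posts: dict[int, str]) -> list[dict[str, list[int]]]:
    """Build one inverted index per shard: term -> sorted list of post ids."""
    indexes = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for post_id, content in posts.items():
        for term in content.lower().split():  # naive whitespace tokenizer
            indexes[shard_of(post_id)][term].add(post_id)
    return [{t: sorted(ids) for t, ids in idx.items()} for idx in indexes]

# A few sample posts, echoing the example output below.
posts = {
    0: "full-text search with elasticsearch",
    5: "serverless functions for event-driven architectures",
    1: "ai-driven search results and recommendations",
}
indexes = build_shard_indexes(posts)
print(indexes[0]["search"])  # -> [0]  (posts 0 and 5 both land in shard 0)
```

In production each shard's index would be written out as part files under its own HDFS partition directory instead of kept in memory.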
Shard 0 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard0/)
Contains terms from posts with post_id % 5 == 0 (Post 0, 5, 10, 15, ...).
Example output (part-00000 in HDFS):
full-text -> [0]
search -> [0, 10, 15, 20, 25]
with -> [0, 15, 30]
elasticsearch -> [0, 10, 25]
serverless -> [5]
functions -> [5]
for -> [5, 10, 20, 25, 40]
event-driven -> [5]
architectures -> [5]
is -> [10, 25]
a -> [10, 25]
great -> [10, 25]
tool -> [10, 25]
building -> [15, 40]
intelligent -> [15]
systems -> [15]
ai -> [15]
natural -> [20]
language -> [20]
processing -> [20, 30]
and -> [20, 35]
analytics -> [20]
introduction -> [30]
to -> [30]
big -> [30]
data -> [30, 45]
hadoop -> [30]
cloud -> [35]
services -> [35]
cloud-native -> [35]
development -> [35]
best -> [40]
practices -> [40]
rest -> [40]
apis -> [40]
lakes -> [45]
vs -> [45]
warehouses -> [45]
key -> [45]
differences -> [45]
Shard 1 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard1/)
Contains terms from posts with post_id % 5 == 1 (Post 1, 6, 11, 16, ...).
Example output (part-00001 in HDFS):
ai-driven -> [1]
search -> [1, 16, 26, 46]
results -> [1]
and -> [1, 21]
recommendations -> [1]
serverless -> [6]
functions -> [6]
for -> [6, 11, 26, 31]
event-driven -> [6]
architectures -> [6]
content -> [11]
delivery -> [11]
networks -> [11]
cdns -> [11]
faster -> [11, 31]
data -> [11, 26, 41]
access -> [11]
engine -> [16]
optimization -> [16]
seo -> [16]
techniques -> [16]
privacy -> [21]
security -> [21]
in -> [21]
modern -> [21]
systems -> [21]
elasticsearch -> [26]
is -> [26]
a -> [26]
great -> [26]
tool -> [26]
optimizing -> [31]
databases -> [31]
query -> [31]
response -> [31]
times -> [31]
scaling -> [36]
web -> [36]
applications -> [36, 46]
to -> [36]
millions -> [36]
of -> [36, 41]
users -> [36]
science -> [41]
the -> [41]
future -> [41]
technology -> [41]
building -> [46]
real-time -> [46]
Shard 2 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard2/)
Contains terms from posts with post_id % 5 == 2 (Post 2, 7, 12, 17, ...).
Example output (part-00002 in HDFS):
building -> [2, 37, 47]
scalable -> [2, 37]
systems -> [2, 22]
with -> [2, 37]
distributed -> [2]
databases -> [2, 7, 37]
optimizing -> [7]
for -> [7, 32]
faster -> [7, 32]
query -> [7]
response -> [7]
times -> [7]
scaling -> [12]
web -> [12]
applications -> [12, 47]
to -> [12]
millions -> [12]
of -> [12]
users -> [12]
understanding -> [17, 42]
search -> [17, 42, 47]
relevance -> [17, 42]
and -> [17, 22, 27, 42]
ranking -> [17, 42]
data -> [22, 32, 37]
privacy -> [22]
security -> [22]
in -> [22]
modern -> [22]
graph -> [27]
their -> [27]
use -> [27]
cases -> [27]
content -> [32]
delivery -> [32]
networks -> [32]
cdns -> [32]
access -> [32]
architectures -> [37]
nosql -> [37]
real-time -> [47]
Shard 3 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard3/)
Contains terms from posts with post_id % 5 == 3 (Post 3, 8, 13, 18, ...).
Example output (part-00003 in HDFS):
learn -> [3]
elasticsearch -> [3]
with -> [3, 18]
this -> [3]
guide -> [3]
artificial -> [8]
intelligence -> [8]
and -> [8, 13, 23, 28, 48]
machine -> [8, 38]
learning -> [8, 38]
go -> [8]
hand -> [8]
in -> [8]
cloud -> [13]
services -> [13]
cloud-native -> [13]
development -> [13]
building -> [18]
scalable -> [18]
data -> [18, 28, 33, 43]
architectures -> [18]
nosql -> [18]
databases -> [18]
ai-driven -> [23]
search -> [23, 48]
results -> [23]
recommendations -> [23]
big -> [28]
technologies -> [28]
their -> [28, 48]
impact -> [28]
on -> [28]
analytics -> [28]
science -> [33]
is -> [33]
the -> [33]
future -> [33]
of -> [33]
technology -> [33]
introduction -> [38]
to -> [38]
algorithms -> [38, 48]
content -> [43]
delivery -> [43]
networks -> [43]
cdns -> [43]
for -> [43]
faster -> [43]
access -> [43]
real-world -> [48]
applications -> [48]
Shard 4 (hdfs://post/inverted_index/baseindex/20250215/partitions/shard4/)
Contains terms from posts with post_id % 5 == 4 (Post 4, 9, 14, 19, ...).
Example output (part-00004 in HDFS):
data -> [4, 14, 39]
privacy -> [4]
and -> [4]
security -> [4]
in -> [4]
modern -> [4]
systems -> [4, 24, 44]
scaling -> [9]
web -> [9]
applications -> [9]
to -> [9]
millions -> [9]
of -> [9]
users -> [9]
real-time -> [14]
processing -> [14]
with -> [14, 19, 24, 34, 44]
apache -> [14]
spark -> [14]
optimizing -> [19]
database -> [19]
queries -> [19]
indexing -> [19]
strategies -> [19]
building -> [24, 34, 44]
intelligent -> [24, 44]
search -> [24, 29, 44, 49]
ai -> [24, 44]
user -> [29, 49]
behavior -> [29, 49]
analysis -> [29, 49]
for -> [29, 39, 49]
better -> [29, 49]
experiences -> [29, 49]
apis -> [34]
that -> [34]
scale -> [34]
your -> [34]
application -> [34]
engineering -> [39]
pipelines -> [39]
analytics -> [39]
platforms -> [39]
Load data into Searchers:
After the ingestion pipeline finishes, we should have a base index for each shard in its corresponding HDFS directory. Next, we need to spin up our fleet of Searcher machines.
When a Searcher machine starts, it will:
1) Try to register itself as a host in the Searcher cluster on ZooKeeper. Remember, in our setup we have 5 shards and each shard has 2 replicas
2) Load its shard's data into memory from HDFS
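The two startup steps could be sketched like this. Note the assumptions: an in-memory `registry` dict stands in for ZooKeeper's ephemeral nodes (a real searcher would use a ZooKeeper client such as kazoo), and `load_shard` returns a tiny hard-coded index instead of reading part files from HDFS.

```python
NUM_SHARDS = 5

# Stand-in for ZooKeeper: shard id -> list of registered replica hosts.
registry: dict[int, list[str]] = {s: [] for s in range(NUM_SHARDS)}

def register(host: str, shard_id: int) -> None:
    """Step 1: announce this host as a replica for its shard."""
    registry[shard_id].append(host)

def load_shard(shard_id: int) -> dict[str, list[int]]:
    """Step 2: load the shard's base index into memory.

    A real Searcher would read the part files from this shard's HDFS
    directory; here we return a tiny hypothetical index instead.
    """
    sample = {0: {"search": [0, 10], "serverless": [5]}}
    return sample.get(shard_id, {})

# Spin up the fleet: 5 shards x 2 replicas, as in our setup.
for shard_id in range(NUM_SHARDS):
    for replica in range(2):
        register(f"searcher-{shard_id}-{replica}", shard_id)

index0 = load_shard(0)
print(len(registry[0]))  # -> 2 replicas registered for shard 0
```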
Serve Live Traffic:
We will build another component called the Broker that takes live search traffic. Depending on whether the shard key (the post ID in our case) is provided, we have 2 routing choices:
If no shard key is provided, the Broker has no choice but to fan out the search request to all Searchers, since it does not know which shard the query belongs to. Each Searcher executes a dict.get() and returns a list of post IDs,
and the Broker then merges them into one final doc ID list
If the post ID is provided, we can reuse the hash function to figure out the shard. The Broker then asks ZooKeeper for the list of hosts in that shard; since we have 2 replicas here, it just picks a random host and forwards the search traffic there. The Searcher on that host executes the same dict.get() and returns the doc IDs
This may sound like an odd case because we happen to use the post ID as the doc key. In a real-world system, you would typically query on a field in your inverted index, and the doc ID would just be a UUID auto-generated by the system
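Both routing choices can be sketched together. The `shard_indexes` and `replicas` tables below are made-up stand-ins for the Searchers' in-memory indexes and the host lists the Broker would fetch from ZooKeeper.

```python
import random

NUM_SHARDS = 5

# Hypothetical per-shard inverted indexes (term -> post ids).
shard_indexes = {
    0: {"search": [0, 10, 15], "serverless": [5]},
    1: {"search": [1, 16], "seo": [16]},
    2: {"search": [17, 47]},
    3: {"search": [23, 48]},
    4: {"search": [24, 29]},
}
# Hypothetical replica lists the Broker would get from ZooKeeper.
replicas = {s: [f"searcher-{s}-0", f"searcher-{s}-1"] for s in range(NUM_SHARDS)}

def search_on(shard_id, term):
    """What a single Searcher does: a dict lookup on its in-memory index."""
    return shard_indexes[shard_id].get(term, [])

def broker_search(term, post_id=None):
    if post_id is None:
        # Choice 1: no shard key -> fan out to every shard and merge.
        merged = []
        for shard_id in range(NUM_SHARDS):
            merged.extend(search_on(shard_id, term))
        return sorted(merged)
    # Choice 2: shard key given -> reuse the hash, pick a random replica.
    shard_id = post_id % NUM_SHARDS
    host = random.choice(replicas[shard_id])  # forward the request here
    return search_on(shard_id, term)

print(broker_search("search"))      # -> [0, 1, 10, 15, 16, 17, 23, 24, 29, 47, 48]
print(broker_search("search", 16))  # -> [1, 16]  (only shard 1 is queried)
```

Picking a random replica is the simplest load-balancing policy; a real Broker might use round-robin or health-aware routing instead.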
In our previous post, we also discussed that the search result is not very user friendly since it's just a raw doc ID list. What we can do is add an extra step in the Broker called materialization, where we populate the actual content we want to include in the SERP. This may involve calling external data stores to fetch information that isn't in the inverted index.
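A minimal sketch of the materialization step, assuming a hypothetical `post_store` lookup in place of the external storage the Broker would actually call:

```python
# Stand-in for external storage (e.g. a key-value store keyed by doc id).
post_store = {
    0: {"title": "Full-text search with Elasticsearch", "author": "alice"},
    10: {"title": "Elasticsearch is a great tool", "author": "bob"},
}

def materialize(doc_ids):
    """Turn a raw doc id list into SERP-ready entries by fetching the
    fields we want to display; ids with no stored content are skipped."""
    results = []
    for doc_id in doc_ids:
        post = post_store.get(doc_id)
        if post is not None:
            results.append({"id": doc_id, **post})
    return results

serp = materialize([0, 10, 99])  # 99 has no stored content and is dropped
print(len(serp))  # -> 2
```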
Last but not least:
Wow, we have made huge progress so far! We partitioned our inverted index so it can handle 5 million posts, added a Broker layer to orchestrate the Searchers, and materialized our search results to be more user friendly.
But there are still many improvements we can make to our search engine. For example, the current system relies on an offline job to build the index, which means the latest posts on LinkedIn will not be reflected immediately: you won't be able to search for them until we rebuild the base index. We also haven't talked about ranking, which plays an important role in the search ecosystem. Stay tuned, and I may write another post on these topics!
If you have read this far, I hope you enjoyed my post. Please share any feedback or suggestions you may have!
Cheers
Yijie