How Zomato improved its search using NLP
Search is one of the most interesting problems to work on, and Zomato has made its search understand natural language. Here's a quick gist of how the system works.
A simple search engine that just does a weighted search on name and description is easy to game. For example: "Best Coffee Cafe" would rank restaurants having the word "best" in their names higher.
But the actual intent of the user is to get the list of the best coffee cafes near their current location.
Handling such queries requires natural language understanding. On Zomato, search queries can be classified into 3 categories.
Training the model
We train a Neural Network on domain data that helps us understand the different entities present in a search query; for this, we leverage
Word2Vec
Word2Vec helps in generating word embeddings, i.e., vector representations of words such that the values in a vector capture the word's meaning as learned from the corpus.
Documents are tokenized and passed as inputs to Word2Vec. So a restaurant name "Domino's Pizza" should be passed as tokens "Dominos" and "Pizza". But how do we tokenize?
Byte-pair Encoding
Tokenizing the document on simple spaces won't work well because, in the food domain, some words appear together more frequently than others, e.g. "Cheese Pizza".
To train Word2Vec better, we would prefer "Cheese Pizza" to be treated as one token instead of "Cheese" and "Pizza" as two, because in the end, these will be our entities.
This requires corpus-aware tokenization, and for it we leverage an algorithm called Byte-pair Encoding (BPE). It is a really simple, data-driven algorithm that does a great job of tokenizing text as per the corpus.
The algorithm works by repeatedly merging the most frequent adjacent subtokens into new, amalgamated tokens.
For example, BPE enables us to tokenize "Friedrice" as "Fried" and "Rice" which would not be possible if we just split by space.
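The merge loop described above can be sketched in a few lines of Python. This is a toy, word-level illustration (real BPE usually starts from characters); the corpus is made up:

```python
from collections import Counter

def learn_bpe_merges(sequences, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent pair."""
    vocab = Counter(tuple(seq) for seq in sequences)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by sequence frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the winning pair fused into one token.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# "cheese pizza" co-occurs most often, so it becomes the first merge.
names = [["cheese", "pizza"]] * 5 + [["fried", "rice"]] * 3 + [["garlic", "bread"]]
merges = learn_bpe_merges(names, 2)
print(merges)  # [('cheese', 'pizza'), ('fried', 'rice')]
```

Because the merges are learned from frequency in the corpus, frequent food phrases naturally become single vocabulary entries.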
Sequence Tagging
The tokens extracted using Byte-pair Encoding form the vocabulary for generating word embeddings, which are in turn used to train a Bidirectional LSTM network to recognize named entities.
With this network, we could process the text "Jack's Aaloo Tikki Burger" and get each token tagged with the entity it belongs to.
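A common way to represent such per-token output is the BIO tagging scheme; the tags below are hypothetical model output, and this sketch only shows the post-processing step of grouping tags into entities, not the BiLSTM itself:

```python
def group_entities(tokens, tags):
    """Collapse token-level BIO tags into (entity_type, text) spans."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # beginning of a new entity
            if current:
                entities.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)        # continuation of the same entity
        else:                               # "O" or an inconsistent tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = ["Jack's", "Aaloo", "Tikki", "Burger"]
tags = ["B-RESTAURANT", "B-DISH", "I-DISH", "I-DISH"]  # hypothetical output
print(group_entities(tokens, tags))
# [('RESTAURANT', "Jack's"), ('DISH', 'Aaloo Tikki Burger')]
```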
Architecture
The data about Restaurants, Food, and Locations is ingested to train the model. The model is loaded in a lightweight API server and served through an API Gateway.
Upon receiving a search request, the Search service makes a call to this API, which responds with the extracted entities: Dish, Restaurant, and Intent.
The information is then used to formulate an Elasticsearch Query to get the search results. These results are then streamed back to the user and rendered on their applications.
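To make this concrete, here's a sketch of how extracted entities might be assembled into an Elasticsearch bool query; the index field names and the 5 km radius are assumptions, not Zomato's actual schema:

```python
def build_search_query(entities, lat, lon):
    """Assemble an Elasticsearch bool query from extracted entities (sketch)."""
    must = []
    if "DISH" in entities:
        must.append({"match": {"dishes.name": entities["DISH"]}})
    if "RESTAURANT" in entities:
        must.append({"match": {"restaurant.name": entities["RESTAURANT"]}})
    return {
        "query": {
            "bool": {
                "must": must,
                # Restrict results to restaurants near the user's location.
                "filter": [
                    {"geo_distance": {
                        "distance": "5km",
                        "location": {"lat": lat, "lon": lon},
                    }}
                ],
            }
        }
    }

# e.g. the query "best coffee cafe" reduces to a dish match plus a geo filter.
q = build_search_query({"DISH": "coffee"}, 12.97, 77.59)
```

Structuring the query this way is what defeats the keyword-stuffing problem from earlier: "best" never reaches the match clauses, only the recognized entities do.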
Here's the video of me explaining this in depth; do check it out.
Thank you so much for reading. If you found this helpful, do spread the word about it on social media; it would mean the world to me.
If you liked this short essay, you might also like my courses and playlists on
I teach an interactive course on System Design where you'll learn how to intuitively design scalable systems. The course will help you
I have compressed my ~10 years of work experience into this course, and aim to accelerate your engineering growth 100x. To date, the course is trusted by 1000+ engineers from 11 different countries and here you can find what they say about the course.
Together, we will dissect and build some amazing systems and understand the intricate details. You can find the week-by-week curriculum and topics, testimonials, and other information at https://arpitbhayani.me/masterclass.