On Semantic Search
Pratik Bhavsar
It took me a long time to realise that search is the biggest problem in NLP. Just look at Google, Amazon and Bing. These are multi-billion dollar businesses made possible only by their powerful search engines.
My initial thoughts on search were centered around unsupervised ML, but then I participated in the Microsoft Hackathon 2018 for Bing and learned the various ways a search engine can be built with deep learning.
Classical search engines
The process of search can be broken down into 4 steps:
- Query autocompletion: suggest queries based on the first characters typed
- Query filtering: token removal, stemming and lowercasing
- Query augmentation: adding synonyms and acronym contraction/expansion
- Document scoring: compute Score(document | query) with a scoring function, most commonly BM25
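As a rough illustration of the filtering and scoring steps, here is a minimal sketch in Python. It is not how Lucene implements them: the stopword list, the helper names `filter_query` and `bm25_scores`, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen for the example, and stemming is left out.

```python
import math
from collections import Counter

# Illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "on", "of", "for", "to", "in"}

def filter_query(text):
    """Lowercase, split on whitespace and drop stopwords (no stemming)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenised document against the query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no token match, no contribution
            # Non-negative IDF variant (log of the ratio plus one).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

titles = ["Deep Learning", "Neural networks and backpropagation", "The art of cooking"]
docs = [filter_query(t) for t in titles]
for title, score in zip(titles, bm25_scores(filter_query("the Backpropagation"), docs)):
    print(f"{score:.3f}  {title}")
```

Only documents that share at least one token with the query receive a non-zero score, which is the property the rest of this post is concerned with.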
Rather than spend time explaining these steps, I will move on to the shortcomings of a classical search engine such as Lucene, one of the most popular search engines.
Problem 1: Token matching
Imagine I am interested in finding the best book on Backpropagation. As per the user reviews, Deep Learning by Ian Goodfellow et al. is considered to be the best on the topic and others surrounding it. But there is a complete mismatch of words between the Query: Backpropagation and Document title: Deep learning.
These are the results on amazon.com. The Deep Learning book isn’t there!
If I search for deep learning, however, I get the book at the top.
This is the problem of hard token matching.
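The mismatch can be reproduced in a few lines. This is a toy sketch, assuming whitespace tokenisation and lowercasing only; the helper names `tokens` and `token_overlap` are made up for the example and do not reflect how Amazon's search works.

```python
def tokens(text):
    """Lowercase and split on whitespace, returning the set of tokens."""
    return set(text.lower().split())

def token_overlap(query, title):
    """Tokens shared by the query and the document title."""
    return tokens(query) & tokens(title)

title = "Deep Learning"

# No shared tokens, so a pure token-matching engine cannot retrieve the book.
print(token_overlap("Backpropagation", title))

# Both tokens are shared, so the book is retrieved.
print(token_overlap("deep learning", title))
```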
Problem 2: Contextualisation
The example above works with the query deep learning. But what if I prefer books with practical examples over pure theory? This brings us to the topic of contextual search. In that case, these books would have been perfect for me, wouldn’t they?
And why the hell do I see books on NLP (Neuro-linguistic programming) when I search for NLP? Contextual search can solve this: if the search engine knew that I buy books on computer science, it would have shown me books on natural language processing instead.
And I get these when I search for GAN. Again, this is a lack of personalisation.
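One simple way to picture contextual search is to re-rank results using the shopper's purchase history. The sketch below is purely hypothetical: the `rerank` function, the `boost` weight, the categories and the book titles are all invented for illustration, and real personalisation systems are far more involved.

```python
from collections import Counter

def rerank(results, purchase_history, boost=1.0):
    """Re-rank (title, category, base_score) tuples by user context.

    Each result's score is increased in proportion to the share of the
    user's past purchases that fall in the same category.
    """
    counts = Counter(purchase_history)
    total = sum(counts.values()) or 1  # avoid division by zero

    def personalised(item):
        _title, category, score = item
        return score + boost * counts[category] / total

    return sorted(results, key=personalised, reverse=True)

# Hypothetical results for the query "NLP" with made-up base scores.
results = [
    ("Neuro-linguistic programming handbook", "self-help", 1.0),
    ("Natural language processing handbook", "computer science", 0.9),
]
history = ["computer science", "computer science", "cooking"]

for title, category, _ in rerank(results, history):
    print(title, "|", category)
```

With this history the natural language processing title moves to the top, even though its base score is lower.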
Problem 3: Query misunderstanding
Query: x’s influence on y
First scholarly article result: a’s influence on x
That is, rather than finding Bernhard’s influence on academic, the first paper is about Herbart’s influence on Bernhard.
Because a token-matching engine disregards the order of words, it can return the wrong results.
Google’s similar-query suggestion does better, though!
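To see why word order gets lost, consider a bag-of-words representation, which is roughly what token matching works with. This sketch uses the placeholder names x and y and shows an extreme case where the roles are fully swapped; the helpers `normalise`, `bag_of_words` and `bigrams` are assumptions for the example.

```python
from collections import Counter

def normalise(text):
    """Lowercase, strip the possessive 's and split on whitespace."""
    return text.lower().replace("'s", "").split()

def bag_of_words(text):
    """Token counts with all ordering information discarded."""
    return Counter(normalise(text))

def bigrams(text):
    """Adjacent token pairs, which preserve local word order."""
    toks = normalise(text)
    return list(zip(toks, toks[1:]))

query = "x's influence on y"
document = "y's influence on x"

# The bags are identical, so token matching cannot tell the two apart.
print(bag_of_words(query) == bag_of_words(document))

# The bigrams differ, so an order-aware representation can.
print(bigrams(query) == bigrams(document))
```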
Problem 4: Image search
Last but not least, the only way we can search for images by text is by generating metadata for all the images, with descriptions or tags, which is practically impossible.
What is the effect on our metrics?
All of these problems affect our metrics adversely.
Hard token match → LESS RECALL → Bad user experience and fewer sales
Absence of context → LESS PRECISION → Bad user experience and fewer sales
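For reference, precision is the share of retrieved results that are relevant, and recall is the share of relevant items that were retrieved. A minimal sketch, with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: four relevant books exist, the engine returns two
# results, and only one of them is relevant.
relevant = ["book_a", "book_b", "book_c", "book_d"]
retrieved = ["book_a", "book_x"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Hard token matching shrinks the set of hits, which lowers recall; missing context lets irrelevant items such as `book_x` into the results, which lowers precision.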
Deep learning for search
Now that you understand the problems that come with plain token matching, we can discuss how to do search using deep learning. My thoughts are based on the book Deep Learning for Search by Tommaso Teofili.
You can read the rest of the article in my Medium post.