
On Semantic Search

Pratik Bhavsar


Published: December 2, 2019

It took me a long time to realise that search is the biggest problem in NLP. Just look at Google, Amazon and Bing. These are multi-billion-dollar businesses made possible only by their powerful search engines.

My initial thoughts on search were centered around unsupervised ML, but then I participated in the Microsoft Hackathon 2018 for Bing and learned about the various ways a search engine can be built with deep learning.

Classical search engines

The process of search can be broken down into 4 steps:

Query autocompletion: suggest a query based on the first characters typed
Query filtering: token removal, stemming and lowercasing
Query augmentation: adding synonyms and acronym contraction/expansion
Document scoring: Score(document | query) according to a scoring function, most commonly BM25
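The last three steps can be sketched in a few lines of Python. This is a minimal illustration and not how Lucene implements them: the stop-word list and the synonym table are made-up placeholders, stemming is left out for brevity, and the BM25 parameters k1 and b are set to commonly used defaults.

```python
import math
import re
from collections import Counter

# Illustrative placeholders, not taken from any real engine.
STOPWORDS = {"a", "an", "the", "on", "of", "for", "in", "to", "and"}
SYNONYMS = {"nlp": ["natural", "language", "processing"]}


def filter_query(text):
    """Query filtering: lowercase, tokenize, drop stop words (no stemming here)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def augment_query(tokens):
    """Query augmentation: append synonyms / acronym expansions."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, []))
    return expanded


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Document scoring: one BM25 score per document for the given query."""
    tokenized = [filter_query(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(d) for d in tokenized) / n_docs) or 1.0
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Introduction to information retrieval",
    "Search engines: information retrieval in practice",
    "Gardening basics",
]
query = augment_query(filter_query("information retrieval"))
print(bm25_scores(query, docs))
```

Only documents that share at least one token with the query receive a non-zero score, which is exactly the behaviour the problems below are about.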

Rather than spending time explaining these steps, I will move on to the shortcomings of a classical search engine, such as one built on Lucene, which is probably the most widely used open-source search library.

Problem 1: Token matching

Imagine I am interested in finding the best book on Backpropagation. As per the user reviews, Deep Learning by Ian Goodfellow et al. is considered to be the best on the topic and others surrounding it. But there is a complete mismatch of words between the Query: Backpropagation and Document title: Deep learning.

These are the results of amazon.com. The deep learning book isn’t there!

Although if I search for deep learning, I get the book at the top.

This is the problem of hard token matching.
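A tiny sketch makes the mismatch concrete. It assumes a deliberately naive matcher that only counts shared tokens between the query and the title; the function names are made up for illustration.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(query, title):
    """Number of tokens the query and the title have in common."""
    return len(tokens(query) & tokens(title))


title = "Deep Learning"
print(token_overlap("Backpropagation", title))  # 0: no shared token, so no match
print(token_overlap("deep learning", title))    # 2: both tokens match
```

However relevant the book is, a matcher like this has nothing to score when the query and the title share no tokens.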

Problem 2: Contextualisation

The above example works with the query deep learning. But what if I prefer books with practical examples rather than pure theory? This brings us to the topic of contextual search. In that case, these books would have been perfect for me, wouldn’t they?

And why the hell do I see books on NLP (neuro-linguistic programming) when I search for NLP? Contextual search can solve this: if the search engine knew that I buy books on computer science, it would have shown me books on natural language processing instead.

And I get these when I search for GAN. Again, an issue of missing personalisation.
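One simple way to picture contextual search is to re-rank the token-match results with a signal from the user's history. The sketch below is an assumption for illustration only: the result fields, the purchase-history format and the boosting formula are all hypothetical, and real engines use far richer signals.

```python
def rerank(results, purchase_history, weight=1.0):
    """Boost each result by the share of the user's purchases in its category.

    results: list of dicts with "title", "category" and a base "score".
    purchase_history: dict mapping category -> number of past purchases.
    """
    total = sum(purchase_history.values()) or 1  # avoid division by zero

    def adjusted(result):
        affinity = purchase_history.get(result["category"], 0) / total
        return result["score"] * (1 + weight * affinity)

    return sorted(results, key=adjusted, reverse=True)


results = [
    {"title": "NLP: neuro-linguistic programming", "category": "self-help", "score": 1.0},
    {"title": "NLP: natural language processing", "category": "computer science", "score": 0.9},
]
history = {"computer science": 9, "cooking": 1}

for r in rerank(results, history):
    print(r["title"])
```

With an empty history the original order is kept; with a history dominated by computer science purchases, the natural language processing book moves to the top.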

Problem 3: Query misunderstanding

Query: x’s influence on y

First scholarly article result: a’s influence on x

That is, rather than finding Bernhard’s influence on academic, the first paper is about Herbart’s influence on Bernhard.

Because a token-matching engine disregards the order of words, it can return wrong results.

Google’s similar-query suggestion does better, though!
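To see why word order gets lost, compare the bag-of-words view of a query with that of its reversed meaning. The two example phrases are illustrative, and the tokenizer is a naive assumption.

```python
import re
from collections import Counter


def word_list(text):
    """Lowercase the text and split it into word tokens, keeping their order."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Token counts only: the order of the words is thrown away."""
    return Counter(word_list(text))


def bigrams(text):
    """Adjacent token pairs: a cheap way to retain some word order."""
    words = word_list(text)
    return list(zip(words, words[1:]))


query = "Bernhard's influence on Herbart"
paper = "Herbart's influence on Bernhard"

print(bag_of_words(query) == bag_of_words(paper))  # True: identical bags
print(bigrams(query) == bigrams(paper))            # False: the order differs
```

A scorer that only sees the bags cannot tell the two phrases apart, even though they mean opposite things.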

Problem 4: Image search

Last but not least, the only way a token-matching engine can search for images by text is if metadata (descriptions or tags) has been generated for every image, which is practically impossible.

What is the effect on our metric?

All of these problems affect our metrics adversely.

Hard token match → LOWER RECALL → Bad user experience and fewer sales

Absence of context → LOWER PRECISION → Bad user experience and fewer sales
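As a reminder of what the two metrics measure, here is a minimal sketch that computes them for a single query. The item identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result page: one relevant book found, one missed,
# and three irrelevant items shown to the user.
relevant = {"deep_learning_book", "neural_networks_book"}
retrieved = ["neural_networks_book", "self_help_nlp", "fashion_gan", "unrelated_item"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.25 0.5
```

The missed book lowers recall (the hard token match problem), while the irrelevant items lower precision (the missing context problem).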

Deep learning for search

Now that you have understood the problems associated with plain token matching, we can discuss how to do search using deep learning. My thoughts are based on the book Deep Learning for Search by Tommaso Teofili.

You can read the rest of the article in my Medium post.

