On Semantic Search

It took me a long time to realise that search is the biggest problem in NLP. Just look at Google, Amazon and Bing. These are multi-billion dollar businesses that are possible only because of their powerful search engines.

My initial thoughts on search were centered around unsupervised ML, but after participating in the Microsoft Hackathon 2018 for Bing, I came to know the various ways a search engine can be built with deep learning.


Classical search engines

The process of search can be broken down into four steps:

  1. Query autocompletion: suggest queries based on the first characters typed
  2. Query filtering: token removal, stemming and lowercasing
  3. Query augmentation: adding synonyms and contracting/expanding acronyms
  4. Document scoring: compute Score(document | query) with a scoring function, most commonly BM25
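
Steps 2 to 4 can be sketched in a few lines of Python. This is only a minimal illustration, not how Lucene implements them: the stop-word list, the synonym map and the book titles are made up for the example, stemming is left out, and the BM25 parameters `k1` and `b` are set to commonly used defaults.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and synonym map, for illustration only.
STOP_WORDS = {"a", "an", "the", "on", "of", "for", "in"}
SYNONYMS = {"nlp": ["natural", "language", "processing"]}


def filter_query(text):
    """Step 2: lowercase, tokenise and drop stop words (no stemming here)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def augment_query(tokens):
    """Step 3: append synonyms / acronym expansions from the toy map."""
    extra = [s for t in tokens for s in SYNONYMS.get(t, [])]
    return tokens + extra


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Step 4: score each tokenised document against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue  # hard token matching: no shared token, no score
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up titles standing in for documents.
titles = [
    "Deep Learning",
    "Natural Language Processing in Action",
    "The Art of Cooking",
]
docs = [filter_query(t) for t in titles]
query = augment_query(filter_query("NLP"))
scores = bm25_scores(query, docs)
for title, score in zip(titles, scores):
    print(f"{score:.3f}  {title}")
```

Only the second title gets a non-zero score, and only because the synonym map expanded the acronym. Without that expansion, the raw query token would match nothing.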

Rather than spend time explaining these steps, I will go straight to the shortcomings of a classical search engine such as one built on Lucene, one of the most popular search libraries.

Problem 1: Token matching

Imagine I am interested in finding the best book on backpropagation. According to user reviews, Deep Learning by Ian Goodfellow et al. is considered the best on this topic and the ones surrounding it. But there is a complete mismatch of words between the query (Backpropagation) and the document title (Deep Learning).

These are the results from amazon.com for that query. The Deep Learning book isn’t there!

If I search for deep learning, however, I get the book at the top.

This is the problem of hard token matching.
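
The mismatch is easy to reproduce. The sketch below assumes, for simplicity, that the engine matches the query against the title only. Real engines index more fields, but the failure mode is the same whenever none of them contains the query token.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(query, title):
    """Return the query tokens that also occur in the title."""
    return tokens(query) & tokens(title)


title = "Deep Learning"
miss = token_overlap("Backpropagation", title)
hit = token_overlap("deep learning", title)

print(miss)  # empty set: no shared token, so the book cannot be retrieved
print(hit)   # both query tokens are found in the title
```

However relevant the book is to backpropagation, an empty overlap means a purely token-based scorer has nothing to score.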

Problem 2: Contextualisation

The above example works with the query deep learning. But what if I prefer books with practical examples to pure theory? This brings us to the topic of contextual search. In that case, these books would have been perfect for me, wouldn’t they?

And why the hell do I see books on NLP (neuro-linguistic programming) when I search for NLP? Contextual search can solve this: if the search engine knew that I buy books on computer science, it would have shown me books on natural language processing instead.

And I get these when I search for GAN. Again, this is a lack of personalisation.
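
One simple way to picture contextual search is a re-ranking step that boosts results from categories the user has bought from before. The sketch below is an assumption of mine, not how Amazon ranks results: the titles, categories, base scores and the `boost` weight are all hypothetical.

```python
from collections import Counter

# Hypothetical results for the query "NLP": (title, category, text-match score).
RESULTS = [
    ("Neuro-linguistic programming handbook", "self-help", 1.0),
    ("Natural language processing handbook", "computer science", 1.0),
]


def rerank(results, purchase_history, boost=0.5):
    """Add a bonus proportional to the user's share of purchases per category."""
    counts = Counter(purchase_history)
    total = sum(counts.values()) or 1  # avoid dividing by zero for new users

    def personalised(item):
        _title, category, base_score = item
        return base_score + boost * counts[category] / total

    # sorted() is stable, so ties keep their original order.
    return sorted(results, key=personalised, reverse=True)


history = ["computer science", "computer science", "fiction"]
ranked = rerank(RESULTS, history)
anonymous = rerank(RESULTS, [])

print(ranked[0][0])     # the computer science title moves to the top
print(anonymous[0][0])  # without history the original order is kept
```

Both titles match the token equally well, so only the user's context can separate them.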

Problem 3: Query misunderstanding

Query: x’s influence on y

First scholarly article result: a’s influence on x

That is, rather than a paper on Bernhard’s influence on academic, the first result is about Herbart’s influence on Bernhard.

Because a token-matching engine disregards the order of words, it can return wrong results.
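
To see why word order gets lost, compare a bag-of-words view of a query with the same query with its roles swapped. This is a deliberately extreme toy case (the result above only shares some of its tokens with the query), and the bigram comparison is just one simple order-aware alternative I picked for contrast.

```python
import re
from collections import Counter


def words(text):
    """Lowercase a string and split it into a list of word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Token counts only: all information about word order is discarded."""
    return Counter(words(text))


def bigrams(text):
    """Adjacent token pairs, which keep a little of the word order."""
    w = words(text)
    return list(zip(w, w[1:]))


query = "x's influence on y"
swapped = "y's influence on x"

same_bag = bag_of_words(query) == bag_of_words(swapped)
same_bigrams = bigrams(query) == bigrams(swapped)

print(same_bag)      # True: a token matcher cannot tell the two apart
print(same_bigrams)  # False: order-aware features can
```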

Google’s similar-query suggestion does better, though!

Problem 4: Image search

Last but not least, with token matching the only way to search for images by text is to generate metadata for every image, in the form of descriptions or tags, which is practically impossible for a large collection.
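
A tag-based image search boils down to an inverted index from tags to images. In the sketch below the file names and tags are hypothetical. The point is that an image nobody tagged can never be returned, whatever it shows.

```python
from collections import defaultdict

# Hypothetical image collection: file name -> hand-written tags.
IMAGE_TAGS = {
    "img_001.jpg": ["dog", "beach"],
    "img_002.jpg": ["cat", "sofa"],
    "img_003.jpg": [],  # also a photo of a dog, but nobody tagged it
}


def build_index(image_tags):
    """Map each lowercased tag to the set of images that carry it."""
    index = defaultdict(set)
    for name, tags in image_tags.items():
        for tag in tags:
            index[tag.lower()].add(name)
    return index


def search(index, query):
    """Return the images whose tags contain every word of the query."""
    query_words = query.lower().split()
    if not query_words:
        return set()
    hits = [index.get(word, set()) for word in query_words]
    return set.intersection(*hits)


index = build_index(IMAGE_TAGS)
dog_images = search(index, "dog")
print(dog_images)  # only the tagged dog photo is found
```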


What is the effect on our metric?

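Because of these problems, our metrics suffer: relevant documents that use different words are never retrieved, and irrelevant ones crowd the results when context is ignored.

Hard token match → LOWER RECALL → Bad user experience and fewer sales

Absence of context → LOWER PRECISION → Bad user experience and fewer sales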
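
As a reminder of what the two metrics measure, here is a small worked example. The document ids and result lists are invented purely to show each failure mode in isolation.

```python
def precision_recall(retrieved, relevant):
    """Precision = share of retrieved docs that are relevant;
    recall = share of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: four documents are relevant to the user.
relevant = {"d1", "d2", "d3", "d4"}

# Hard token match: only the one document sharing a token is returned.
hard_match = ["d1"]
# No context: everything matching the token is returned, relevant or not.
no_context = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

print(precision_recall(hard_match, relevant))  # (1.0, 0.25): recall suffers
print(precision_recall(no_context, relevant))  # (0.5, 1.0): precision suffers
```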


Deep learning for search

Now that you understand the problems with plain token matching, we can discuss how to do search using deep learning. My thoughts are based on the book Deep Learning for Search by Tommaso Teofili.

You can read the rest of the article in my Medium post.

By Pratik Bhavsar