Natural Language Data Search

Remember how tedious search was a decade ago? Today you can search and ask questions in any search engine as you would talk to an all-knowing guru. Using natural language to search is normal and taken for granted, but have you ever wondered what allows search engines to comb through billions of documents on the internet so seamlessly?


Semantic search is the answer. In this article I will try to articulate what semantic search is and build a small demo showing how you can create your own version of it.

What is Semantic Search?

Short answer: Search with meaning

That is, it understands the intent of the question instead of just matching keywords or their sequence of occurrence with regular expressions. Semantic search also learns from previous searches, links and context, drawing on techniques such as BM25, latent semantic indexing, cosine similarity, bi-encoders and cross-encoders. It lets users search the way they would speak, verbs, filler words and all.
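
To make "search with meaning" concrete, here is a minimal sketch (separate from the demo later in the article) that compares sentence embeddings with cosine similarity using the sentence-transformers util helpers; the model name 'all-MiniLM-L6-v2' is only an illustrative choice.

from sentence_transformers import SentenceTransformer, util

# Any small sentence-embedding model will do for this illustration.
model = SentenceTransformer('all-MiniLM-L6-v2')

query = 'Nobody has sane thoughts'
candidates = ['Absence of sanity', 'A man is riding a horse.']

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Cosine similarity ranks candidates by meaning rather than shared keywords,
# so the first sentence should score noticeably higher than the second.
print(util.cos_sim(query_emb, cand_embs))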

Semantic search is made possible by three major components in the design architecture (sketched right after this list):

  1. Search query encoder
  2. Embedding storage
  3. Database encoder
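
At a high level the three pieces line up like this (a rough outline with placeholder names, not any specific library's API):

doc_vectors = encoder.encode(documents)          # database encoder (run offline, once)
index.add(doc_vectors)                           # embedding storage
hits = index.search(encoder.encode([query]))     # search query encoder + nearest-neighbour lookup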

For the demo I will use the transformers ecosystem and the model "sentence-transformers/msmarco-distilbert-base-dot-prod-v3". This is a dot-product model and is well suited for asymmetric semantic search, where the responses (typically full passages) differ in length from the short search query. This single model covers the encoder needs on both the query and the document side.
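
As a quick illustration of what "asymmetric" and "dot-product" mean in practice, the sketch below (assuming the same model as the demo) scores a short query against a longer passage with sentence-transformers' dot-product helper:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')

query = 'what is semantic search'                 # short query
passage = ('Semantic search tries to understand the intent behind a query '
           'instead of matching keywords, typically by comparing dense '
           'vector representations of the query and the documents.')   # longer answer passage

q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode(passage, convert_to_tensor=True)

# Dot-product models are scored with an unnormalised dot product rather than cosine similarity.
print(util.dot_score(q_emb, p_emb))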

Since we are using a BERT-based foundation model, the document embeddings become large, so FAISS, Facebook's library for fast similarity search over embeddings, is a good choice. FAISS overcomes the query limitations of traditional search engines that are optimized for hash-based (exact keyword) lookup.
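
The demo below uses a flat (exact) inner-product index, which is perfectly fine for a handful of sentences. For millions of embeddings an approximate FAISS index is the usual choice; here is a hedged sketch, assuming 768-dimensional vectors as produced by the DistilBERT model above and random stand-in data:

import faiss
import numpy as np

dim = 768                      # embedding size of the DistilBERT model above
nlist = 100                    # number of clusters; tune for your corpus size

quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

embeddings = np.random.rand(10000, dim).astype('float32')   # stand-in data
index.train(embeddings)        # IVF indexes must be trained before vectors are added
index.add(embeddings)
index.nprobe = 10              # clusters visited per query; trades recall for speed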

Below is a code snippet for a sample semantic search model:


pip install transformers
pip install -U sentence-transformers
pip install faiss-gpu

import time

import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer

def dataread(dataframe_idx):
    # Pull the matching row out of the dataframe and keep only the fields we need.
    info = df.iloc[dataframe_idx]
    meta_dict = dict()
    meta_dict['ind'] = info['ind']
    meta_dict['desc'] = info['desc'][:500]
    return meta_dict

def search(query, top_k, index, model):
    # Encode the query, find the nearest document vectors in the FAISS index,
    # then map the returned ids back to rows of the dataframe.
    t = time.time()
    query_vector = model.encode([query])
    top_k_hits = index.search(query_vector, top_k)
    top_k_ids = top_k_hits[1].tolist()[0]
    top_k_ids = list(np.unique(top_k_ids))
    results = [dataread(idx) for idx in top_k_ids]
    print('Search time: {:.4f} s'.format(time.time() - t))
    return results


sentences = ['Absence of sanity',
             'Lack of saneness',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']

# Wrap the sentences in a dataframe with the columns dataread() expects.
df = pd.DataFrame({'ind': range(len(sentences)), 'desc': sentences})
        


model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')


# Encode every document, build an inner-product FAISS index and persist it to disk.
encoded_data = model.encode(df['desc'].tolist())
encoded_data = np.asarray(encoded_data.astype('float32'))
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(df))))
faiss.write_index(index, 'data_set.index')

query = 'Nobody has sane thoughts'
number_top_matches = 5

results = search(query, top_k=number_top_matches, index=index, model=model)
        


print("\n\n======================\n\n")
print("Query:", query)
print("\n")
for result in results:
    print('\t', result)
Results
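
Because the index was persisted with faiss.write_index, it can be reloaded later without re-encoding the documents. A small usage sketch (the query string here is just an illustrative example):

index = faiss.read_index('data_set.index')
results = search('A woman is making music', top_k=3, index=index, model=model)
for result in results:
    print(result)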

As you can see from the code, building a semantic search prototype is straightforward with the transformers and sentence-transformers libraries. I hope this gives you some insight into how a semantic search architecture works in its most simplified form.


Peace out!
