Natural Language Data Search
Remember how tedious search was a decade ago? Today you can type a question into any search engine as if you were talking to an all-knowing guru. Searching in natural language is now normal and taken for granted, but have you ever wondered what allows search engines to sift through billions of documents on the internet so seamlessly?
Semantic search is your answer. In this article I will explain what semantic search is and build a small demo showing how you can create your own version of it.
What is Semantic Search?
Short answer: Search with meaning
That is, the engine understands the intent behind a question rather than just matching keywords or their order of occurrence with regular expressions. Semantic search can also learn from previous searches, links, and context, drawing on techniques such as BM25, latent semantic indexing, cosine similarity, and bi-encoder and cross-encoder models. It lets users search the way they speak, verbs and filler words included.
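To make "search with meaning" concrete, here is a minimal sketch using the sentence-transformers library and its util.cos_sim helper (the model choice and example sentences are my own, picked purely for illustration). A bi-encoder scores two differently worded sentences as close when their meanings align, even with almost no keyword overlap:

from sentence_transformers import SentenceTransformer, util

# Any general-purpose sentence embedding model works for this illustration.
model = SentenceTransformer('all-MiniLM-L6-v2')

# The first two sentences share meaning but almost no keywords;
# the third is unrelated.
emb = model.encode(['How do I reset my password?',
                    'Steps to recover account login credentials',
                    'A cheetah is running behind its prey.'])

# Cosine similarity: expect the first pair to score noticeably higher
# than either sentence scores against the unrelated third one.
print(util.cos_sim(emb[0], emb[1]))
print(util.cos_sim(emb[0], emb[2]))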
Semantic search becomes possible when the design architecture accounts for three major pieces: an encoder that turns text into embeddings, an index that stores those embeddings for fast similarity lookup, and a query pipeline that encodes the search string and retrieves the nearest matches.
For the demo I will use the transformers ecosystem (via the sentence-transformers library) and the model "sentence-transformers/msmarco-distilbert-base-dot-prod-v3". This is a dot-product model and is well suited for asymmetric semantic search, where the retrieved responses differ in length and form from the search query, e.g., a short question matched against longer passages. This model satisfies our need for an encoder.
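As a quick illustration of asymmetric search (a sketch with made-up query and passage text), a short query can be scored against a much longer passage with a plain dot product, which is the score this family of models is trained to produce:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')

query = 'what is semantic search'
passage = ('Semantic search systems encode queries and documents into '
           'dense vectors and rank documents by how close their vectors '
           'are to the query vector, capturing meaning rather than '
           'exact keyword overlap.')

# Encode both texts; each becomes a 768-dimensional vector.
q_vec, p_vec = model.encode([query, passage])

# Dot-product relevance score: higher means more relevant.
print(np.dot(q_vec, p_vec))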
Since we are using a BERT-based foundation model, encoding the corpus produces a large volume of embedding data. FAISS, Facebook's library for fast similarity search over dense vectors, is a good choice for storing and querying those document embeddings. FAISS overcomes the query limitations of traditional search engines, which are optimized for keyword and hash-based lookup rather than vector similarity.
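The demo below uses an exact FAISS index (IndexFlatIP), which compares the query against every stored vector. At the scale of millions of embeddings you would typically switch to one of FAISS's approximate indexes instead; here is a rough sketch of an IVF index, with random vectors standing in for real embeddings:

import faiss
import numpy as np

d = 768                      # embedding dimension (matches DistilBERT)
n = 100_000                  # pretend corpus size
xb = np.random.random((n, d)).astype('float32')  # stand-in embeddings

# IVF index: cluster vectors into nlist cells, then search only the
# nprobe closest cells instead of scanning the whole corpus.
nlist = 256
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(xb)              # IVF indexes must be trained before adding
index.add(xb)
index.nprobe = 16            # trade a little recall for a big speedup

scores, ids = index.search(xb[:1], 5)
print(ids)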
Below is a code snippet for a sample semantic search model:
pip install transformers
pip install -U sentence-transformers
pip install faiss-gpu
import time

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

# Sample corpus; each row gets an id ('ind') and a text description ('desc').
sentences = ['Absence of sanity',
             'Lack of saneness',
             'A man is eating food.',
             'A man is eating a piece of bread.',
             'The girl is carrying a baby.',
             'A man is riding a horse.',
             'A woman is playing violin.',
             'Two men pushed carts through the woods.',
             'A man is riding a white horse on an enclosed ground.',
             'A monkey is playing drums.',
             'A cheetah is running behind its prey.']
df = pd.DataFrame({'ind': range(len(sentences)), 'desc': sentences})

def dataread(dataframe_idx):
    # Look up a matched row and return its id and (truncated) description.
    info = df.iloc[dataframe_idx]
    meta_dict = dict()
    meta_dict['ind'] = info['ind']
    meta_dict['desc'] = info['desc'][:500]
    return meta_dict

def search(query, top_k, index, model):
    # Encode the query, fetch the top_k nearest embeddings from FAISS,
    # then map the returned ids back to the source rows.
    t = time.time()
    query_vector = model.encode([query])
    top_k_hits = index.search(query_vector, top_k)
    top_k_ids = top_k_hits[1].tolist()[0]
    top_k_ids = list(np.unique(top_k_ids))
    results = [dataread(idx) for idx in top_k_ids]
    print('Search time: {:.4f} s'.format(time.time() - t))
    return results

# Encode the corpus and build a FAISS inner-product (dot-product) index.
model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')
encoded_data = model.encode(df['desc'].tolist())
encoded_data = np.asarray(encoded_data.astype('float32'))
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(df))))
faiss.write_index(index, 'data_set.index')

query = 'Nobody has sane thoughts'  #@param {type: 'string'}
number_top_matches = 5  #@param {type: "number"}

results = search(query, top_k=number_top_matches, index=index, model=model)

print("\n\n======================\n\n")
print("Query:", query)
print("\n")
for result in results:
    print('\t', result)
As you can see from the code, building a semantic search prototype is straightforward with the sentence-transformers library on top of transformers. I hope this gives you insight into how a semantic search architecture works in its most simplified form.
Peace out!