Exploring the Future of Semantic Search: Hugging Face Leaderboard Model vs. Langchain with OpenAI
Ahmed Fayed Elnahel
Digital Transformation | Enterprise Architecture | Data Science
Introduction
In the ever-evolving landscape of artificial intelligence and natural language processing, semantic search has become a key capability for retrieving information by meaning rather than by keyword. In this article, we delve into how semantic search, the Hugging Face leaderboard, and OpenAI's embedding models intersect with Milvus VectorDB, a high-performance vector database designed to support similarity search at scale.
Initializing the environment
Install the Milvus database by following the steps at https://milvus.io/docs/install_standalone-docker.md
Install the sentence-transformers and Milvus client libraries using:
pip install sentence-transformers
pip install protobuf==3.20.0
pip install grpcio-tools
pip install pymilvus
Customs commodity codes dataset
An HSCODE dataset is a valuable resource for organizations and individuals involved in international trade and customs compliance. It contains a comprehensive list of Customs Harmonized Codes (HSCODEs), which are standardized codes used to classify and categorize various products and goods for customs and trade purposes. Each HSCODE corresponds to a group of products or commodities, and the dataset typically includes detailed descriptions of these HSCODEs.
The Customs commodity codes dataset can be downloaded from the Dubai Open Data platform at https://www.dubaipulse.gov.ae/organisation/dubai-customs/service/dc-records
1- Hugging Face LLMs
Hugging Face, a company at the forefront of developing and sharing state-of-the-art NLP models, has consistently maintained a leadership position in the field.
The Hugging Face leaderboard for sentence-similarity models can be found at https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads
Utilizing "sentence-transformers/all-MiniLM-L6-v2" embedding the "HSCODE descriptions" for similarity searches. This versatile language model adeptly converts customs Harmonized Code (HSCODE) descriptions into numerical representations that encapsulate their semantic essence. Through generating embeddings for descriptions and user queries, similarity assessments
MilvusDB Schema
We start by creating a Milvus schema for the HSCODE information:
from pymilvus import CollectionSchema, FieldSchema, DataType

id = FieldSchema(
    name="row_id",
    dtype=DataType.INT64,
    is_primary=True,
)
hscode = FieldSchema(
    name="hscode",
    dtype=DataType.INT64,
    # The default value is used if this field is left empty during inserts or upserts.
    # Its data type must match the one specified in dtype.
    default_value=0
)
description = FieldSchema(
    name="description",
    dtype=DataType.VARCHAR,
    max_length=1000,
    default_value="Unknown"
)
section = FieldSchema(
    name="section",
    dtype=DataType.VARCHAR,
    max_length=300,
    default_value="Unknown"
)
chapter = FieldSchema(
    name="chapter",
    dtype=DataType.VARCHAR,
    max_length=500,
    default_value="Unknown"
)
heading = FieldSchema(
    name="heading",
    dtype=DataType.VARCHAR,
    max_length=1000,
    default_value="Unknown"
)
description_vector = FieldSchema(
    name="description_vector",
    dtype=DataType.FLOAT_VECTOR,
    dim=384  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
)
schema = CollectionSchema(
    fields=[id, hscode, description, section, chapter, heading, description_vector],
    description="hscodes",
    enable_dynamic_field=True
)
collection_name = "hscode_MiniLM_L6"
Next, connect to the Milvus database and create a new collection based on the defined schema:
from pymilvus import connections
from pymilvus import Collection

con = connections.connect(
    alias="default",
    user='username',
    password='password',
    host='DB host',
    port='19530'
)
collection = Collection(
    name=collection_name,
    schema=schema,
    using='default',
    shards_num=2
)
Next, we insert the data and build a fresh index using the COSINE similarity metric (an index-creation sketch follows the insert code below). I experimented with several other similarity metrics and found that COSINE consistently delivered the most relevant results.
from sentence_transformers import SentenceTransformer
from pymilvus import Collection
import pandas as pd

df = pd.read_csv("./HSCodeComplete.csv")
df.head()

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sentences = df['Description'].values
hscodes = df['HSCode'].values
headings = df['Heading'].values
chapters = df['Chapter'].values
sections = df['Section'].values
sentence_embeddings = model.encode(sentences)

collection = Collection("hscode_MiniLM_L6")  # Get the existing collection.

# Insert the data with sequential row ids.
data = [
    [i for i in range(len(df))],
    hscodes,
    sentences,
    sections,
    chapters,
    headings,
    sentence_embeddings
]
mr = collection.insert(data)
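The collection needs a vector index before it can be searched. Here is a minimal sketch of creating one with the COSINE metric; the IVF_FLAT index type and nlist value are assumptions, so tune them for your data:

# Flush so the inserted rows are persisted, then build the vector index.
collection.flush()
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",  # assumed index type; HNSW is another common choice
    "params": {"nlist": 128}   # assumed value; tune for your data size
}
collection.create_index(
    field_name="description_vector",
    index_params=index_params
)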
When we search the data with a tricky query like "apple iphone", the results can vary between food-related and electronics-related items, depending on the proficiency of the language model:
import json

collection = Collection("hscode_MiniLM_L6")
collection.load()

search_params = {
    "metric_type": "COSINE",
    "offset": 0,
    "ignore_growing": False,
    "params": {"nprobe": 10}
}
search_items = ["apple iphone"]
search_embeddings = model.encode(search_items)
search_embeddings.shape

results = collection.search(
    data=search_embeddings,
    anns_field="description_vector",
    # The sum of `offset` in `param` and `limit` should be less than 16384.
    param=search_params,
    limit=5,
    expr=None,
    # Names of the fields to retrieve from the search result.
    output_fields=['hscode', 'description'],
    consistency_level="Strong"
)
data = []
for hits in iter(results):
    jdata = json.loads(str(hits))
    data.append(jdata)
data
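As an aside, if you prefer not to round-trip through JSON, the pymilvus Hit objects can be read directly; a minimal sketch:

# Alternative: iterate over the Hit objects directly instead of parsing JSON.
for hit in results[0]:
    print(hit.id, hit.distance, hit.entity.get('description'))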
The top five results are as follows:
[["id: 6625, distance: 0.3133668601512909, entity: {'hscode': 85437031, 'description': 'Electronic Cigarettes'}",
"id: 6953, distance: 0.308815598487854, entity: {'hscode': 90064000, 'description': 'Instant print cameras.'}",
"id: 6451, distance: 0.29862725734710693, entity: {'hscode': 85171200, 'description': 'Telephones for cellular networks or for other wireless networks'}",
"id: 6980, distance: 0.2906135320663452, entity: {'hscode': 90132090, 'description': 'Other devices, appliances and instruments :'}",
"id: 923, distance: 0.28824496269226074, entity: {'hscode': 13021940, 'description': 'Aloes.'}"]]
The third result is the correct one, and most of the results relate to telephones and electronics, which is a reasonable representation of the searched commodity.
2- Langchain with OpenAI embedding model
Langchain, harnessing the power of OpenAI's cutting-edge embedding model, marks a transformative leap in the field of natural language processing and understanding. Whether it's for content recommendation, sentiment analysis, or semantic search, Langchain's integration with OpenAI's model ensures that the richness of language is harnessed to its fullest.
Initializing the environment
Install the langchain library in your Jupyter Notebook:
pip install langchain
Data Preparation
Here we use LangChain's direct integration with the Milvus database to insert the data into a new collection. However, the API is not flexible enough to insert additional columns (such as hscode, section, chapter, and heading) into the collection.
Replace the placeholders below with your OpenAI API key and Milvus database address.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Milvus

sentences = df['Description'].values.tolist()
embeddings_model = OpenAIEmbeddings(openai_api_key="OpenAI API key", model="text-embedding-ada-002")
vector_db = Milvus.from_texts(
    sentences,
    embeddings_model,
    drop_old=True,
    collection_name='hscode_openai',
    connection_args={"host": "Host Address", "port": "19530"},
)
Search Results
Now we try a similar test query ("apple iphone mobile") and observe the results:
docs = vector_db.similarity_search_with_score("apple iphone mobile", k=5)
docs
The code returns the top five most similar entries to the passed input.
[(Document(page_content='Apples, fresh or chilled.'), 0.36462166905403137),
 (Document(page_content='Charger for mobile phone and tablets'), 0.3666340708732605),
 (Document(page_content='headphones for line telephones'), 0.39265894889831543),
 (Document(page_content='Telephones for cellular networks or for other wireless networks'), 0.39536523818969727),
 (Document(page_content='Apple juice, of a Brix value exceeding 20.'), 0.39887306094169617)]
Here we can see that the model confused electronics with food products; still, it returned the correct item as the fourth result.
Conclusion
While the Hugging Face model provided more relevant search results, opting for Langchain with OpenAI offers distinct advantages, including avoiding on-premises hosting of the embedding model and seamless integration with the Milvus database.
When it comes to Langchain, it's important to note certain limitations in defining additional columns within the designated collection. These constraints also limit flexibility in altering the underlying similarity function. Additionally, there is a cost associated with making embedding calls to the OpenAI models.
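For example, to get a rough sense of the embedding cost before calling the API, you can count tokens with the tiktoken library (a hedged sketch; multiply the total by OpenAI's current per-token price):

import tiktoken

# Count the tokens that would be sent to the embedding endpoint.
enc = tiktoken.encoding_for_model("text-embedding-ada-002")
total_tokens = sum(len(enc.encode(str(s))) for s in df['Description'].values)
print(f"Tokens to embed: {total_tokens}")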