Exploring the Future of Semantic Search: Hugging Face Leaderboard Model vs. Langchain with OpenAI

Introduction

In the fast-moving field of artificial intelligence and natural language processing, a few key players have emerged as frontrunners. Hugging Face, known for its work on language models and model deployment, and OpenAI, the organization behind the GPT-3 and GPT-4 models, are leaders in this field. Semantic search has meanwhile become a pivotal AI application, offering improved information retrieval and comprehension.

In this article, we compare semantic search using a model from the Hugging Face leaderboard against Langchain with an OpenAI embedding model. Both approaches store and query their vectors in Milvus, a high-performance vector database designed to support similarity search at scale.

Initializing the environment

Install the Milvus database by following the steps at https://milvus.io/docs/install_standalone-docker.md

Install the sentence-transformers and Milvus libraries using:

pip install sentence-transformers
pip install protobuf==3.20.0
pip install grpcio-tools
pip install pymilvus        

Customs commodity codes dataset

An HSCODE dataset is a valuable resource for organizations and individuals involved in international trade and customs compliance. It contains a comprehensive list of Customs Harmonized Codes (HSCODEs), which are standardized codes used to classify and categorize products and goods for customs and trade purposes. Each HSCODE corresponds to a group of products or commodities, and the dataset typically includes a detailed description of each HSCODE.
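To make the structure of these codes concrete, here is a minimal sketch of how a code can be split into its hierarchical prefixes. In the Harmonized System the first two digits identify the chapter, the first four the heading, and the first six the subheading; any further digits are a national extension. The function name `split_hscode` is hypothetical, and the 8-digit length is an assumption based on the sample codes in this article (for example 85171200).

```python
def split_hscode(code: int) -> dict:
    """Split a commodity code into its hierarchical prefixes.

    Assumes 8-digit codes, as in the sample records of this article.
    """
    # Restore a leading zero that is lost when the code is stored as an integer.
    digits = str(code).zfill(8)
    return {
        "chapter": digits[:2],     # first 2 digits
        "heading": digits[:4],     # first 4 digits
        "subheading": digits[:6],  # first 6 digits (international HS level)
        "full_code": digits,       # remaining digits are a national extension
    }


print(split_hscode(85171200))
```

Keeping the code as a zero-padded string avoids losing the leading zero of chapters 01 to 09, which is something to watch for when the code is stored in an integer column.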

HSCODE dataset sample records

The Customs commodity codes dataset can be downloaded from the Dubai Open Data platform at https://www.dubaipulse.gov.ae/organisation/dubai-customs/service/dc-records

1- Hugging Face LLMs

Hugging Face, a company that has been at the forefront of developing and sharing state-of-the-art NLP models, has consistently maintained a position of leadership in the field.

The Hugging Face leaderboard for sentence-similarity models (the model list sorted by downloads) can be found here: https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads

We use "sentence-transformers/all-MiniLM-L6-v2" to embed the HSCODE descriptions for similarity search. This versatile language model converts each customs Harmonized Code (HSCODE) description into a 384-dimensional numerical vector that captures its meaning. Once embeddings exist for both the descriptions and the user query, a similarity measure such as cosine similarity can be used to rank the descriptions against the query.
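As an illustration of the similarity step, the following sketch ranks two descriptions against a query by cosine similarity. The 3-dimensional vectors are made-up toy values standing in for real model embeddings, and the helper name `cosine_similarity` is only for this example; it is not the article's pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not produced by any model).
toy_embeddings = {
    "Telephones for cellular networks": [0.9, 0.1, 0.2],
    "Apples, fresh or chilled.": [0.1, 0.9, 0.3],
}
query_vector = [0.8, 0.2, 0.1]

# Sort descriptions from most to least similar to the query.
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
print(ranked[0])
```

Cosine similarity depends only on the direction of the vectors, not on their length, which is why it is a common default for comparing text embeddings.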

MilvusDB Schema

We start by creating a Milvus schema for the HSCODE information:

from pymilvus import CollectionSchema, FieldSchema, DataType

# Primary key: one integer id per row.
row_id = FieldSchema(
  name="row_id",
  dtype=DataType.INT64,
  is_primary=True,
)

# The commodity code is stored as an integer, so it takes neither max_length
# nor a string default (a default_value must match the field's dtype).
hscode = FieldSchema(
  name="hscode",
  dtype=DataType.INT64,
)

# Text fields. The default value is used if the field is left empty
# during data inserts or upserts.
description = FieldSchema(
  name="description",
  dtype=DataType.VARCHAR,
  max_length=1000,
  default_value="Unknown"
)

section = FieldSchema(
  name="section",
  dtype=DataType.VARCHAR,
  max_length=300,
  default_value="Unknown"
)

chapter = FieldSchema(
  name="chapter",
  dtype=DataType.VARCHAR,
  max_length=500,
  default_value="Unknown"
)

heading = FieldSchema(
  name="heading",
  dtype=DataType.VARCHAR,
  max_length=1000,
  default_value="Unknown"
)

# Embedding of the description; all-MiniLM-L6-v2 produces 384-dimensional vectors.
description_vector = FieldSchema(
  name="description_vector",
  dtype=DataType.FLOAT_VECTOR,
  dim=384
)

schema = CollectionSchema(
  fields=[row_id, hscode, description, section, chapter, heading, description_vector],
  description="hscodes",
  enable_dynamic_field=True
)

collection_name = "hscode_MiniLM_L6"

Next, connect to the Milvus database and create a new collection based on the defined schema:

from pymilvus import connections
from pymilvus import Collection

# Replace the user, password, and host placeholders with your own values.
connections.connect(
  alias="default",
  user='username',
  password='password',
  host='DB host',
  port='19530'
)

# Create the collection from the schema defined above.
collection = Collection(
    name=collection_name,
    schema=schema,
    using='default',
    shards_num=2
)

Data Preparation

Next, we insert the data and create a new index that uses the COSINE similarity metric. I experimented with the other available similarity metrics and found that COSINE consistently delivered the best results.

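from pymilvus import Collection
from sentence_transformers import SentenceTransformer
import pandas as pd

df = pd.read_csv("./HSCodeComplete.csv")

# Embed every description with the sentence-transformers model.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sentences = df['Description'].tolist()
sentence_embeddings = model.encode(sentences)

collection = Collection("hscode_MiniLM_L6")  # Get an existing collection.

# Column-oriented data, in the same order as the schema fields.
# Row ids are sequential, one per CSV row (7470 rows in the original run).
data = [
  list(range(len(df))),
  df['HSCode'].tolist(),
  sentences,
  df['Section'].tolist(),
  df['Chapter'].tolist(),
  df['Heading'].tolist(),
  sentence_embeddings
]
mr = collection.insert(data)
collection.flush()

# Build the COSINE index on the vector field. The index type and nlist value
# are assumptions; the article specifies only the metric.
collection.create_index(
  field_name="description_vector",
  index_params={"metric_type": "COSINE", "index_type": "IVF_FLAT", "params": {"nlist": 128}}
)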
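To show why the choice of similarity metric can change the results, here is a small sketch comparing the three metric families Milvus offers for float vectors (IP, L2, and COSINE) on toy 2-dimensional vectors. The vectors and the candidate names are made up for illustration; this is not a reproduction of the experiments mentioned above.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
candidates = {
    "long_off_axis": [10.0, 10.0],  # large magnitude, 45 degrees away from the query
    "short_aligned": [0.9, 0.1],    # small magnitude, nearly parallel to the query
}

# IP and COSINE are similarities (higher is better); L2 is a distance (lower is better).
best_by_ip = max(candidates, key=lambda k: inner_product(query, candidates[k]))
best_by_cosine = max(candidates, key=lambda k: cosine(query, candidates[k]))
best_by_l2 = min(candidates, key=lambda k: l2_distance(query, candidates[k]))

print(best_by_ip, best_by_cosine, best_by_l2)
```

Inner product rewards vector length, so it prefers the long vector here, while cosine and L2 both prefer the nearly parallel one. When all vectors are normalized to unit length, the three metrics produce the same ranking.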

Search Results

A tricky query such as "apple iPhone" is a good test: depending on how well the language model captures meaning, the results may include food-related items, electronics-related items, or both.

collection = Collection("hscode_MiniLM_L6")
collection.load()
search_params = {
    "metric_type": "COSINE",
    "offset": 0,
    "ignore_growing": False,
    "params": {"nprobe": 10}
}
search_items = ["apple iphone"]
search_embeddings = model.encode(search_items)
results = collection.search(
    data=search_embeddings,
    anns_field="description_vector",
    # the sum of `offset` in `param` and `limit`
    # should be less than 16384.
    param=search_params,
    limit=5,
    expr=None,
    # set the names of the fields you want to
    # retrieve from the search result.
    output_fields=['hscode', 'description'],
    consistency_level="Strong"
)

# Collect each hit as its printable string, one list per query.
data = []
for hits in results:
    data.append([str(hit) for hit in hits])

data

The top 5 results are as follows:

[["id: 6625, distance: 0.3133668601512909, entity: {'hscode': 85437031, 'description': 'Electronic Cigarettes'}",
  "id: 6953, distance: 0.308815598487854, entity: {'hscode': 90064000, 'description': 'Instant print cameras.'}",
  "id: 6451, distance: 0.29862725734710693, entity: {'hscode': 85171200, 'description': 'Telephones for cellular networks or for other wireless networks'}",
  "id: 6980, distance: 0.2906135320663452, entity: {'hscode': 90132090, 'description': 'Other devices, appliances and instruments :'}",
  "id: 923, distance: 0.28824496269226074, entity: {'hscode': 13021940, 'description': 'Aloes.'}"]]        

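The third result is the correct one. With the COSINE metric, a higher distance value means a closer match, so the hits are listed from most to least similar. Four of the five results are related to telephones and electronics, which matches the searched commodity; the last result ("Aloes.") is the only outlier.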
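The hit strings above can be turned into structured records for further processing. The sketch below parses the string format exactly as it appears in this output; the pattern and the helper name `parse_hit` are assumptions made for illustration, and other pymilvus versions may print hits differently, so reading the fields directly from the hit objects is generally the more robust option.

```python
import ast
import re

# Matches strings of the form "id: <int>, distance: <float>, entity: {<dict>}".
HIT_PATTERN = re.compile(r"id: (\d+), distance: ([\d.eE+-]+), entity: (\{.*\})")


def parse_hit(text):
    """Parse one printed hit into a flat dictionary."""
    match = HIT_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized hit format: {text!r}")
    hit_id, distance, entity = match.groups()
    # The entity part is a Python dict literal, so literal_eval reads it safely.
    return {"id": int(hit_id), "distance": float(distance), **ast.literal_eval(entity)}


# Two of the hit strings from the output above.
raw_hits = [
    "id: 6451, distance: 0.29862725734710693, entity: {'hscode': 85171200, 'description': 'Telephones for cellular networks or for other wireless networks'}",
    "id: 923, distance: 0.28824496269226074, entity: {'hscode': 13021940, 'description': 'Aloes.'}",
]

parsed = [parse_hit(hit) for hit in raw_hits]
for hit in parsed:
    print(hit["hscode"], round(hit["distance"], 3), hit["description"])
```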

2- Langchain with OpenAI embedding model

Langchain is a framework for building applications on top of language models, and it provides ready-made integrations with OpenAI's embedding models and with vector stores such as Milvus. Whether for content recommendation, sentiment analysis, or semantic search, this integration makes it straightforward to put OpenAI embeddings to work.

Initializing the environment

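Install the langchain library, together with the openai and tiktoken packages that its OpenAI embeddings wrapper depends on, in your Jupyter Notebook environment:

pip install langchain
pip install openai tiktoken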

Data Preparation

Here we use Langchain's direct integration with the Milvus database to insert the data into a new collection. However, the API is not flexible enough to insert additional columns into the collection.

Replace the placeholders with your OpenAI API key and Milvus address.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Milvus

# from_texts expects a list of strings.
sentences = df['Description'].tolist()

embeddings_model = OpenAIEmbeddings(openai_api_key="OpenAI API key", model="text-embedding-ada-002")
vector_db = Milvus.from_texts(
    sentences,
    embeddings_model,
    drop_old=True,  # drop any existing collection with the same name
    collection_name='hscode_openai',
    connection_args={"host": "Host Address", "port": "19530"},
)

Search Results

Now we try a similar test query, this time "apple iphone mobile", and observe the results:

docs = vector_db.similarity_search_with_score("apple iphone mobile", k=5)
docs        

The code returns the five entries most similar to the query, each with its score:

[(Document(page_content='Apples, fresh or chilled.'), 0.36462166905403137),
 (Document(page_content='Charger for mobile phone and tablets'),
  0.3666340708732605),
 (Document(page_content='headphones for line telephones'),
  0.39265894889831543),
 (Document(page_content='Telephones for cellular networks or for other wireless networks'),
  0.39536523818969727),
 (Document(page_content='Apple juice, of a Brix value exceeding 20.'),
  0.39887306094169617)]        

Here the model mixed food products in with the electronics: two of the five results are about apples. It still returned the correct entry, in fourth place.
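One simple way to put the two result lists side by side is to look at the position of the correct entry in each. The sketch below does this with the descriptions taken from the two outputs in this article; the helper name `rank_of` is hypothetical. A single query is not a benchmark, and the two searches used slightly different query strings, so this is only an illustration of the comparison.

```python
def rank_of(descriptions, target):
    """Return the 1-based position of target in descriptions, or None if absent."""
    for position, description in enumerate(descriptions, start=1):
        if description == target:
            return position
    return None


TARGET = "Telephones for cellular networks or for other wireless networks"

# Descriptions in the order returned by each search in this article.
minilm_top5 = [
    "Electronic Cigarettes",
    "Instant print cameras.",
    TARGET,
    "Other devices, appliances and instruments :",
    "Aloes.",
]
openai_top5 = [
    "Apples, fresh or chilled.",
    "Charger for mobile phone and tablets",
    "headphones for line telephones",
    TARGET,
    "Apple juice, of a Brix value exceeding 20.",
]

for name, results in [("all-MiniLM-L6-v2", minilm_top5),
                      ("text-embedding-ada-002", openai_top5)]:
    rank = rank_of(results, TARGET)
    print(name, "rank:", rank, "reciprocal rank:", round(1 / rank, 2))
```

Averaging the reciprocal rank over a larger set of labeled queries would give a more reliable comparison than a single example.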

Conclusion

While the initial Hugging Face model provided more relevant search results, opting for Langchain with OpenAI offers distinct advantages, including not having to host the embedding model on-premises and seamless integration with the Milvus database.

Langchain does have limitations, however: it is hard to define additional columns in the collection, and there is limited flexibility to change the underlying similarity function. There is also a cost associated with embedding calls to the OpenAI models.


Nicola A.

Co-founder, COO Pigro - Power up your workspace with Pigro website: pigro.ai

1 year ago

Building a solution that works for every application is hard. We recently released a solution to split documents into optimal chunks of text. We split PDF and Office files based on the original document structure and content semantics.

Raghu Ugare

Engineering Leader. Architect. Love Physics, Math, Programming Languages, FP, AI & ML.

1 year ago

Very nice practical intro with working examples of playing with LLM's indeed Ahmed Fayed Elnahel. We too had very interesting adventures in the field of "Semantic" Search when I was working for one of India's largest e-pharmacy cos... (Not?) surprisingly, some of the ML-based models easily began to beat the traditional "Lexical" search algorithms in terms of ATC & other metrics! The future is definitely filled with interesting possibilities with AI/ML...!

Ahmed Abd Elmaksoud

Solutions Architect | Technical Project Manager | Engineering Lead | Digital Transformation | IT Systems Expert

1 year ago

Good Job Eng Ahmed, I am wondering if you test the model on Arabic data and what is the performance of the model in this case

Mohammed S.

Functional Lead in Customs & Sea Port Digital Transformation | e-Commerce | Supply Chain | Data Analysis & Governance | AI Effectiveness | Oracle 23ai | Requirements Mgmt | Agile SAFe | Product Lead Workflow patterns

1 year ago

Valuable insights: the intersection of semantic search with (Hugging Face Leaderboard + Open AI)
