Exploring the Future of Semantic Search: Hugging Face Leaderboard Model vs. Langchain with OpenAI
Ahmed Fayed Elnahel
Digital Transformation | Enterprise Architecture | Data Science
Introduction
In the ever-evolving landscape of artificial intelligence and natural language processing, semantic search has become a key capability for retrieving information by meaning rather than by keyword. In this article, we delve into how semantic search, the Hugging Face leaderboard, and OpenAI's embedding models intersect with Milvus VectorDB, a high-performance vector database designed to support similarity search at scale.
Initializing the environment
Install the Milvus database by following the steps at https://milvus.io/docs/install_standalone-docker.md
Install the sentence-transformers and Milvus client libraries using:
pip install sentence-transformers
pip install protobuf==3.20.0
pip install grpcio-tools
pip install pymilvus
Customs commodity codes dataset
An HSCODE dataset is a valuable resource for organizations and individuals involved in international trade and customs compliance. It contains a comprehensive list of Customs Harmonized Codes (HSCODEs), which are standardized codes used to classify and categorize various products and goods for customs and trade purposes. Each HSCODE corresponds to a group of products or commodities, and the dataset typically includes detailed descriptions of these HSCODEs.
The Customs commodity codes dataset can be downloaded from the Dubai Open Data platform at https://www.dubaipulse.gov.ae/organisation/dubai-customs/service/dc-records
1- Hugging Face LLMs
Hugging Face, a company at the forefront of developing and sharing state-of-the-art NLP models, has consistently maintained a leadership position in the field.
The Hugging Face leaderboard for sentence-similarity models can be found at https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads
Utilizing "sentence-transformers/all-MiniLM-L6-v2" embedding the "HSCODE descriptions" for similarity searches. This versatile language model adeptly converts customs Harmonized Code (HSCODE) descriptions into numerical representations that encapsulate their semantic essence. Through generating embeddings for descriptions and user queries, similarity assessments
MilvusDB Schema
We start by creating a Milvus schema for the HSCODE information:
from pymilvus import CollectionSchema, FieldSchema, DataType

id = FieldSchema(
    name="row_id",
    dtype=DataType.INT64,
    is_primary=True,
)
hscode = FieldSchema(
    name="hscode",
    dtype=DataType.INT64,
    # The default value is used if this field is left empty during inserts or upserts.
    # Its data type must match the one specified in dtype.
    default_value=0
)
description = FieldSchema(
    name="description",
    dtype=DataType.VARCHAR,
    max_length=1000,
    default_value="Unknown"
)
section = FieldSchema(
    name="section",
    dtype=DataType.VARCHAR,
    max_length=300,
    default_value="Unknown"
)
chapter = FieldSchema(
    name="chapter",
    dtype=DataType.VARCHAR,
    max_length=500,
    default_value="Unknown"
)
heading = FieldSchema(
    name="heading",
    dtype=DataType.VARCHAR,
    max_length=1000,
    default_value="Unknown"
)
description_vector = FieldSchema(
    name="description_vector",
    dtype=DataType.FLOAT_VECTOR,
    dim=384  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
)
schema = CollectionSchema(
    fields=[id, hscode, description, section, chapter, heading, description_vector],
    description="hscodes",
    enable_dynamic_field=True
)
collection_name = "hscode_MiniLM_L6"
Next, connect to the Milvus database and create a new collection based on the defined schema:
from pymilvus import connections
from pymilvus import Collection

con = connections.connect(
    alias="default",
    user='username',
    password='password',
    host='DB host',
    port='19530'
)
collection = Collection(
    name=collection_name,
    schema=schema,
    using='default',
    shards_num=2
)
Next, we insert the data and build a fresh index using the COSINE similarity metric (an index-creation sketch follows the insert code below). I experimented with several other similarity metrics and found that COSINE consistently delivered the most relevant results.
from sentence_transformers import SentenceTransformer
from pymilvus import Collection
import pandas as pd

df = pd.read_csv("./HSCodeComplete.csv")
df.head()

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
sentences = df['Description'].values
hscodes = df['HSCode'].values
headings = df['Heading'].values
chapters = df['Chapter'].values
sections = df['Section'].values
sentence_embeddings = model.encode(sentences)

collection = Collection("hscode_MiniLM_L6")  # Get the existing collection.

# Insert the data with sequential row ids.
data = [
    [i for i in range(len(df))],
    hscodes,
    sentences,
    sections,
    chapters,
    headings,
    sentence_embeddings
]
mr = collection.insert(data)
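The collection needs a vector index before it can be searched. Here is a minimal sketch of creating one with the COSINE metric; the IVF_FLAT index type and nlist value are assumptions, so tune them for your data:

# Flush so the inserted rows are persisted, then build the vector index.
collection.flush()
index_params = {
    "metric_type": "COSINE",
    "index_type": "IVF_FLAT",  # assumed index type; HNSW is another common choice
    "params": {"nlist": 128}   # assumed value; tune for your data size
}
collection.create_index(
    field_name="description_vector",
    index_params=index_params
)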
When we search the data with a tricky query like "apple iphone", the results can vary between food-related and electronics-related items, depending on the proficiency of the language model:
import json

collection = Collection("hscode_MiniLM_L6")
collection.load()

search_params = {
    "metric_type": "COSINE",
    "offset": 0,
    "ignore_growing": False,
    "params": {"nprobe": 10}
}
search_items = ["apple iphone"]
search_embeddings = model.encode(search_items)
search_embeddings.shape

results = collection.search(
    data=search_embeddings,
    anns_field="description_vector",
    # The sum of `offset` in `param` and `limit` should be less than 16384.
    param=search_params,
    limit=5,
    expr=None,
    # Names of the fields to retrieve from the search result.
    output_fields=['hscode', 'description'],
    consistency_level="Strong"
)
data = []
for hits in iter(results):
    jdata = json.loads(str(hits))
    data.append(jdata)
data
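As an aside, if you prefer not to round-trip through JSON, the pymilvus Hit objects can be read directly; a minimal sketch:

# Alternative: iterate over the Hit objects directly instead of parsing JSON.
for hit in results[0]:
    print(hit.id, hit.distance, hit.entity.get('description'))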
The top five results are as follows:
[["id: 6625, distance: 0.3133668601512909, entity: {'hscode': 85437031, 'description': 'Electronic Cigarettes'}",
"id: 6953, distance: 0.308815598487854, entity: {'hscode': 90064000, 'description': 'Instant print cameras.'}",
"id: 6451, distance: 0.29862725734710693, entity: {'hscode': 85171200, 'description': 'Telephones for cellular networks or for other wireless networks'}",
"id: 6980, distance: 0.2906135320663452, entity: {'hscode': 90132090, 'description': 'Other devices, appliances and instruments :'}",
"id: 923, distance: 0.28824496269226074, entity: {'hscode': 13021940, 'description': 'Aloes.'}"]]
The third result is the correct one, and most of the results relate to telephones and electronics, which is a reasonable representation of the searched commodity.
2- Langchain with OpenAI embedding model
Langchain, harnessing the power of OpenAI's cutting-edge embedding model, marks a transformative leap in the field of natural language processing and understanding. Whether it's for content recommendation, sentiment analysis, or semantic search, Langchain's integration with OpenAI's model ensures that the richness of language is harnessed to its fullest.
Initializing the environment
Install the langchain library in your Jupyter Notebook:
pip install langchain
Data Preparation
Here we use LangChain's direct integration with the Milvus database to insert the data into a new collection. However, the API is not flexible enough to insert additional columns (such as hscode, section, chapter, and heading) into the collection.
Replace the placeholders below with your OpenAI API key and Milvus database address.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Milvus

sentences = df['Description'].values.tolist()
embeddings_model = OpenAIEmbeddings(openai_api_key="OpenAI API key", model="text-embedding-ada-002")
vector_db = Milvus.from_texts(
    sentences,
    embeddings_model,
    drop_old=True,
    collection_name='hscode_openai',
    connection_args={"host": "Host Address", "port": "19530"},
)
Search Results
Now we try a similar test query ("apple iphone mobile") and observe the results:
docs = vector_db.similarity_search_with_score("apple iphone mobile", k=5)
docs
The code returns the top five most similar entries to the passed input.
[(Document(page_content='Apples, fresh or chilled.'), 0.36462166905403137),
 (Document(page_content='Charger for mobile phone and tablets'), 0.3666340708732605),
 (Document(page_content='headphones for line telephones'), 0.39265894889831543),
 (Document(page_content='Telephones for cellular networks or for other wireless networks'), 0.39536523818969727),
 (Document(page_content='Apple juice, of a Brix value exceeding 20.'), 0.39887306094169617)]
Here we can see that the model confused electronics with food products; still, it returned the correct item as the fourth result.
Conclusion
While the Hugging Face model provided more relevant search results, opting for Langchain with OpenAI offers distinct advantages, including avoiding on-premises hosting of the embedding model and seamless integration with the Milvus database.
When it comes to Langchain, it's important to note certain limitations in defining additional columns within the designated collection. These constraints also limit flexibility in altering the underlying similarity function. Additionally, there is a cost associated with making embedding calls to the OpenAI models.
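For example, to get a rough sense of the embedding cost before calling the API, you can count tokens with the tiktoken library (a hedged sketch; multiply the total by OpenAI's current per-token price):

import tiktoken

# Count the tokens that would be sent to the embedding endpoint.
enc = tiktoken.encoding_for_model("text-embedding-ada-002")
total_tokens = sum(len(enc.encode(str(s))) for s in df['Description'].values)
print(f"Tokens to embed: {total_tokens}")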