Exploring the Power of Vector Databases and Embeddings in Enhancing Large Language Models
Suman Biswas
Engineering Leadership, Emerging Tech & AI - Enterprise Architecture | Digital Strategy | Building Responsible AI Platform
In the rapidly advancing field of artificial intelligence, two key technologies have become essential: vector databases and embeddings. These tools are pivotal in managing and interpreting the high-dimensional data that is the lifeblood of Large Language Models (LLMs) like GPT-4. In this blog, we'll explore what vector databases and embeddings are, their importance, and how they work together to power LLMs.
Understanding Vector Databases
What are Vector Databases? Vector databases are designed to store and manage vector data: arrays of numbers that represent complex data in a form machine learning models can work with. They are built for operations on high-dimensional vectors, which are ubiquitous in AI applications.
Key Features These databases excel at indexing and querying high-dimensional data, offering the scalability and low-latency retrieval needed to handle large AI datasets.
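To make this concrete, here is a minimal sketch of the core operation a vector database performs: finding the stored vectors closest to a query vector. It uses plain NumPy and a brute-force scan, with random data standing in for real embeddings; production systems replace the scan with approximate nearest-neighbor indexes (such as HNSW or IVF) plus persistence and metadata filtering so that queries stay fast at scale:
import numpy as np
# A toy "index": 10,000 random 384-dimensional vectors standing in for stored embeddings
rng = np.random.default_rng(42)
stored_vectors = rng.normal(size=(10_000, 384)).astype("float32")
def cosine_top_k(query, vectors, k=5):
    # Brute-force cosine-similarity search: score every stored vector, return the k best
    query = query / np.linalg.norm(query)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed @ query
    top_ids = np.argsort(-scores)[:k]
    return list(zip(top_ids.tolist(), scores[top_ids].tolist()))
query_vector = rng.normal(size=384).astype("float32")
print(cosine_top_k(query_vector, stored_vectors))  # [(id, score), ...] for the 5 nearest vectors
A dedicated vector database performs exactly this kind of lookup, but without scoring every stored vector on every query.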
The Significance of Embeddings
What are Embeddings? Embeddings are dense vector representations of text, images, or other data types. In the context of LLMs, they transform words or phrases into numerical vectors, capturing semantic meaning that the model can understand and process.
Why are Embeddings Important? Embeddings allow LLMs to grasp the nuances of language, including context, tone, and semantic relationships. This understanding is critical for applications like natural language processing, sentiment analysis, and more.
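As a quick illustration, here is a small sketch using the same all-MiniLM-L6-v2 sentence-transformer that the semantic search example later in this post relies on. Sentences with similar meaning end up close together in the embedding space even when they share few words (the example sentences are made up):
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Which city has the highest population in the world?",
    "What is the world's largest city?",
    "How do I format a hard drive?",
]
embeddings = model.encode(sentences)
# Cosine similarity between the first sentence and the other two
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: same meaning, different wording
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated topic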
Integrating Vector Databases with LLM Embeddings
Enhancing LLM Efficiency Vector databases store and retrieve the embeddings generated by LLMs with low latency. This integration is crucial for applications requiring quick access to processed data, such as real-time language translation or contextual search engines.
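A common integration pattern is retrieval-augmented generation: embed the user's query, fetch the most similar stored documents from the vector database, and hand them to the LLM as context. The sketch below keeps everything in memory to show the flow; a plain Python list stands in for the vector database, the three example documents and the prompt format are invented for illustration, and the prompt is printed rather than sent to an LLM:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Toy "knowledge base"; in production these texts and their embeddings would live in a vector database
documents = [
    "Pinecone indexes are created with a fixed dimension and similarity metric.",
    "BERT base produces 768-dimensional embeddings.",
    "Cosine similarity measures the angle between two vectors.",
]
doc_embeddings = model.encode(documents)
def retrieve(query, k=2):
    # Embed the query and return the k most similar documents
    query_embedding = model.encode(query)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top = scores.argsort(descending=True)[:k]
    return [documents[int(i)] for i in top]
query = "How many dimensions do BERT embeddings have?"
context = "\n".join(retrieve(query))
# In a real application this prompt would be sent to an LLM such as GPT-4
prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)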
Real-World Applications
AI and Machine Learning From natural language processing to recommendation systems, embeddings and vector databases are at the forefront of modern AI applications, enabling more accurate and contextually aware models.
Search Engine Optimization By understanding the semantic context, vector databases equipped with embeddings provide more relevant and nuanced search results, going beyond simple keyword matching.
Code Example: Generating and Storing Embeddings
Let's look at a simple Python example demonstrating how embeddings can be generated using an LLM and stored in a vector database:
from transformers import AutoTokenizer, AutoModel
import torch
import pinecone
import numpy as np
# Initialize tokenizer and model from the Hugging Face library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Sample text
text = "The quick brown fox jumps over the lazy dog"
# Tokenize and generate embeddings
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
# Initialize Pinecone client
pinecone.init(api_key="************", environment="*********")
index_name = "exampleindex"
# Create or connect to an existing index
if index_name not in pinecone.list_indexes():
    # BERT base embeddings are 768-dimensional, so the index dimension must match
    pinecone.create_index(index_name, dimension=embeddings.shape[1])
index = pinecone.Index(index_name)
# Convert numpy array to list
embedding_list = embeddings[0].tolist()
# Store the embedding as a list
index.upsert(vectors=[("sample_text1", embedding_list)])
# Retrieve and use the embedding
retrieved_embedding = index.fetch(ids=["sample_text1"])
print(retrieved_embedding)
{'namespace': '',
 'vectors': {'sample_text1': {'id': 'sample_text1',
                              'metadata': {},
                              'values': [-0.140095562,
                                         -0.166087076,
                                         0.121717781,
                                         0.109698921,
                                         0.306327224,
                                         ...]}}}
(output truncated)
This code uses the BERT model from the Hugging Face library to generate embeddings for a piece of text and then stores these embeddings in a vector database. The embeddings can later be retrieved for various applications.
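For example, continuing from the snippet above (and reusing the same tokenizer, model, and index), the stored vectors can also be searched by similarity rather than fetched by ID. This is only a sketch, and the second sentence is a made-up example:
# Embed a new piece of text with the same BERT pipeline as above
new_text = "A fast brown fox leaps over a sleepy dog"
inputs = tokenizer(new_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
query_vector = outputs.last_hidden_state.mean(dim=1)[0].tolist()
# Ask the index for the stored embeddings most similar to the new text
results = index.query(vector=query_vector, top_k=3)
for match in results['matches']:
    print(match['id'], match['score'])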
Code Example: Semantic Search
In this example, we will skip the data preparation steps, as they can be very time-consuming, and jump straight in with the prebuilt dataset from Pinecone Datasets. Let's go ahead and download the dataset.
from pinecone_datasets import load_dataset
dataset = load_dataset('quora_all-MiniLM-L6-bm25')
# we drop the original metadata column and use the 'blob' column as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# we will use 80K rows of the dataset between rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)
dataset.head()
id values sparse_values metadata
240000 515997 [-0.00531694, 0.06937869, -0.0092854, 0.003286... {'indices': [845, 1657, 13677, 20780, 27058, 2... {'text': ' Why is a "law of sciences" importan...
240001 515998 [-0.09243751, 0.065432355, -0.06946959, 0.0669... {'indices': [2110, 6324, 9754, 13677, 15207, 2... {'text': ' Is it possible to format a BitLocke...
240002 515999 [-0.021924071, 0.032280188, -0.020190848, 0.07... {'indices': [2110, 4949, 23579, 23758, 27058, ... {'text': ' Can formatting a hard drive stress ...
240003 516000 [-0.120020054, 0.024080949, 0.10693012, -0.018... {'indices': [22014, 24734, 24773, 25791, 25991... {'text': ' Are the new Samsung Galaxy J7 and J...
240004 516001 [-0.095293395, -0.048446465, -0.017618902, -0.... {'indices': [307, 2110, 5785, 12969, 12971, 13... {'text': ' I just watched an add for Indonesia...
Creating an Index
Now that the data is ready, we can set up our index to store it. We begin by initializing our connection to Pinecone.
import os
import pinecone
# get api key from app.pinecone.io
PINECONE_API_KEY = '***************'
# find your environment next to the api key in pinecone console
PINECONE_ENV ='********'
pinecone.init(
api_key=PINECONE_API_KEY,
environment=PINECONE_ENV
)
Now we create a new index called semantic-search-fast. It's important that we align the index dimension and metric parameters with those required by the MiniLM-L6 model.
index_name = 'semantic-search-fast'
import time
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=len(dataset.documents.iloc[0]['values']),
        metric='cosine'
    )
# wait a moment for the index to be fully initialized
time.sleep(1)
# now connect to the index
index = pinecone.GRPCIndex(index_name)
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)
Making Queries
Now that our index is populated, we can begin making queries. We are performing a semantic search for similar questions, so we should embed and search with another question. Let's begin.
from sentence_transformers import SentenceTransformer
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model
query = "which city has the highest population in the world?"
# create the query vector
xq = model.encode(query).tolist()
# now query
xc = index.query(xq, top_k=5, include_metadata=True)
xc
{'matches': [{'id': '69331',
'metadata': {'text': " What's the world's largest city?"},
'score': 0.78591084,
'sparse_values': {'indices': [], 'values': []},
'values': []},
{'id': '69332',
'metadata': {'text': ' What is the biggest city?'},
'score': 0.7273166,
'sparse_values': {'indices': [], 'values': []},
'values': []},
{'id': '84749',
'metadata': {'text': " What are the world's most advanced "
'cities?'},
'score': 0.7100672,
'sparse_values': {'indices': [], 'values': []},
'values': []},
{'id': '109231',
'metadata': {'text': ' Where is the most beautiful city in the '
'world?'},
'score': 0.69609785,
'sparse_values': {'indices': [], 'values': []},
'values': []},
{'id': '109230',
'metadata': {'text': ' What is the greatest, most beautiful city '
'in the world?'},
'score': 0.6582236,
'sparse_values': {'indices': [], 'values': []},
'values': []}],
'namespace': ''}
In the returned response xc we can see the questions most relevant to our query. There are no exact matches, but the returned questions clearly cover similar topics. We can reformat this response to make it a little easier to read:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
0.79: What's the world's largest city?
0.73: What is the biggest city?
0.71: What are the world's most advanced cities?
0.7: Where is the most beautiful city in the world?
0.66: What is the greatest, most beautiful city in the world?
The synergy between vector databases and embeddings is revolutionizing the way LLMs process and understand complex data. As these technologies continue to evolve, they will undoubtedly play a crucial role in shaping the future of AI and machine learning.