Exploring the Power of Vector Databases and Embeddings in Enhancing Large Language Models
Suman Biswas
Engineering Leadership, Emerging Tech & AI - Enterprise Architecture | Digital Strategy | Building Responsible AI Platform
In the rapidly advancing field of artificial intelligence, two key technologies have become essential: vector databases and embeddings. These tools are pivotal in managing and interpreting the high-dimensional data that is the lifeblood of Large Language Models (LLMs) like GPT-4. In this blog, we'll explore what vector databases and embeddings are, their importance, and how they work together to power LLMs.
Understanding Vector Databases
What are Vector Databases? Vector databases are designed to store and manage vector data: arrays of numbers that represent complex data in a form machine learning models can work with. They are built for operations on high-dimensional vectors, which are ubiquitous in AI applications.
Key Features These databases excel at indexing and querying high-dimensional data, offering the scalability and low-latency retrieval needed to handle large AI datasets.
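To make this concrete, here is a minimal sketch of the core operation a vector database performs: finding the stored vectors closest to a query vector. It uses plain NumPy and a brute-force scan, with random data standing in for real embeddings; production systems replace the scan with approximate nearest-neighbor indexes (such as HNSW or IVF) plus persistence and metadata filtering so that queries stay fast at scale:
import numpy as np
# A toy "index": 10,000 random 384-dimensional vectors standing in for stored embeddings
rng = np.random.default_rng(42)
stored_vectors = rng.normal(size=(10_000, 384)).astype("float32")
def cosine_top_k(query, vectors, k=5):
    # Brute-force cosine-similarity search: score every stored vector, return the k best
    query = query / np.linalg.norm(query)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed @ query
    top_ids = np.argsort(-scores)[:k]
    return list(zip(top_ids.tolist(), scores[top_ids].tolist()))
query_vector = rng.normal(size=384).astype("float32")
print(cosine_top_k(query_vector, stored_vectors))  # [(id, score), ...] for the 5 nearest vectors
A dedicated vector database performs exactly this kind of lookup, but without scoring every stored vector on every query.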
The Significance of Embeddings
What are Embeddings? Embeddings are dense vector representations of text, images, or other data types. In the context of LLMs, they transform words or phrases into numerical vectors, capturing semantic meaning that the model can understand and process.
Why are Embeddings Important? Embeddings allow LLMs to grasp the nuances of language, including context, tone, and semantic relationships. This understanding is critical for applications like natural language processing, sentiment analysis, and more.
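As a quick illustration, here is a small sketch using the same all-MiniLM-L6-v2 sentence-transformer that the semantic search example later in this post relies on. Sentences with similar meaning end up close together in the embedding space even when they share few words (the example sentences are made up):
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = [
    "Which city has the highest population in the world?",
    "What is the world's largest city?",
    "How do I format a hard drive?",
]
embeddings = model.encode(sentences)
# Cosine similarity between the first sentence and the other two
print(util.cos_sim(embeddings[0], embeddings[1]))  # high: same meaning, different wording
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated topic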
Integrating Vector Databases with LLM Embeddings
Enhancing LLM Efficiency Vector databases store and retrieve the embeddings generated by LLMs with low latency. This integration is crucial for applications requiring quick access to processed data, such as real-time language translation or contextual search engines.
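A common integration pattern is retrieval-augmented generation: embed the user's query, fetch the most similar stored documents from the vector database, and hand them to the LLM as context. The sketch below keeps everything in memory to show the flow; a plain Python list stands in for the vector database, the three example documents and the prompt format are invented for illustration, and the prompt is printed rather than sent to an LLM:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Toy "knowledge base"; in production these texts and their embeddings would live in a vector database
documents = [
    "Pinecone indexes are created with a fixed dimension and similarity metric.",
    "BERT base produces 768-dimensional embeddings.",
    "Cosine similarity measures the angle between two vectors.",
]
doc_embeddings = model.encode(documents)
def retrieve(query, k=2):
    # Embed the query and return the k most similar documents
    query_embedding = model.encode(query)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top = scores.argsort(descending=True)[:k]
    return [documents[int(i)] for i in top]
query = "How many dimensions do BERT embeddings have?"
context = "\n".join(retrieve(query))
# In a real application this prompt would be sent to an LLM such as GPT-4
prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
print(prompt)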
Real-World Applications
AI and Machine Learning From natural language processing to recommendation systems, embeddings and vector databases are at the forefront of modern AI applications, enabling more accurate and contextually aware models.
Search Engine Optimization By understanding the semantic context, vector databases equipped with embeddings provide more relevant and nuanced search results, going beyond simple keyword matching.
Code Example: Generating and Storing Embeddings
Let's look at a simple Python example demonstrating how embeddings can be generated using an LLM and stored in a vector database:
from transformers import AutoTokenizer, AutoModel
import torch
import pinecone
import numpy as np
# Initialize tokenizer and model from the Hugging Face library
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Sample text
text = "The quick brown fox jumps over the lazy dog"
# Tokenize and generate embeddings
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state.mean(dim=1).numpy()
# Initialize Pinecone client
pinecone.init(api_key="************", environment="*********")
index_name = "exampleindex"
# Create or connect to an existing index
if index_name not in pinecone.list_indexes():
    # BERT base embeddings are 768-dimensional, so the index dimension must match
    pinecone.create_index(index_name, dimension=embeddings.shape[1])
index = pinecone.Index(index_name)
# Convert numpy array to list
embedding_list = embeddings[0].tolist()
# Store the embedding as a list
index.upsert(vectors=[("sample_text1", embedding_list)])
# Retrieve and use the embedding
retrieved_embedding = index.fetch(ids=["sample_text1"])
print(retrieved_embedding)
{'namespace': '',
 'vectors': {'sample_text1': {'id': 'sample_text1',
                              'metadata': {},
                              'values': [-0.140095562,
                                         -0.166087076,
                                         0.121717781,
                                         0.109698921,
                                         0.306327224,
                                         ...]}}}
(output truncated)
This code uses the BERT model from the Hugging Face library to generate embeddings for a piece of text and then stores these embeddings in a vector database. The embeddings can later be retrieved for various applications.
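For example, continuing from the snippet above (and reusing the same tokenizer, model, and index), the stored vectors can also be searched by similarity rather than fetched by ID. This is only a sketch, and the second sentence is a made-up example:
# Embed a new piece of text with the same BERT pipeline as above
new_text = "A fast brown fox leaps over a sleepy dog"
inputs = tokenizer(new_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
query_vector = outputs.last_hidden_state.mean(dim=1)[0].tolist()
# Ask the index for the stored embeddings most similar to the new text
results = index.query(vector=query_vector, top_k=3)
for match in results['matches']:
    print(match['id'], match['score'])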
Code Example: Semantic Search
In this example, we will skip the data preparation steps, as they can be very time-consuming, and jump straight in with the prebuilt dataset from Pinecone Datasets. Let's go ahead and download the dataset.
from pinecone_datasets import load_dataset
dataset = load_dataset('quora_all-MiniLM-L6-bm25')
# we drop the original metadata column and use the 'blob' column as metadata instead
dataset.documents.drop(['metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# we will use 80K rows of the dataset between rows 240K -> 320K
dataset.documents.drop(dataset.documents.index[320_000:], inplace=True)
dataset.documents.drop(dataset.documents.index[:240_000], inplace=True)
dataset.head()
id values sparse_values metadata
240000 515997 [-0.00531694, 0.06937869, -0.0092854, 0.003286... {'indices': [845, 1657, 13677, 20780, 27058, 2... {'text': ' Why is a "law of sciences" importan...
240001 515998 [-0.09243751, 0.065432355, -0.06946959, 0.0669... {'indices': [2110, 6324, 9754, 13677, 15207, 2... {'text': ' Is it possible to format a BitLocke...
240002 515999 [-0.021924071, 0.032280188, -0.020190848, 0.07... {'indices': [2110, 4949, 23579, 23758, 27058, ... {'text': ' Can formatting a hard drive stress ...
240003 516000 [-0.120020054, 0.024080949, 0.10693012, -0.018... {'indices': [22014, 24734, 24773, 25791, 25991... {'text': ' Are the new Samsung Galaxy J7 and J...
240004 516001 [-0.095293395, -0.048446465, -0.017618902, -0.... {'indices': [307, 2110, 5785, 12969, 12971, 13... {'text': ' I just watched an add for Indonesia...
Creating an Index
Now that the data is ready, we can set up our index to store it. We begin by initializing our connection to Pinecone.
import os
import pinecone
# get api key from app.pinecone.io
PINECONE_API_KEY = '***************'
# find your environment next to the api key in pinecone console
PINECONE_ENV ='********'
pinecone.init(
api_key=PINECONE_API_KEY,
environment=PINECONE_ENV
)
Now we create a new index called semantic-search-fast. It's important that we align the index dimension and metric parameters with those required by the MiniLM-L6 model.
index_name = 'semantic-search-fast'
import time
# only create index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=len(dataset.documents.iloc[0]['values']),
        metric='cosine'
    )
# wait a moment for the index to be fully initialized
time.sleep(1)
# now connect to the index
index = pinecone.GRPCIndex(index_name)
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(batch)
Making Queries
Now that our index is populated, we can begin making queries. We are performing a semantic search for similar questions, so we should embed and search with another question. Let's begin.
from sentence_transformers import SentenceTransformer
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model
query = "which city has the highest population in the world?"
# create the query vector
xq = model.encode(query).tolist()
# now query
xc = index.query(xq, top_k=5, include_metadata=True)
xc
{'matches': [{'id': '69331',
'metadata': {'text': " What's the world's largest city?"},
'score': 0.78591084,
'sparse_values': {'indices': [], 'values': []},
'values': []},
{'id': '69332',
'metadata': {'text': ' What is the biggest city?'},
'score': 0.7273166,
'sparse_values': {'indices': [], 'values': []},
'values': []},
{'id': '84749',
'metadata': {'text': " What are the world's most advanced "
'cities?'},
'score': 0.7100672,
'sparse_values': {'indices': [], 'values': []},
'values': []},
{'id': '109231',
'metadata': {'text': ' Where is the most beautiful city in the '
'world?'},
'score': 0.69609785,
'sparse_values': {'indices': [], 'values': []},
'values': []},
{'id': '109230',
'metadata': {'text': ' What is the greatest, most beautiful city '
'in the world?'},
'score': 0.6582236,
'sparse_values': {'indices': [], 'values': []},
'values': []}],
'namespace': ''}
In the returned response xc we can see the questions most relevant to our query. There are no exact matches, but the returned questions clearly cover similar topics. We can reformat this response to make it a little easier to read:
for result in xc['matches']:
    print(f"{round(result['score'], 2)}: {result['metadata']['text']}")
0.79: What's the world's largest city?
0.73: What is the biggest city?
0.71: What are the world's most advanced cities?
0.7: Where is the most beautiful city in the world?
0.66: What is the greatest, most beautiful city in the world?
The synergy between vector databases and embeddings is revolutionizing the way LLMs process and understand complex data. As these technologies continue to evolve, they will undoubtedly play a crucial role in shaping the future of AI and machine learning.