Unlocking the Power of pgVector: Distance Functions and Indexing Explained

PostgreSQL is a powerhouse for relational data, but with the rise of machine learning and AI, managing and querying vector embeddings has become increasingly important. Enter pgVector, a PostgreSQL extension that adds native support for vectors and enables efficient similarity searches. In this article, we’ll explore the various distance functions provided by pgVector and how indexing can significantly boost query performance.

What is pgVector?

pgVector extends PostgreSQL by introducing a new data type—vector—for storing n-dimensional vectors. It also includes support for similarity searches using various distance metrics, making it a natural choice for applications in recommendation systems, natural language processing, and computer vision.
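
The queries below assume a setup like the following (the `items` table name and the 3-dimensional `vector(3)` column are illustrative choices matching the short example vectors in this article, not requirements):

```sql
-- Enable the extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Sample table with a 3-dimensional vector column
CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    embedding vector(3)
);

-- A few example rows
INSERT INTO items (embedding)
VALUES ('[1, 2, 3]'), ('[4, 5, 6]'), ('[0, 0, 0]');
```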

Distance Functions in pgVector

pgVector supports several distance metrics to measure similarity or dissimilarity between vectors. Here’s an overview of the available functions:

1. L2 Distance (<->)

  • Formula: d(a, b) = √( Σᵢ (aᵢ − bᵢ)² )
  • Description: Computes the Euclidean distance between two vectors. It is the "straight-line" distance in n-dimensional space.
  • Use Case: Best suited for applications where spatial distance matters, such as image recognition or 3D point cloud analysis.

Example Query:

SELECT id, embedding, embedding <-> '[1, 2, 3]' AS l2_distance
FROM items
ORDER BY l2_distance
LIMIT 5;

2. Negative Inner Product (<#>)

  • Formula: d(a, b) = −(a · b) = −Σᵢ aᵢbᵢ
  • Description: Computes the negative of the dot product between two vectors. Higher inner product values indicate greater similarity, so more similar vectors produce smaller (more negative) results and sort first under an ascending ORDER BY.
  • Use Case: Commonly used in machine learning models where the magnitude and direction of vectors matter.

Example Query:

SELECT id, embedding, embedding <#> '[1, 2, 3]' AS negative_inner_product 
FROM items 
ORDER BY negative_inner_product LIMIT 5;        

3. Cosine Distance (<=>)

  • Formula: d(a, b) = 1 − (a · b) / (‖a‖ ‖b‖)
  • Description: Computes 1 minus the cosine of the angle between two vectors. A value of 0 indicates the vectors point in the same direction; a value of 1 indicates they are orthogonal.
  • Use Case: Ideal for text similarity, recommendation systems, and comparing normalized embeddings.

Example Query:

SELECT id, embedding, embedding <=> '[1, 2, 3]' AS cosine_distance 
FROM items 
ORDER BY cosine_distance LIMIT 5;

4. L1 Distance (<+>) (Introduced in pgVector 0.7.0)

  • Formula: d(a, b) = Σᵢ |aᵢ − bᵢ|
  • Description: Calculates the Manhattan distance, summing the absolute differences of each vector component.
  • Use Case: Effective for sparse data and where differences along each dimension are equally important.

Example Query:

SELECT id, embedding, embedding <+> '[1, 2, 3]' AS l1_distance 
FROM items 
ORDER BY l1_distance LIMIT 5;        

5. Hamming Distance (<~>) (Introduced in pgVector 0.7.0)

  • Formula: d(a, b) = Σᵢ [aᵢ ≠ bᵢ] (the number of differing bit positions)
  • Description: Works only with binary vectors (PostgreSQL's bit type) and measures bit-level differences.
  • Use Case: Useful in applications like DNA sequencing and hash comparisons.

Example Query:

-- <~> operates on the bit type, so embedding must be a bit(n) column
-- and the query value a bit string, not a vector literal
SELECT id, embedding, embedding <~> '101' AS hamming_distance
FROM items
ORDER BY hamming_distance LIMIT 5;
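
The `<~>` operator (like `<%>` in the next section) is defined for PostgreSQL's `bit` type rather than `vector`. A minimal sketch of a table for binary embeddings (the `binary_items` name and `bit(3)` width are illustrative):

```sql
-- Hypothetical table for binary embeddings; <~> and <%> require bit, not vector
CREATE TABLE binary_items (
    id        bigserial PRIMARY KEY,
    embedding bit(3)
);

INSERT INTO binary_items (embedding) VALUES ('101'), ('111'), ('000');

SELECT id, embedding, embedding <~> '101' AS hamming_distance
FROM binary_items
ORDER BY hamming_distance
LIMIT 5;
```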

6. Jaccard Distance (<%>) (Introduced in pgVector 0.7.0)

  • Formula: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
  • Description: Measures dissimilarity between two sets represented as binary vectors.
  • Use Case: Ideal for categorical data, document comparisons, or set similarity.

Example Query:

-- <%> also operates on the bit type, so the query value is a bit string
SELECT id, embedding, embedding <%> '101' AS jaccard_distance
FROM items
ORDER BY jaccard_distance LIMIT 5;

Boosting Query Performance with Indexing

When working with large datasets, indexing is critical for speeding up similarity searches. pgVector supports the following types of indexes:

1. HNSW Index (Hierarchical Navigable Small World)

  • Description: A graph-based index designed for fast approximate nearest neighbor searches.
  • Use Case: Best suited for real-time or low-latency applications with large datasets.

Example:

-- An operator class matching the query operator is required (vector_l2_ops for <->)
CREATE INDEX hnsw_index ON items USING hnsw (embedding vector_l2_ops);
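
HNSW exposes build-time parameters and a query-time setting that trade recall for speed. The values below are illustrative starting points, not tuned recommendations:

```sql
-- m: max connections per graph node; ef_construction: candidate
-- list size during index build (larger = better recall, slower build)
CREATE INDEX hnsw_tuned ON items
USING hnsw (embedding vector_l2_ops)
WITH (m = 16, ef_construction = 64);

-- Query-time recall/speed trade-off (per session)
SET hnsw.ef_search = 100;
```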

2. Ivfflat Index (Inverted File Flat)

  • Description: Partitions vectors into clusters for efficient similarity searches.
  • Use Case: Works well for approximate searches with trade-offs in accuracy and speed.

Example:

-- The operator class must match the query operator (vector_l2_ops for <->)
CREATE INDEX ivfflat_index ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
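
IVFFlat accuracy is governed by how many of the `lists` clusters are scanned at query time, controlled by `ivfflat.probes`. A sketch with illustrative values:

```sql
-- lists ≈ rows / 1000 is a common starting point for smaller tables
CREATE INDEX ivfflat_tuned ON items
USING ivfflat (embedding vector_l2_ops)
WITH (lists = 100);

-- Scan more clusters for better recall at the cost of speed (per session)
SET ivfflat.probes = 10;
```

Note that IVFFlat should be built after the table has data, since the cluster centroids are learned from the existing rows.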

Choosing the Right Distance Metric and Index

The choice of distance metric and index depends on your application:

  • L2 Distance + HNSW Index: Ideal for image or spatial similarity search where low query latency matters.
  • Cosine Distance + Ivfflat Index: Great for text similarity or recommendation systems.
  • Hamming Distance + HNSW Index: Perfect for binary vector searches.
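
Whichever combination you pick, the index's operator class must match the distance operator used in the query, or the planner falls back to a sequential scan. A sketch for cosine distance (the index name is illustrative):

```sql
-- vector_cosine_ops matches the <=> operator
CREATE INDEX items_embedding_cos_idx ON items
USING hnsw (embedding vector_cosine_ops);

-- This query can use the index above; an L2 (vector_l2_ops) index could not serve it
SELECT id
FROM items
ORDER BY embedding <=> '[1, 2, 3]'
LIMIT 5;
```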


Conclusion

pgVector bridges the gap between traditional relational databases and modern AI-driven applications by enabling efficient vector operations directly in PostgreSQL. With its rich support for distance metrics and indexing techniques, it’s a powerful tool for building intelligent, scalable systems.

Explore pgVector for your next AI-powered application and unlock the full potential of vector embeddings within PostgreSQL.

