Vector Databases - What you need to know
Dennis Layton
A senior IT architect, now retired. I remain a proponent of AI literacy and the safe and ethical adoption of AI. I write regularly on LinkedIn and Medium (Dennis Layton).
In the era of big data and artificial intelligence, organizations are constantly seeking innovative ways to manage, analyze, and extract value from vast amounts of information, particularly information in unstructured form. Vector databases have recently emerged as a powerful way to organize and search complex, high-dimensional data. Combined with Large Language Models (LLMs) such as OpenAI's GPT-4, they pave the way for an array of applications that leverage natural language processing, machine learning, and deep learning. From improving search engine relevance to generating personalized content, this combination promises to unlock potential across a wide range of industries.
In 2021, Pinecone launched a vector database primarily targeting data scientists. More recently, the company has started emphasizing AI-driven semantic search. With the emergence of AI powered by Large Language Models (LLMs), businesses are recognizing the increased value of vector databases. Investors share this sentiment: the company recently announced a $100 million Series B investment at a post-money valuation of $750 million. Pinecone is not the only vector database attracting funding; Qdrant, Zilliz, and Chroma have also received investments.
So, what is the reason behind this sudden interest in vector databases and what exactly are they?
Key Concepts
Vector databases are an entirely new type of database, so it’s important to understand a few key concepts.
High-Dimensional Data
In the context of data analysis and statistics, “high-dimensional” refers to datasets that have a large number of variables or features. Each variable represents a different characteristic or attribute of the data, and the more variables there are, the higher the dimensionality of the dataset.
In the context of Natural Language Processing (NLP), “high-dimensional” data typically refers to representing textual information using a large number of features like semantic meaning, syntactic relationships, or contextual information.
High-dimensional word embeddings can have hundreds or even thousands of dimensions. For example, in OpenAI’s embedding model, there are 1536 dimensions.
Embeddings
Embeddings are compact numerical representations of complex data objects, such as words, sentences, images, or graphs, in a continuous vector space. These representations capture the essential properties of the objects and the relationships between them. Embeddings provide the high-dimensional vectors that are stored, indexed, and retrieved in vector databases.
Vector Database
A vector database is a type of database specifically designed to store, index, and retrieve high-dimensional vectors efficiently. In other words, a vector database is a means of persisting embeddings so that they can be reused.
Semantic Search
Semantic search is the process of searching for information or documents based on the meaning and context of the query, rather than relying solely on keyword matching. It involves understanding the intent of the user’s query and the relationships between words, phrases, or concepts within the data.
Embeddings provide compact numerical representations of data objects that capture their meaning and relationships. They enable the measurement of semantic similarity and facilitate efficient similarity search.
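Semantic similarity between two embeddings is usually measured with cosine similarity: the cosine of the angle between the two vectors, where values near 1.0 mean the texts point in nearly the same semantic direction. A minimal sketch follows; the three-dimensional vectors and the example words are toy stand-ins for real 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values, not from a real model).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
banana = [0.1, 0.2, 0.95]

# Semantically related words should score higher than unrelated ones.
print(cosine_similarity(king, queen) > cosine_similarity(king, banana))  # True
```

A vector database applies this same idea at scale, using approximate-nearest-neighbor indexes so it does not have to compare the query against every stored vector one by one.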
Introducing ChromaDB
For the rest of this article, I will use ChromaDB, a commercially backed open-source vector database. You can learn more about Chroma on the project's website.
In ChromaDB, each stored item pairs a piece of text with its embedding. A set of these items is known as a Collection. Collections are where you'll store your embeddings, text, and any additional metadata. You create a collection with a name.
You assign your own unique IDs to identify items in the collection. You can also attach metadata to each item: a dictionary of key-value pairs. If your text is broken into chapters and verses, you can assign the corresponding key-value pairs to each piece of text, and this metadata can be used later to filter a query. That said, I won't go any further into metadata in this introductory article.
Sam Altman Interview
For this article, I am going to use a transcript from a recent interview with Sam Altman which covers a wide range of topics. As an aside, this transcript was generated using OpenAI's Whisper API.
The Semantic Search Query
For demonstration purposes, I wanted to create a simple query that would yield a result not easily found through a keyword search. During the interview, Sam Altman frequently mentioned social media, specifically referring to Facebook and Twitter by name.
When the quantity of text exceeds what can fit in a single prompt, as with this interview transcript, an alternative approach is needed to gain insights from the text corpus.
The approach used in this article breaks the text into smaller chunks of roughly 500 characters each. An OpenAI embedding model is then used to obtain a numerical embedding for each chunk, represented as a list of 1536 floating-point numbers.
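A chunker along these lines can produce the ~500-character pieces; the function name and the greedy word-packing strategy are illustrative choices, not the article's exact code. Each resulting chunk would then be sent to OpenAI's embedding endpoint to obtain its 1536-dimensional vector.

```python
def chunk_text(text, max_chars=500):
    """Greedily pack whole words into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks

transcript = "word " * 400  # stand-in for the interview transcript
pieces = chunk_text(transcript)
print(len(pieces), max(len(p) for p in pieces))
```

Splitting on whitespace keeps words intact; a production version might instead split on sentence boundaries so each chunk stays semantically coherent.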
To facilitate further analysis and retrieval, each chunk, along with its corresponding embedding and an assigned ID, was added to a collection in ChromaDB. Once all the text chunks were processed and added, the collection was saved to disk for persistence.
By employing this methodology, a range of semantic search queries can be executed on the stored embeddings within the ChromaDB collection. These queries enable the extraction of valuable insights and information by identifying relevant similarities and patterns within the embeddings.
While the coding details of this process are not covered in this article, they can be explored in future articles that delve into the implementation details of this approach.
By breaking down the text, leveraging embeddings, and utilizing semantic search, it becomes feasible to gain insights from large text corpora, even those that exceed the prompt-size limit. This method offers an effective means of extracting valuable information from extensive textual data.
The query I utilized was as follows:
“What did Sam have to say about social media?”
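Under the hood, the query string is itself embedded and compared against every stored chunk, and the top-k nearest chunks are returned. Since OpenAI embeddings come back unit-normalized, cosine similarity reduces to a plain dot product. The sketch below shows that ranking step with toy 2-dimensional unit vectors in place of real 1536-dimensional embeddings and Chroma's indexed search.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Toy corpus of (chunk ID -> unit-length embedding); values are illustrative.
corpus = {
    "chunk-1": [0.98, 0.20],
    "chunk-2": [0.71, 0.71],
    "chunk-3": [0.20, 0.98],
}
query_embedding = [0.95, 0.31]  # stand-in for the embedded query text

# Rank all chunks by similarity to the query and keep the best two.
top_k = sorted(corpus, key=lambda cid: dot(query_embedding, corpus[cid]), reverse=True)[:2]
print(top_k)  # ['chunk-1', 'chunk-2']
```

In ChromaDB itself this whole step is one call: `collection.query(...)` with the embedded query and `n_results=5`, which is how the top 5 segments below were requested.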
The semantic search returned the following segments of the interview, ranked by their similarity to the query. As part of the query, I requested the top 5 segments.

"And it's also not about the mass spread and the challenges that I think may have made the Twitter and Facebook and others have struggled with so much. So we will have very significant challenges, but they'll be very new and very different. And maybe, yeah, very new, very different is a good way to put it. There could be truths that are harmful and there are truths, I don't know, group differences in IQ. There you go. Scientific work that when spoken might do more harm."
"Man, remember when the social media platforms were banning people for saying it was a lab leak? Yeah. That's really humbling. The humbling, the overreach of power in censorship. But the more powerful GPT becomes, the more pressure there will be to censor. We have a different set of challenges faced by the previous generation of companies, which is people talk about free speech issues with GPT, but it's not quite the same thing. It's not like this is a computer program, what it's allowed to say."
"I mean, it's starting to get more, I guess. - So these systems deployed at scale can shift the winds of geopolitics and so on. - How would we know if like on Twitter we were mostly having like LLMs direct the, whatever's flowing through that hive mind? Yeah, on Twitter and then perhaps beyond. And then as on Twitter, so everywhere else eventually. Yeah, how would we know? My statement is we wouldn't. And that's a real danger."
"- But from what I hear people using it for, at least the people I talk to, and from what I see on Twitter, we are definitely mostly good, but, A, not all of us are, all of the time, And B, we really wanna push on the edges of these systems. We really wanna test out some darker theories of the world. - Yeah, it's very interesting. It's very interesting."
"People are like, you know, in the debates about what showed up in the Facebook feed, I, having listened to a lot of people talk about that, everyone is like, well, it doesn't matter what's in my feed because I won't be radicalized. I can handle anything. But I really worry about what Facebook shows you. I would love it if there's some way, which I think my interaction with GPT has already done that, some way to, in a nuanced way, present the tension of ideas."
Upon reviewing these excerpts from the interview, it becomes evident that either the phrase “social media” is explicitly mentioned or there are direct references to Twitter or Facebook. In essence, the text embeddings generated by the OpenAI model successfully captured the subtlety that Twitter and Facebook are companies within the realm of social media. This highlights the inherent strength of semantic search.
ChatGPT
At the beginning of this article, we discussed the potential of integrating the capabilities of LLMs, such as GPT-4, with vector databases. To showcase this integration, we will now take the text obtained from the semantic search and incorporate it into a prompt for ChatGPT.
The prompt was as follows:
“Summarize what Sam had to say about Social Media based on the following excerpts from a recent interview: [text excerpts ...]”
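Assembling that prompt is plain string work: the retrieved excerpts are concatenated beneath the instruction before being sent to the chat model. The variable names and truncated excerpt texts below are illustrative.

```python
# Stand-ins for the top-5 chunks returned by the semantic search.
excerpts = [
    "And it's also not about the mass spread and the challenges ...",
    "Man, remember when the social media platforms were banning people ...",
]

# Instruction first, then each excerpt separated by a blank line.
prompt = (
    "Summarize what Sam had to say about Social Media based on the "
    "following excerpts from a recent interview:\n\n"
    + "\n\n".join(excerpts)
)
print(prompt.splitlines()[0])
```

The finished string is then passed to the chat model as a single user message; nothing about the retrieval step is visible to the model beyond the excerpts themselves.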
Sam Altman makes several statements regarding the challenges faced by social media platforms and the potential risks associated with advanced AI systems like GPT. Here is a summary of the key points:
In summary, Altman raises concerns about the challenges faced by social media platforms, the potential dangers associated with AI systems like GPT, and the need for nuanced and balanced content presentation to mitigate risks and promote healthy discussions.
This was a very simple prompt, and ChatGPT chose to format the output as a list of key points. I could just as easily have asked for a summary suitable for a blog post or YouTube video.
The key idea is that I utilized NLP techniques, specifically semantic search, to extract pertinent sections from a vast text collection stored in a vector database. I then incorporated these extracted excerpts as a prompt for a language model, enabling me to obtain additional insights in any desired format.
Summary
This article has delved into the innovative approach of combining vector databases with Large Language Models (LLMs) to unlock a new level of insights. By leveraging the power of Natural Language Processing (NLP) techniques like semantic search, we have shown how to extract relevant sections from a large text corpus stored in a vector database. These extracted excerpts were then used as part of a prompt in ChatGPT, enabling the generation of comprehensive, contextualized insights.
Just a few months ago, extending the knowledge of LLMs like GPT-4 seemed possible only through costly fine-tuning. Vector databases have changed that entirely. They provide a means of persisting institutional knowledge held in large text-based repositories, and, integrated with LLMs, they can deliver practical problem-solving and even transformative insights.