Vector Databases - What you need to know
Dennis Layton
A senior IT architect, now retired. I remain a proponent of AI literacy and the safe and ethical adoption of AI. I write regularly on LinkedIn and Medium (Dennis Layton).
In the era of big data and artificial intelligence, organizations are constantly seeking innovative ways to manage, analyze, and extract value from vast amounts of information, particularly information in unstructured form. Vector databases have recently emerged as a powerful way to organize and search complex, high-dimensional data. Combined with Large Language Models (LLMs) such as OpenAI's GPT-4, they pave the way for an array of applications that leverage natural language processing, machine learning, and deep learning. From improving search engine relevance to generating personalized content, this combination promises to unlock potential across a wide range of industries.
In 2021, Pinecone launched a vector database primarily targeting data scientists. More recently, the company has started emphasizing AI-driven semantic search. With the emergence of AI powered by Large Language Models (LLMs), businesses are recognizing the increased value of vector databases. Investors share this sentiment: the company recently announced a $100 million Series B investment at a post-money valuation of $750 million. Pinecone is not the only vector database attracting funding; Qdrant, Zilliz, and Chroma have also received investments.
So, what is the reason behind this sudden interest in vector databases and what exactly are they?
Key Concepts
Vector databases are an entirely new type of database, so it’s important to understand a few key concepts.
High-Dimensional Data
In the context of data analysis and statistics, “high-dimensional” refers to datasets that have a large number of variables or features. Each variable represents a different characteristic or attribute of the data, and the more variables there are, the higher the dimensionality of the dataset.
In the context of Natural Language Processing (NLP), “high-dimensional” data typically refers to representing textual information using a large number of features like semantic meaning, syntactic relationships, or contextual information.
High-dimensional word embeddings can have hundreds or even thousands of dimensions. For example, in OpenAI’s embedding model, there are 1536 dimensions.
Embeddings
Embeddings are compact numerical representations of complex data objects, such as words, sentences, images, or graphs, in a continuous vector space. These representations capture the essential properties of the objects and the relationships between them. Embeddings provide the high-dimensional vectors that are stored, indexed, and retrieved in vector databases.
Vector Database
A vector database is a type of database specifically designed to store, index, and retrieve high-dimensional vectors efficiently. In other words, a vector database is a means of persisting embeddings so that they can be reused.
Semantic Search
Semantic search is the process of searching for information or documents based on the meaning and context of the query, rather than relying solely on keyword matching. It involves understanding the intent of the user’s query and the relationships between words, phrases, or concepts within the data.
Embeddings provide compact numerical representations of data objects that capture their meaning and relationships. They enable the measurement of semantic similarity and facilitate efficient similarity search.
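Semantic similarity between two embeddings is usually measured with cosine similarity: the cosine of the angle between the two vectors, where values near 1.0 mean the texts point in nearly the same semantic direction. A minimal sketch follows; the three-dimensional vectors and the example words are toy stand-ins for real 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values, not from a real model).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
banana = [0.1, 0.2, 0.95]

# Semantically related words should score higher than unrelated ones.
print(cosine_similarity(king, queen) > cosine_similarity(king, banana))  # True
```

A vector database applies this same idea at scale, using approximate-nearest-neighbor indexes so it does not have to compare the query against every stored vector one by one.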
Introducing ChromaDB
For the rest of this article, I will use ChromaDB, a commercially backed open-source vector database. You can learn more about Chroma on the project's website.
In ChromaDB, each stored item pairs a piece of text with its embedding. A set of these items is known as a Collection. Collections are where you'll store your embeddings, text, and any additional metadata. You create a collection with a name.
You assign your own unique IDs to identify items in the collection. You can also attach metadata to each item: a dictionary of key-value pairs. If your text is broken into chapters and verses, you can assign the corresponding key-value pairs to each piece of text, and this metadata can be used later to filter a query. That said, I won't go any further into metadata in this introductory article.
Sam Altman Interview
For this article, I am going to use a transcript from a recent interview with Sam Altman which covers a wide range of topics. As an aside, this transcript was generated using OpenAI's Whisper API.
The Semantic Search Query
For demonstration purposes, I wanted to create a simple query that would yield a result not easily found through a keyword search. During the interview, Sam Altman frequently mentioned social media, specifically referring to Facebook and Twitter by name.
When the quantity of text exceeds what can fit in a single prompt, as with this interview transcript, an alternative approach is needed to gain insights from the text corpus.
The approach used in this article breaks the text into smaller chunks of roughly 500 characters each. An OpenAI embedding model is then used to obtain a numerical embedding for each chunk, represented as a list of 1536 floating-point numbers.
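A chunker along these lines can produce the ~500-character pieces; the function name and the greedy word-packing strategy are illustrative choices, not the article's exact code. Each resulting chunk would then be sent to OpenAI's embedding endpoint to obtain its 1536-dimensional vector.

```python
def chunk_text(text, max_chars=500):
    """Greedily pack whole words into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for word in text.split():
        if current and len(current) + 1 + len(word) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        chunks.append(current)
    return chunks

transcript = "word " * 400  # stand-in for the interview transcript
pieces = chunk_text(transcript)
print(len(pieces), max(len(p) for p in pieces))
```

Splitting on whitespace keeps words intact; a production version might instead split on sentence boundaries so each chunk stays semantically coherent.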
To facilitate further analysis and retrieval, each chunk, along with its corresponding embedding and an assigned ID, was added to a collection in ChromaDB. Once all the text chunks were processed and added, the collection was saved to disk for persistence.
By employing this methodology, a range of semantic search queries can be executed on the stored embeddings within the ChromaDB collection. These queries enable the extraction of valuable insights and information by identifying relevant similarities and patterns within the embeddings.
While the coding details of this process are not covered in this article, they can be explored in future articles that delve into the implementation details of this approach.
By breaking down the text, leveraging embeddings, and utilizing semantic search, it becomes feasible to gain insights from large text corpora, even those that exceed the prompt-size limit. This method offers an effective means of extracting valuable information from extensive textual data.
The query I utilized was as follows:
“What did Sam have to say about social media?”
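Under the hood, the query string is itself embedded and compared against every stored chunk, and the top-k nearest chunks are returned. Since OpenAI embeddings come back unit-normalized, cosine similarity reduces to a plain dot product. The sketch below shows that ranking step with toy 2-dimensional unit vectors in place of real 1536-dimensional embeddings and Chroma's indexed search.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

# Toy corpus of (chunk ID -> unit-length embedding); values are illustrative.
corpus = {
    "chunk-1": [0.98, 0.20],
    "chunk-2": [0.71, 0.71],
    "chunk-3": [0.20, 0.98],
}
query_embedding = [0.95, 0.31]  # stand-in for the embedded query text

# Rank all chunks by similarity to the query and keep the best two.
top_k = sorted(corpus, key=lambda cid: dot(query_embedding, corpus[cid]), reverse=True)[:2]
print(top_k)  # ['chunk-1', 'chunk-2']
```

In ChromaDB itself this whole step is one call: `collection.query(...)` with the embedded query and `n_results=5`, which is how the top 5 segments below were requested.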
The semantic search returned the following segments of the interview, ranked by their similarity to the query. As part of the query, I requested the top 5 segments.

"And it's also not about the mass spread and the challenges that I think may have made the Twitter and Facebook and others have struggled with so much. So we will have very significant challenges, but they'll be very new and very different. And maybe, yeah, very new, very different is a good way to put it. There could be truths that are harmful and there are truths, I don't know, group differences in IQ. There you go. Scientific work that when spoken might do more harm."
"Man, remember when the social media platforms were banning people for saying it was a lab leak? Yeah. That's really humbling. The humbling, the overreach of power in censorship. But the more powerful GPT becomes, the more pressure there will be to censor. We have a different set of challenges faced by the previous generation of companies, which is people talk about free speech issues with GPT, but it's not quite the same thing. It's not like this is a computer program, what it's allowed to say."
"I mean, it's starting to get more, I guess. - So these systems deployed at scale can shift the winds of geopolitics and so on. - How would we know if like on Twitter we were mostly having like LLMs direct the, whatever's flowing through that hive mind? Yeah, on Twitter and then perhaps beyond. And then as on Twitter, so everywhere else eventually. Yeah, how would we know? My statement is we wouldn't. And that's a real danger."
"- But from what I hear people using it for, at least the people I talk to, and from what I see on Twitter, we are definitely mostly good, but, A, not all of us are, all of the time, And B, we really wanna push on the edges of these systems. We really wanna test out some darker theories of the world. - Yeah, it's very interesting. It's very interesting."
"People are like, you know, in the debates about what showed up in the Facebook feed, I, having listened to a lot of people talk about that, everyone is like, well, it doesn't matter what's in my feed because I won't be radicalized. I can handle anything. But I really worry about what Facebook shows you. I would love it if there's some way, which I think my interaction with GPT has already done that, some way to, in a nuanced way, present the tension of ideas."
Upon reviewing these excerpts from the interview, it becomes evident that either the phrase “social media” is explicitly mentioned or there are direct references to Twitter or Facebook. In essence, the text embeddings generated by the OpenAI model successfully captured the subtlety that Twitter and Facebook are companies within the realm of social media. This highlights the inherent strength of semantic search.
ChatGPT
At the beginning of this article, we discussed the potential of integrating the capabilities of LLMs, such as GPT-4, with vector databases. To showcase this integration, we will now take the text obtained from the semantic search and incorporate it into a prompt for ChatGPT.
The prompt was as follows:
“Summarize what Sam had to say about Social Media based on the following excerpts from a recent interview: [text excerpts ...]”
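Assembling that prompt is plain string work: the retrieved excerpts are concatenated beneath the instruction before being sent to the chat model. The variable names and truncated excerpt texts below are illustrative.

```python
# Stand-ins for the top-5 chunks returned by the semantic search.
excerpts = [
    "And it's also not about the mass spread and the challenges ...",
    "Man, remember when the social media platforms were banning people ...",
]

# Instruction first, then each excerpt separated by a blank line.
prompt = (
    "Summarize what Sam had to say about Social Media based on the "
    "following excerpts from a recent interview:\n\n"
    + "\n\n".join(excerpts)
)
print(prompt.splitlines()[0])
```

The finished string is then passed to the chat model as a single user message; nothing about the retrieval step is visible to the model beyond the excerpts themselves.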
Sam Altman makes several statements regarding the challenges faced by social media platforms and the potential risks associated with advanced AI systems like GPT. Here is a summary of the key points:
In summary, Altman raises concerns about the challenges faced by social media platforms, the potential dangers associated with AI systems like GPT, and the need for nuanced and balanced content presentation to mitigate risks and promote healthy discussions.
This was a very simple prompt, and ChatGPT chose to format the output as a list of key points. I could just as easily have asked for a summary suitable for a blog post or YouTube video.
The key idea is that I utilized NLP techniques, specifically semantic search, to extract pertinent sections from a vast text collection stored in a vector database. I then incorporated these extracted excerpts as a prompt for a language model, enabling me to obtain additional insights in any desired format.
Summary
This article has delved into the innovative approach of combining vector databases with Large Language Models (LLMs) to unlock a new level of insights. By leveraging the power of Natural Language Processing (NLP) techniques like semantic search, we have shown how to extract relevant sections from a large text corpus stored in a vector database. These extracted excerpts were then used as part of a prompt in ChatGPT, enabling the generation of comprehensive, contextualized insights.
Just a few months ago, extending the knowledge of LLMs like GPT-4 seemed possible only through costly fine-tuning. Vector databases have changed that entirely. They provide a means of persisting institutional knowledge held in large text-based repositories, and, integrated with LLMs, they can deliver practical problem-solving and even transformative insights.