23-5 Chroma Vector Store with Langchain - Local Retrieval with HTML docs
I had trouble deciding which Vector DB to examine next after Pinecone. This handy diagram neatly divides the Vector DB landscape into primarily Self-Hosted/Open-Source and Managed DB in the Cloud, which gives some clue as to which to cover.
Chroma DB was selected in Langchain's State of AI 2023 as the developers' choice for local vector storage.
How we interpret the results is really up to us and what we need for our RAG applications. On a more wary note, it is too early to determine which Vector DB is best overall, and the results are likely weighted in favour of Langchain developers' preferences. Those who are not Langchain users might well prefer other options, but we simply cannot know. Still, we need to start somewhere.
Upon reviewing Chroma's homepage, the pitch is straightforward: tools for embedding, storing, and querying your data, with an emphasis on simplicity and developer productivity. Chroma offers a self-hosted option as well as a managed cloud service (to be released soon).
Getting started really is that simple, and it runs in a notebook:
pip install chromadb
import chromadb
This installs chromadb locally and provides the Python SDK to interact with the vector store. No sign-up or API keys needed.
Initializing the client is similar to how we did for Pinecone.
chroma_client = chromadb.Client()
Instead of an Index, Chroma calls it a Collection. The following command creates "my_collection":
collection = chroma_client.create_collection(name="my_collection")
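Before we hand things over to Langchain, here is a minimal sketch of what the raw Chroma client can already do with that collection. The toy documents and ids below are hypothetical placeholders; by default Chroma embeds them with its built-in embedding function.
# add a couple of toy documents; Chroma embeds them with its default embedding function
collection.add(
    documents=["Chroma is an open-source vector store", "Pinecone is a managed vector database"],
    ids=["doc1", "doc2"],
)
# query by text; Chroma embeds the query and returns the closest matches
results = collection.query(query_texts=["Which vector store is open source?"], n_results=1)
print(results["documents"])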
Chroma's diagram neatly sums up a typical RAG life cycle. For now we focus on Queries, Embeddings, and Retrieval; in another session, we can cover Generation.
!pip install chromadb openai langchain tiktoken
I will be using OpenAI's Prompt Engineering Guide (Prompt engineering - OpenAI API) as our reference material.
We will need to convert it to a format we can load into Chroma and then embed with OpenAI embeddings. Langchain does support parsing content directly from a URL, but that is beyond our scope here.
Clipper is a Node.js command-line tool that lets you easily clip content from web pages and convert it to Markdown. It requires the Terminal and some preparation, which would bog us down here. Use it for converting online sources in bulk, keeping copyright and fair-use policies in mind.
For simplicity, I will use MarkDownload, a browser extension: MarkDownload - Markdown Web Clipper (google.com). It downloads the Markdown-formatted file to your Downloads directory.
Let's simplify the process of integrating and embedding with Chroma using Langchain.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os
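# NOTE (assumption): OpenAIEmbeddings below expects an OpenAI key in the environment,
# e.g. os.environ["OPENAI_API_KEY"] = "sk-..."  -- set yours before running this cell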
# load the document and split it into chunks
loader = TextLoader("name_of_the_file.txt")
documents = loader.load()
We will upload the Markdown file into Colab; the code above loads the file and stores it in documents. Simply drag and drop the file into the Colab Files pane. Note that the file is not stored permanently and will be lost when the session terminates.
# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(documents)
# create the OpenAI embedding function
embedding_function = OpenAIEmbeddings(model="text-embedding-ada-002")
We can't embed all the text as one huge chunk. Langchain offers various ways to split the text into chunks of a desired size with overlap; the overlap helps retain context that carries over from the previous entry, sentence, or paragraph.
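As a quick sanity check (a small sketch, assuming the splitter code above has already run), we can look at how many chunks were produced and where each chunk starts in the source file:
# how many chunks did the splitter produce?
print(len(all_splits))
# add_start_index=True records each chunk's character offset within the original document
print(all_splits[0].metadata)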
We also need an embedding function. We use Langchain's OpenAIEmbeddings class with the text-embedding-ada-002 model.
# load it into Chroma
db = Chroma.from_documents(all_splits, embedding_function)
We now store the split documents in Chroma, with Langchain embedding each document using the embedding function we created earlier. As you can see, a lot of detail has been abstracted away.
# query it
query = "What is a good tactic for hiding inner monologue?"
docs = db.similarity_search(query)
print(docs)
# print results
print(docs[0].page_content)
We are now ready to query our split documents from OpenAI's prompt engineering guide. The top result is shown below.
The previous tactic demonstrates that it is sometimes important for the model to reason in detail about a problem before answering a specific question. For some applications, the reasoning process that a model uses to arrive at a final answer would be inappropriate to share with the user. For example, in tutoring applications we may want to encourage students to work out their own answers, but a model’s reasoning process about the student’s solution could reveal the answer to the student. Inner monologue is a tactic that can be used to mitigate this. The idea of inner monologue is to instruct the model to put parts of the output that are meant to be hidden from the user into a structured format that makes parsing them easy. Then before presenting the output to the user, the output is parsed and only part of the output is made visible.
Unlike Pinecone, the Chroma DB we generated here will not persist. We can either save the data so it can be recalled again, or host Chroma externally outside of our compute notebook, where storage is ephemeral. A good local option is Docker: Deployment | Chroma (trychroma.com).
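If you only need the data to survive a notebook restart, one option (a minimal sketch; the directory name is an arbitrary choice) is to pass a persist_directory so Chroma writes to local disk. For an externally hosted Chroma, such as the Docker deployment, the chromadb HTTP client can be handed to Langchain instead:
# persist to local disk so the collection survives notebook restarts
db = Chroma.from_documents(all_splits, embedding_function, persist_directory="./chroma_db")
# later, or in a new session, reload the persisted collection
db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
# or point Langchain at a self-hosted Chroma server (e.g. running in Docker)
client = chromadb.HttpClient(host="localhost", port=8000)
db = Chroma(client=client, embedding_function=embedding_function)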
query = "What is the work around for fixed context length, during a dialogue between a user and an assistant"
retriever = db.as_retriever(search_type="mmr")
retriever.get_relevant_documents(query)[0]
Besides matching results by closest (cosine) distance, we can retrieve relevant documents using MMR (maximal marginal relevance), which can be handier for providing context for an LLM to respond to user queries.
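If we want finer control over MMR, the retriever also accepts search_kwargs (a sketch; the k and fetch_k values here are arbitrary):
# k = number of documents returned; fetch_k = size of the candidate pool MMR re-ranks for diversity
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 4, "fetch_k": 20})
docs = retriever.get_relevant_documents(query)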
Other methods, such as filtering based on metadata, are also possible. This is important for regulatory and role-based access control. For instance, in a document management system with vector search enabled, we want to ensure that the Commercial user group in Pharma cannot examine confidential documents or information belonging to the Medical/Clinical team. Attaching metadata and extensive controls before any data is stored in the system helps prevent unwanted retrieval and generation.
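As an illustration only (the metadata key and value here are hypothetical and not part of our prompt-engineering example), Langchain's Chroma wrapper accepts a filter that is matched against each chunk's metadata at query time:
# hypothetical: suppose each chunk was stored with a "department" metadata field
docs = db.similarity_search(
    "confidential trial results",
    filter={"department": "medical"},  # only chunks tagged for the Medical team are considered
)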
I hope this has given you a sufficient nudge to try Chroma out for yourself and integrate it into your applications and use cases.
Let's push ahead with the next Vector DB, Qdrant. Note that I will be posting more detailed tutorials on Chroma at a future date, once the hosted service is available.
Here are today's notebooks to try out the code yourself.