Exploring Chroma DB: A Python Approach in Jupyter Notebooks
Rany ElHousieny, PhD
SENIOR SOFTWARE ENGINEERING MANAGER (EX-Microsoft) | Generative AI / LLM / ML / AI Engineering Manager | AWS SOLUTIONS ARCHITECT CERTIFIED | LLM and Machine Learning Engineer | AI Architect
Chroma DB is an open-source vector database designed to support AI applications by storing and querying embeddings efficiently. A common recent use is retrieval-augmented generation (RAG), where it supplies relevant documents to a language model. For data scientists and AI practitioners, using Chroma DB from Python inside a Jupyter Notebook is an accessible way to bring this technology into their workflows. Below, we discuss how to get started with Chroma DB using Python, with an emphasis on practical examples you can execute in a Jupyter Notebook.
Getting Started with Chroma DB in Jupyter Notebooks
To start using Chroma DB in a Jupyter Notebook, ensure you have Jupyter installed, which is easily available via Anaconda or by running pip install notebook. Then, you need to install the Chroma DB Python package using pip:
!pip install chromadb
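After installing, it can be useful to confirm that the package is visible to the notebook's kernel. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of Chroma DB.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the chromadb version if the install succeeded, otherwise None.
print(installed_version("chromadb"))
```

If this prints None right after a successful install, the notebook kernel is probably using a different Python environment than the one pip installed into.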
Launching Chroma DB Server
With Chroma DB installed, the next step is to launch the Chroma DB server. This step might typically be done outside the Jupyter environment in a terminal, but we can use Jupyter's '!' command to execute shell commands.
!chroma run --path /your/db/path
However, this command blocks the notebook cell for as long as the server is running, so run it in a separate terminal window instead.
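Once the server is started in a terminal, you can check from the notebook whether something is accepting connections. This is a rough sketch using only the standard library; the helper name server_is_listening is hypothetical, and port 8000 is assumed to be the port your server uses (adjust it if you started the server on another port). It only tests that a TCP port is open, not that the process behind it is actually Chroma DB.

```python
import socket


def server_is_listening(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


print(server_is_listening())
```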
Interacting with Chroma DB
With the server up and running, you can now start interacting with Chroma DB through its Python client. Here's how you can create a new collection, add documents, and query the collection, all within your Jupyter notebook.
1 - Create a Chroma DB Client:
import chromadb
client = chromadb.HttpClient()
When you create an HttpClient instance in your Python code, it establishes a connection to the running Chroma DB server and communicates with it via HTTP requests. The server's address and port can be configured when you initialize the client. Since we are running the server locally with default settings, the client's defaults (localhost, port 8000) already point at it, so there is no need to specify the address.
If you monitor the server output, you will find the following two events after you establish the client connection:
INFO: [09-04-2024 17:53:31] ::1:56895 - "GET /api/v1/tenants/default_tenant HTTP/1.1" 200
INFO: [09-04-2024 17:53:31] ::1:56895 - "GET /api/v1/databases/default_database?tenant=default_tenant HTTP/1.1" 200
If you do not have the server running, you will get the following error:
ValueError: Could not connect to a Chroma server. Are you sure it is running?
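If the server takes a moment to start, the first connection attempt may fail with the ValueError shown above. One way to handle this is a small retry loop. This is a sketch under assumptions: the helper connect_with_retry is hypothetical, and it takes the client constructor as a callable so the retry logic stays independent of Chroma DB itself.

```python
import time


def connect_with_retry(make_client, attempts: int = 5, delay: float = 1.0):
    """Call make_client() until it succeeds or the attempts run out.

    make_client is any zero-argument callable that returns a client and
    raises ValueError while the server is unreachable.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return make_client()
        except ValueError:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay)


# Usage in the notebook (requires chromadb and a running server):
# import chromadb
# client = connect_with_retry(lambda: chromadb.HttpClient(host="localhost", port=8000))
```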
2 - Create a Collection:
collection = client.create_collection("my_embeddings")
Here is what this line does. When it is executed, the client sends a request to Chroma DB to create a new collection with the name "my_embeddings". After the collection is created, it can be used to store, retrieve, and manipulate embeddings as needed by the application. The collection variable holds a reference to this newly created collection, which allows you to perform further operations on it, such as adding documents, querying, or updating entries.
You can check the server's log output in the terminal, where the following line will appear:
INFO: [09-04-2024 17:57:53] ::1:52005 - "POST /api/v1/collections?tenant=default_tenant&database=default_database HTTP/1.1" 200
This log line shows that the client sent an HTTP POST request to the Chroma DB server to create a new collection within the default tenant and database, and that the request completed successfully (status 200).
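In a notebook, cells are often re-run, and creating a collection whose name already exists typically fails. The Python client also provides get_or_create_collection, which returns the existing collection when there is one. The wrapper below is a minimal sketch; the helper name open_collection is hypothetical.

```python
def open_collection(client, name: str):
    """Return the collection called `name`, creating it only if it is missing."""
    return client.get_or_create_collection(name)


# Usage in the notebook, with the client created in step 1:
# collection = open_collection(client, "my_embeddings")
```

With this variant, re-running the cell keeps any documents that were already added to the collection.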
3 - Add Documents to the Collection:
When adding documents, Chroma DB handles tokenization and embedding automatically. You can also add metadata for each document.
documents = ["The quick brown fox", "Jumps over the lazy dog"]
metadatas = [{"text_length": 19}, {"text_length": 23}]
ids = ["doc1", "doc2"]
collection.add(documents=documents, metadatas=metadatas, ids=ids)
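Here's what happens when this runs: each document is embedded using the collection's embedding function, and the resulting embedding is stored together with the document text, its metadata, and its ID.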
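The ids and text_length values in the example above were written by hand. For longer document lists they can be derived in plain Python before calling collection.add. This is a small sketch; the helper name build_records and the "doc1", "doc2" naming scheme are illustrative choices, not a Chroma DB requirement.

```python
def build_records(documents):
    """Derive ids ("doc1", "doc2", ...) and text_length metadata for each document."""
    ids = [f"doc{i}" for i in range(1, len(documents) + 1)]
    metadatas = [{"text_length": len(doc)} for doc in documents]
    return ids, metadatas


documents = ["The quick brown fox", "Jumps over the lazy dog"]
ids, metadatas = build_records(documents)
print(ids)        # → ['doc1', 'doc2']
print(metadatas)  # → [{'text_length': 19}, {'text_length': 23}]

# With the collection from step 2:
# collection.add(documents=documents, metadatas=metadatas, ids=ids)
```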
4 - Query the Collection:
Once your documents are indexed, you can perform a query. Here's how to retrieve the documents most similar to a given query text:
query_results = collection.query(query_texts=["Find documents similar to this text"], n_results=2)
print(query_results)
{'data': None,
'distances': [[1.8206149414329449, 1.890620414904181]],
'documents': [['The quick brown fox', 'Jumps over the lazy dog']],
'embeddings': None,
'ids': [['doc1', 'doc2']],
'metadatas': [[{'text_length': 19}, {'text_length': 23}]],
'uris': None}
The query operation underpins applications such as semantic search and recommendation systems, and any other task that requires finding the items in a large dataset that are most similar to an input query.
The output of the query is a dictionary containing the results of the search operation performed on the Chroma DB collection. Here's what each key in this dictionary represents:
- 'data': This field is None, indicating that no additional data was returned by the query beyond the standard fields.
- 'distances':
'distances': [[1.8206149414329449, 1.890620414904181]],
This list contains sublists, each corresponding to one query input. In this case, there is one sublist with two values. These values are the distances between the query text's embedding and the embeddings of the documents returned by the query. Lower distances mean higher similarity. Here, the distances [1.8206149414329449, 1.890620414904181] suggest that the first document ("The quick brown fox") is slightly more similar to the query text than the second document ("Jumps over the lazy dog") because it has a smaller distance.
- 'documents':
'documents': [['The quick brown fox', 'Jumps over the lazy dog']],
This list contains sublists of documents that were returned by the query. Each sublist corresponds to one query input. Since there was only one query text, there's a single sublist containing the two documents most similar to the query text: "The quick brown fox" and "Jumps over the lazy dog".
- 'embeddings':
This field is None, which indicates that the actual embeddings of the returned documents were not included in the query results. This is often the case when only the documents themselves are needed, and not their vector representations.
To retrieve the embeddings of the returned documents, pass include=['embeddings'] to the query, as follows:
query_results = collection.query(
    query_texts=["Find documents similar to this text"],
    n_results=2,
    include=['embeddings']  # or include=['metadatas']
)
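In this adjusted query, the include parameter is a list of the fields you want the database to return. The list replaces the default set of fields rather than extending it: by passing only 'embeddings', you get the embeddings and the IDs (which are always returned), while the documents, metadatas, and distances come back as None. To get several fields at once, list them all, for example include=['embeddings', 'documents']. Here is the result: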
{'data': None,
'distances': None,
'documents': None,
'embeddings': [[[0.002767008962109685,
0.033265210688114166,
-0.0006877018604427576,
0.042998284101486206,
0.036148350685834885,
-0.033342938870191574,
0.05550643801689148,
-0.10481128096580505,
0.013740118592977524,
-0.012425346300005913,
0.006475344765931368,
-0.03193841874599457,
-0.06048757955431938,
0.010666296817362309,
-0.03226395696401596,
-0.02862873114645481,
-0.005726168397814035,
-0.050810422748327255,
-0.00272698444314301,
-0.04731421545147896,
-0.144442617893219,
0.005216713063418865,
...
0.08438645303249359]]],
'ids': [['doc1', 'doc2']],
'metadatas': None,
'uris': None}
- 'ids': This list contains sublists of the unique identifiers for the returned documents. Like 'documents', each sublist corresponds to one query input, with "doc1" and "doc2" being the IDs of the returned documents.
- 'metadatas': This list contains sublists of metadata for the returned documents, with each sublist corresponding to one query input. The metadata here includes a dictionary for each document, containing the length of the text: {'text_length': 19} for the first document and {'text_length': 23} for the second.
- 'uris': This field is None, suggesting that there are no URIs (Uniform Resource Identifiers) associated with the returned documents in this query. URIs, if used, would typically provide a way to access the document or its location in a database or on the internet.
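To make the 'distances' values more concrete, here is a sketch of how such a distance can be computed and used for ranking. It assumes the collection uses squared Euclidean (L2) distance, which to my knowledge is Chroma DB's default when no other distance is configured; the vectors are toy values invented for illustration, not real embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 3-dimensional vectors, invented for illustration only.
query = [1.0, 0.0, 0.0]
candidates = {"near": [0.9, 0.1, 0.0], "far": [0.0, 1.0, 0.0]}

# Smaller distance means more similar, so sort in ascending order.
ranked = sorted(candidates, key=lambda name: squared_l2(query, candidates[name]))
print(ranked)  # → ['near', 'far']
```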
In summary, the query returned two documents in response to the search query. For each document, it provides the text, its ID, the metadata about text length, and the distance from the query, which indicates its relevance to the search term. Embeddings, data, and URIs are None because a query returns only documents, metadatas, and distances (plus IDs) unless the include parameter says otherwise.
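Because the result stores each field as a list of lists (one inner list per query text), it is often handy to regroup it into one record per document. The sketch below does this in plain Python using the output shown earlier; the helper name rows_for_query and the singular key names are illustrative choices.

```python
# Maps each result field to the key used in the per-document records.
FIELD_NAMES = {
    "ids": "id",
    "documents": "document",
    "metadatas": "metadata",
    "distances": "distance",
}


def rows_for_query(results, query_index=0):
    """Zip the parallel lists for one query into a list of per-document dicts."""
    columns = {}
    for field, key in FIELD_NAMES.items():
        value = results.get(field)
        if value is not None:  # fields that were not requested come back as None
            columns[key] = value[query_index]
    count = len(columns["id"])  # ids are present in every result
    return [{key: col[i] for key, col in columns.items()} for i in range(count)]


# The query output shown earlier in this article.
query_results = {
    "data": None,
    "distances": [[1.8206149414329449, 1.890620414904181]],
    "documents": [["The quick brown fox", "Jumps over the lazy dog"]],
    "embeddings": None,
    "ids": [["doc1", "doc2"]],
    "metadatas": [[{"text_length": 19}, {"text_length": 23}]],
    "uris": None,
}

for row in rows_for_query(query_results):
    print(row)
```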
Advanced Querying with Metadata Filters:
Advanced querying can be done using metadata filters. For example, if you want to find documents of a certain length, you can use:
filtered_results = collection.query(
    query_texts=["Find documents similar to this text"],
    n_results=2,
    where={"text_length": {"$gt": 20}}
)
print(filtered_results)
This returns only documents whose text_length metadata value is greater than 20. Here, only doc2 qualifies:
{'data': None,
'distances': [[1.890620414904181]],
'documents': [['Jumps over the lazy dog']],
'embeddings': None,
'ids': [['doc2']],
'metadatas': [[{'text_length': 23}]],
'uris': None}
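To illustrate what a where filter means, the following sketch applies the same kind of condition to the two metadata dictionaries in plain Python. It is not Chroma DB's implementation, only a model of the semantics for a handful of comparison operators; the names OPERATORS and matches are hypothetical.

```python
import operator

# Comparison operators modelled in this sketch.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if metadata satisfies every condition in where."""
    for key, condition in where.items():
        if key not in metadata:
            return False
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # treat {"k": v} as an equality test
        for op, expected in condition.items():
            if not OPERATORS[op](metadata[key], expected):
                return False
    return True


records = {"doc1": {"text_length": 19}, "doc2": {"text_length": 23}}
kept = [doc_id for doc_id, meta in records.items()
        if matches(meta, {"text_length": {"$gt": 20}})]
print(kept)  # → ['doc2']
```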
Conclusion
Chroma DB makes it straightforward to store and search embeddings from Python. By following the steps outlined above, you can run a Chroma DB server, connect to it from a Jupyter Notebook, create a collection, add documents with metadata, and query by similarity with optional metadata filters. Whether you are working on NLP tasks, building recommendation systems, or developing another AI-driven application, Chroma DB can simplify how you handle and use embeddings.