Exploring Chroma DB: A Python Approach in Jupyter Notebooks

Chroma DB is an open-source vector database designed to support AI applications by storing and searching embeddings efficiently. It has recently become a common choice for supporting retrieval-augmented generation (RAG). For data scientists and AI practitioners, using Chroma DB from Python in a Jupyter Notebook is an accessible yet powerful way to bring this technology into their workflows. Below, we discuss how to get started with Chroma DB using Python, with an emphasis on practical examples you can execute in a Jupyter Notebook.

Getting Started with Chroma DB in Jupyter Notebooks

To start using Chroma DB in a Jupyter Notebook, ensure you have Jupyter installed, which is easily available via Anaconda or by running pip install notebook. Then, you need to install the Chroma DB Python package using pip:

!pip install chromadb        


Launching Chroma DB Server

With Chroma DB installed, the next step is to launch the Chroma DB server. This step might typically be done outside the Jupyter environment in a terminal, but we can use Jupyter's '!' command to execute shell commands.

!chroma run --path /your/db/path        

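However, the cell will then block the notebook for as long as the server runs. It is better to run the same command (without the leading '!') in a separate terminal window.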


Interacting with Chroma DB

With the server up and running, you can now start interacting with Chroma DB through its Python client. Here's how you can create a new collection, add documents, and query the collection, all within your Jupyter notebook.

1 - Create a Chroma DB Client:

import chromadb
client = chromadb.HttpClient()        

When you create an HttpClient instance in your Python code, it establishes a connection to the running Chroma DB server. The client communicates with the server via HTTP requests. The server's address and port can be configured when you initialize the client. Since we are running the server locally with default settings (localhost, port 8000), the client connects to it without the address being specified.

If you monitor the server output, you will find the following two events after you establish the client connection:

INFO:     [09-04-2024 17:53:31] ::1:56895 - "GET /api/v1/tenants/default_tenant HTTP/1.1" 200

INFO:     [09-04-2024 17:53:31] ::1:56895 - "GET /api/v1/databases/default_database?tenant=default_tenant HTTP/1.1" 200
        

If you do not have the server running, you will get the following error:

ValueError: Could not connect to a Chroma server. Are you sure it is running?
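
Before creating the client, it can help to check whether anything is listening on the server's port at all. The following is a minimal sketch using only the standard library; the helper name is_server_reachable is hypothetical, and the host and port assume the local default setup (localhost, port 8000). It only confirms that a TCP port accepts connections, not that the listener is actually Chroma DB.

```python
import socket


def is_server_reachable(host: str = "localhost", port: int = 8000, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Assumed local defaults; adjust if the server was started on another port.
if is_server_reachable():
    print("Something is listening on localhost:8000")
else:
    print("Nothing is listening on localhost:8000; start the server in a terminal first")
```

If the check fails, start the server in a terminal as described above and re-run the cell.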
        

2 - Create a Collection:

collection = client.create_collection("my_embeddings")        

Here’s what each part of the line does:

  • client: This is an instance of a client object that acts as an interface between the Python script and the Chroma DB. It allows the script to send commands to the database server.
  • .create_collection(): This is a method provided by the Chroma DB client. A collection in Chroma DB (and in many NoSQL databases) is analogous to a table in a relational database. It's a container for storing related data—in this case, embeddings.
  • "my_embeddings": This is the name given to the new collection that you're creating. The collection will be used to store embeddings, which are high-dimensional vectors typically used to represent complex data like text or images in a form that a machine learning model can understand and process.

When this line is executed, the client sends a request to Chroma DB to create a new collection with the name "my_embeddings". After the collection is created, it can be used to store, retrieve, and manipulate embeddings as needed by the application. The collection variable holds a reference to this newly created collection, which allows you to perform further operations on it, such as adding documents, querying, or updating entries.


You can check the server's log output, where the following line will appear:

INFO:     [09-04-2024 17:57:53] ::1:52005 - "POST /api/v1/collections?tenant=default_tenant&database=default_database HTTP/1.1" 200
        

This log line shows that the client sent an HTTP POST request to the Chroma DB server to create the new collection ("my_embeddings") within the default tenant and database, and that the request completed successfully (status 200).
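
If you want to inspect such log lines programmatically, a small parser is enough. This is a sketch that assumes the access-log layout shown above (a quoted "METHOD target HTTP/x.y" segment followed by the status code); the function name parse_access_line is hypothetical and other server versions may format their logs differently.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request segment and the status code that follows it.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+" (?P<status>\d{3})')


def parse_access_line(line):
    """Extract method, path, query parameters, and status from one log line."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None  # not an access-log line in the assumed layout
    target = urlsplit(match["target"])
    return {
        "method": match["method"],
        "path": target.path,
        # parse_qs returns lists of values; keep the first value of each key.
        "params": {key: values[0] for key, values in parse_qs(target.query).items()},
        "status": int(match["status"]),
    }


line = (
    'INFO:     [09-04-2024 17:57:53] ::1:52005 - '
    '"POST /api/v1/collections?tenant=default_tenant&database=default_database HTTP/1.1" 200'
)
print(parse_access_line(line))
```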

3 - Add Documents to the Collection:

When adding documents, Chroma DB handles tokenization and embedding automatically. You can also add metadata for each document.

documents = ["The quick brown fox", "Jumps over the lazy dog"]
metadatas = [{"text_length": 19}, {"text_length": 23}]
ids = ["doc1", "doc2"]

collection.add(documents=documents, metadatas=metadatas, ids=ids)
        

Here's what's happening step by step:

  1. documents: This is a list of strings, each string being a document that you want to store in the database. The documents here are "The quick brown fox" and "Jumps over the lazy dog". Joined together, they form a well-known English pangram, a sentence that uses every letter of the alphabet at least once.
  2. metadatas: This is a list of dictionaries where each dictionary contains metadata about the corresponding document in the documents list. Metadata is data that provides information about other data. In this case, the metadata is describing the length of each document in terms of the number of characters. So, the first document, "The quick brown fox", has 19 characters, and the second document, "Jumps over the lazy dog", has 23 characters.
  3. ids: These are identifiers for the documents. In this list, "doc1" and "doc2" are the IDs for the first and second documents, respectively. These IDs are used to uniquely identify each document in the collection for retrieval and other operations.
  4. collection.add(): This function call is where the documents, along with their metadata and IDs, are being added to the collection. In a database context, a collection is a grouping of documents that can be thought of as being somewhat equivalent to a table in a relational database. The documents are stored in the collection, and each document is associated with its metadata and ID.
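
Rather than typing the metadata and IDs by hand, you can derive them from the documents so the three lists stay aligned. This is a small sketch in plain Python; the "doc1", "doc2" naming scheme simply mirrors the example above and is not something Chroma DB requires.

```python
documents = ["The quick brown fox", "Jumps over the lazy dog"]

# Character count of each document, computed instead of hard-coded.
metadatas = [{"text_length": len(doc)} for doc in documents]

# Sequential IDs in the same style as the example: doc1, doc2, ...
ids = [f"doc{i}" for i in range(1, len(documents) + 1)]

# The three lists must have equal length, and the IDs must be unique.
assert len(documents) == len(metadatas) == len(ids)
assert len(set(ids)) == len(ids)

print(metadatas)  # [{'text_length': 19}, {'text_length': 23}]
print(ids)        # ['doc1', 'doc2']
```

These lists can then be passed to collection.add() exactly as in the step above.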


4 - Query the Collection:

Once your documents are indexed, you can perform a query. Here's how to retrieve the documents most similar to a given query text:

query_results = collection.query(query_texts=["Find documents similar to this text"], n_results=2)

print(query_results)
        
{'data': None,
 'distances': [[1.8206149414329449, 1.890620414904181]],
 'documents': [['The quick brown fox', 'Jumps over the lazy dog']],
 'embeddings': None,
 'ids': [['doc1', 'doc2']],
 'metadatas': [[{'text_length': 19}, {'text_length': 23}]],
 'uris': None}        

Here's a breakdown of the components:

  • collection: This refers to the collection within Chroma DB where your documents are stored. A collection is a container for your data, which in Chroma DB consists of documents stored together with their embeddings, metadata, and IDs.
  • .query(): This is a method used to retrieve documents from the collection. It performs a search to find entries that are most similar to the query inputs based on their embeddings.
  • query_texts: This argument specifies the texts you want to find similar documents for. In this case, the query text is "Find documents similar to this text". Chroma DB will use this string to generate an embedding and then search for documents with embeddings closest to this query embedding.
  • n_results=2: This argument tells Chroma DB how many results you want to return. n_results=2 means that the query should return the two most similar documents to the given query text from the collection.
  • query_results: This variable will store the results of the query. After executing this line of code, query_results will contain the top 2 documents from the collection that are most similar to the query text, based on the embeddings and the similarity metrics used by Chroma DB.

The query operation is essential in applications such as semantic search, recommendation systems, or any other domain requiring finding the most relevant items from a large dataset based on similarity to an input query.
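
To make "most similar" concrete, the sketch below ranks two toy vectors by squared Euclidean distance. It assumes the collection uses the squared L2 metric, which, to my knowledge, is Chroma DB's default distance space; the two-dimensional vectors and the names "near" and "far" are made up for illustration, since real embeddings have hundreds of dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy stand-ins for a query embedding and two stored document embeddings.
query = [1.0, 0.0]
candidates = {"near": [0.8, 0.6], "far": [-1.0, 0.0]}

# A smaller distance means a more similar vector, so sort in ascending order.
ranked = sorted(candidates, key=lambda name: squared_l2(query, candidates[name]))
print(ranked)  # ['near', 'far']
```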



The output of the query is a dictionary containing the results of the search operation performed on the Chroma DB collection. Here's what each key in this dictionary represents:

- 'data': This field is None, indicating that no additional data was returned by the query beyond the standard fields.


- 'distances':

'distances': [[1.8206149414329449, 1.890620414904181]],        

This list contains sublists, each corresponding to one query input. In this case, there is one sublist with two values. These values are the distances between the query text's embedding and the embeddings of the documents returned by the query. Lower distances mean higher similarity. Here, the distances [1.8206149414329449, 1.890620414904181] suggest that the first document ("The quick brown fox") is slightly more similar to the query text than the second document ("Jumps over the lazy dog") because it has a smaller distance.

- 'documents':

'documents': [['The quick brown fox', 'Jumps over the lazy dog']],        

This list contains sublists of documents that were returned by the query. Each sublist corresponds to one query input. Since there was only one query text, there's a single sublist containing the two documents most similar to the query text: "The quick brown fox" and "Jumps over the lazy dog".

- 'embeddings':


This field is None, which indicates that the actual embeddings of the returned documents were not included in the query results. This is often the case when only the documents themselves are needed, and not their vector representations.

To retrieve the embeddings of the returned documents, pass the include parameter with 'embeddings' in your query, as follows:

query_results = collection.query(
    query_texts=["Find documents similar to this text"],
    n_results=2,
    include=['embeddings']  # or include=['metadatas']
)        

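In this adjusted query, the include parameter is a list of the fields you want the database to return. Note that this list replaces the default selection rather than extending it: with include=['embeddings'], the embeddings and the ids are populated, while documents, metadatas, and distances come back as None. To get the embeddings alongside the other details, list every field you need, for example include=['embeddings', 'documents', 'metadatas', 'distances']. Here is the result: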

{'data': None,
 'distances': None,
 'documents': None,
 'embeddings': [[[0.002767008962109685,
                  0.033265210688114166,
                  -0.0006877018604427576,
                  0.042998284101486206,
                  0.036148350685834885,
                  -0.033342938870191574,
                  0.05550643801689148,
                  -0.10481128096580505,
                  0.013740118592977524,
                  -0.012425346300005913,
                  0.006475344765931368,
                  -0.03193841874599457,
                  -0.06048757955431938,
                  0.010666296817362309,
                  -0.03226395696401596,
                  -0.02862873114645481,
                  -0.005726168397814035,
                  -0.050810422748327255,
                  -0.00272698444314301,
                  -0.04731421545147896,
                  -0.144442617893219,
                  0.005216713063418865,
...
                  0.08438645303249359]]],
 'ids': [['doc1', 'doc2']],
 'metadatas': None,
 'uris': None}        



- 'ids': This list contains sublists of the unique identifiers for the returned documents. Like 'documents', each sublist corresponds to one query input, with "doc1" and "doc2" being the IDs of the returned documents.

- 'metadatas': This list contains sublists of metadata for the returned documents, with each sublist corresponding to one query input. The metadata here includes a dictionary for each document, containing the length of the text: {'text_length': 19} for the first document and {'text_length': 23} for the second.

- 'uris': This field is None, suggesting that there are no URIs (Uniform Resource Identifiers) associated with the returned documents in this query. URIs, if used, would typically provide a way to access the document or its location in a database or on the internet.
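
A quick way to see which fields a given query actually returned is to list the keys whose value is not None. The sketch below applies this to abbreviated copies of the two result dictionaries shown above; the helper name populated_fields is hypothetical, and the embedding values are placeholders, not real model output.

```python
def populated_fields(results):
    """Return the keys of a query result whose value is not None."""
    return [key for key, value in results.items() if value is not None]


# Shape of the result from the query without an include argument.
default_results = {
    "data": None,
    "distances": [[1.8206149414329449, 1.890620414904181]],
    "documents": [["The quick brown fox", "Jumps over the lazy dog"]],
    "embeddings": None,
    "ids": [["doc1", "doc2"]],
    "metadatas": [[{"text_length": 19}, {"text_length": 23}]],
    "uris": None,
}

# Shape of the result with include=['embeddings'] (placeholder vectors).
embeddings_only = {
    "data": None,
    "distances": None,
    "documents": None,
    "embeddings": [[[0.0, 0.0], [0.0, 0.0]]],
    "ids": [["doc1", "doc2"]],
    "metadatas": None,
    "uris": None,
}

print(populated_fields(default_results))
print(populated_fields(embeddings_only))
```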

In summary, the query returned two documents in response to the search query. For each document, it provides the text, its ID, the metadata about text length, and the distance from the query, indicating its relevance to the search text. The embeddings and the remaining fields are None because the query did not ask for them: without an explicit include argument, only these default fields are returned.
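
Because every field is a list of lists (one inner list per query text), it is often convenient to flatten the results for one query into per-document rows. The following is a sketch that works on the dictionary shown above; the helper name rows_for_query is hypothetical and it assumes the 'ids' field is always present.

```python
query_results = {
    "data": None,
    "distances": [[1.8206149414329449, 1.890620414904181]],
    "documents": [["The quick brown fox", "Jumps over the lazy dog"]],
    "embeddings": None,
    "ids": [["doc1", "doc2"]],
    "metadatas": [[{"text_length": 19}, {"text_length": 23}]],
    "uris": None,
}


def rows_for_query(results, query_index=0):
    """Combine the parallel result lists of one query into a list of dicts."""
    ids = results["ids"][query_index]

    def column(key):
        # Fields that were not requested are None; pad them so zip still works.
        values = results.get(key)
        return values[query_index] if values is not None else [None] * len(ids)

    return [
        {"id": doc_id, "document": doc, "metadata": meta, "distance": dist}
        for doc_id, doc, meta, dist in zip(
            ids, column("documents"), column("metadatas"), column("distances")
        )
    ]


for row in rows_for_query(query_results):
    print(row["id"], repr(row["document"]), row["metadata"], row["distance"])
```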


5 - Advanced Querying with Metadata Filters:

Advanced querying can be done using metadata filters. For example, if you want to find documents of a certain length, you can use:

filtered_results = collection.query(
    query_texts=["Find documents similar to this text"],
    n_results=2,
    where={"text_length": {"$gt": 20}}
)

print(filtered_results)
        

This will only return documents with a text_length metadata value greater than 20.

{'data': None,
 'distances': [[1.890620414904181]],
 'documents': [['Jumps over the lazy dog']],
 'embeddings': None,
 'ids': [['doc2']],
 'metadatas': [[{'text_length': 23}]],
 'uris': None}        
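
To build intuition for how a where clause selects documents, the sketch below emulates a simplified version of the filter locally in plain Python. It is not Chroma DB's implementation: it assumes only flat conditions on a single level, supports the comparison operators $eq, $ne, $gt, $gte, $lt, and $lte, and treats a bare value as shorthand for $eq. The function name matches is hypothetical.

```python
import operator

# Map filter operator names to Python comparison functions.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if the metadata dict satisfies every condition in where."""
    for field, condition in where.items():
        if field not in metadata:
            return False  # simplification: a missing field never matches
        if not isinstance(condition, dict):
            condition = {"$eq": condition}  # bare value means equality
        for op_name, expected in condition.items():
            if not OPERATORS[op_name](metadata[field], expected):
                return False
    return True


ids = ["doc1", "doc2"]
metadatas = [{"text_length": 19}, {"text_length": 23}]
where = {"text_length": {"$gt": 20}}

selected = [doc_id for doc_id, meta in zip(ids, metadatas) if matches(meta, where)]
print(selected)  # ['doc2']
```

Only "doc2" passes the filter, which agrees with the filtered query result above.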

Conclusion

Chroma DB adds an efficient storage and retrieval layer for embeddings to AI applications. By following the steps outlined above, you can use it from Python in your Jupyter Notebooks: starting the server, creating a collection, adding documents, and querying by similarity and by metadata. Whether you are working on NLP tasks, building recommendation systems, or developing another AI-driven application, Chroma DB can simplify how you store and use embeddings.
