23-5 Chroma Vector Store with Langchain - Local Retrieval with HTML docs
I had trouble deciding which Vector DB to examine next after Pinecone. This handy diagram neatly divides the Vector DB landscape into primarily Self-Hosted/Open-Source and Managed DB in the Cloud, which gives some clue as to which to cover.
Chroma DB was selected in Langchain's State of AI 2023 as the developers' choice for local vector storage.
How we interpret the results is really up to us and what we need for our RAG applications. On a more wary note, it is too early to determine which Vector DB is best overall, and the results are likely weighted in favour of Langchain developers' preferences. Those who are not Langchain users might well prefer other options, but we simply cannot know. Still, we need to start somewhere.
Upon reviewing Chroma's homepage, the pitch is straightforward: tools for embedding, storing, and querying your data, with an emphasis on simplicity and developer productivity. Chroma offers a self-hosted option as well as a managed cloud service (to be released soon).
Getting started really is that simple, and it runs in a notebook:
pip install chromadb
import chromadb
This installs chromadb locally and provides the Python SDK to interact with the vector store. No sign-up or API keys needed.
Initializing the client is similar to how we did for Pinecone.
chroma_client = chromadb.Client()
Instead of an Index, Chroma calls it a Collection. The following command creates "my_collection":
collection = chroma_client.create_collection(name="my_collection")
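Before we hand things over to Langchain, here is a minimal sketch of what the raw Chroma client can already do with that collection. The toy documents and ids below are hypothetical placeholders; by default Chroma embeds them with its built-in embedding function.
# add a couple of toy documents; Chroma embeds them with its default embedding function
collection.add(
    documents=["Chroma is an open-source vector store", "Pinecone is a managed vector database"],
    ids=["doc1", "doc2"],
)
# query by text; Chroma embeds the query and returns the closest matches
results = collection.query(query_texts=["Which vector store is open source?"], n_results=1)
print(results["documents"])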
Chroma's diagram neatly sums up a typical RAG life cycle. For now we focus on Queries, Embeddings, and Retrieval; in another session, we can cover Generation.
!pip install chromadb openai langchain tiktoken
I will be using OpenAI's Prompt Engineering Guide (Prompt engineering - OpenAI API) as our reference material.
We will need to convert it to a format we can load into Chroma and then embed with OpenAI embeddings. Langchain does support parsing content directly from a URL, but that is beyond our scope here.
Clipper is a Node.js command-line tool that lets you easily clip content from web pages and convert it to Markdown. It requires the Terminal and some preparation, which would bog us down here. Use it for converting online sources in bulk, keeping copyright and fair-use policies in mind.
For simplicity, I will use MarkDownload, a browser extension: MarkDownload - Markdown Web Clipper (google.com). It downloads the Markdown-formatted file to your Downloads directory.
Let's simplify the process of integrating and embedding with Chroma using Langchain.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os
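# NOTE (assumption): OpenAIEmbeddings below expects an OpenAI key in the environment,
# e.g. os.environ["OPENAI_API_KEY"] = "sk-..."  -- set yours before running this cell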
# load the document and split it into chunks
loader = TextLoader("name_of_the_file.txt")
documents = loader.load()
We will upload the Markdown file into Colab; the code above loads the file and stores it in documents. Simply drag and drop the file into the Colab Files pane. Note that the file is not stored permanently and will be lost when the session terminates.
# split it into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(documents)
# create the OpenAI embedding function
embedding_function = OpenAIEmbeddings(model="text-embedding-ada-002")
We can't embed all the text as one huge chunk. Langchain offers various ways to split the text into chunks of a desired size with overlap; the overlap helps retain context that carries over from the previous entry, sentence, or paragraph.
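As a quick sanity check (a small sketch, assuming the splitter code above has already run), we can look at how many chunks were produced and where each chunk starts in the source file:
# how many chunks did the splitter produce?
print(len(all_splits))
# add_start_index=True records each chunk's character offset within the original document
print(all_splits[0].metadata)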
We also need an embedding function. We use Langchain's OpenAIEmbeddings class with the text-embedding-ada-002 model.
# load it into Chroma
db = Chroma.from_documents(all_splits, embedding_function)
We now store the split documents in Chroma, with Langchain embedding each document using the embedding function we created earlier. As you can see, a lot of detail has been abstracted away.
# query it
query = "What is a good tactic for hiding inner monologue?"
docs = db.similarity_search(query)
print(docs)
# print results
print(docs[0].page_content)
We are now ready to query our split documents from OpenAI's prompt engineering guide. The top result is shown below.
The previous tactic demonstrates that it is sometimes important for the model to reason in detail about a problem before answering a specific question. For some applications, the reasoning process that a model uses to arrive at a final answer would be inappropriate to share with the user. For example, in tutoring applications we may want to encourage students to work out their own answers, but a model’s reasoning process about the student’s solution could reveal the answer to the student. Inner monologue is a tactic that can be used to mitigate this. The idea of inner monologue is to instruct the model to put parts of the output that are meant to be hidden from the user into a structured format that makes parsing them easy. Then before presenting the output to the user, the output is parsed and only part of the output is made visible.
Unlike Pinecone, the Chroma DB we generated here will not persist. We can either save the data so it can be recalled again, or host Chroma externally outside of our compute notebook, where storage is ephemeral. A good local option is Docker: Deployment | Chroma (trychroma.com).
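If you only need the data to survive a notebook restart, one option (a minimal sketch; the directory name is an arbitrary choice) is to pass a persist_directory so Chroma writes to local disk. For an externally hosted Chroma, such as the Docker deployment, the chromadb HTTP client can be handed to Langchain instead:
# persist to local disk so the collection survives notebook restarts
db = Chroma.from_documents(all_splits, embedding_function, persist_directory="./chroma_db")
# later, or in a new session, reload the persisted collection
db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
# or point Langchain at a self-hosted Chroma server (e.g. running in Docker)
client = chromadb.HttpClient(host="localhost", port=8000)
db = Chroma(client=client, embedding_function=embedding_function)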
query = "What is the work around for fixed context length, during a dialogue between a user and an assistant"
retriever = db.as_retriever(search_type="mmr")
retriever.get_relevant_documents(query)[0]
Besides matching results by closest (cosine) distance, we can retrieve relevant documents using MMR (maximal marginal relevance), which can be handier for providing context for an LLM to respond to user queries.
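If we want finer control over MMR, the retriever also accepts search_kwargs (a sketch; the k and fetch_k values here are arbitrary):
# k = number of documents returned; fetch_k = size of the candidate pool MMR re-ranks for diversity
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 4, "fetch_k": 20})
docs = retriever.get_relevant_documents(query)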
Other methods, such as filtering based on metadata, are also possible. This is important for regulatory and role-based access control. For instance, in a document management system with vector search enabled, we want to ensure that the Commercial user group in Pharma cannot examine confidential documents or information belonging to the Medical/Clinical team. Attaching metadata and extensive controls before any data is stored in the system helps prevent unwanted retrieval and generation.
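As an illustration only (the metadata key and value here are hypothetical and not part of our prompt-engineering example), Langchain's Chroma wrapper accepts a filter that is matched against each chunk's metadata at query time:
# hypothetical: suppose each chunk was stored with a "department" metadata field
docs = db.similarity_search(
    "confidential trial results",
    filter={"department": "medical"},  # only chunks tagged for the Medical team are considered
)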
I hope this has given you a sufficient nudge to try Chroma out for yourself and integrate it into your applications and use cases.
Let's push ahead with the next Vector DB, Qdrant. Note that I will be posting more detailed tutorials on Chroma at a future date, once the hosted service is available.
Here are today's notebooks to try out the code yourself.