Building a RAG Pipeline for Enterprise Content Using Mamba

Introduction

In today's data-driven world, enterprises are constantly seeking innovative ways to manage and leverage their vast repositories of content. Retrieval Augmented Generation (RAG) pipelines represent a cutting-edge approach to content retrieval and generation, combining the power of large language models with information retrieval techniques. Mamba is a state-of-the-art selective state space model (SSM), and enterprises can leverage it to construct robust RAG pipelines tailored to their specific needs, transforming how they interact with and derive insights from their content.

RAG pipelines integrate two critical components: retrieval and generation. The retrieval component sifts through the enterprise's content repositories to identify relevant information based on user queries or prompts. This retrieved information serves as context for the generation component, which employs advanced language models like Mamba to generate accurate and contextually relevant responses, summaries, or insights.
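
Conceptually, the flow looks like the short Python sketch below; retriever and generate_fn are placeholders standing in for the concrete Qdrant retriever and Mamba generation step we build later in this post.

# Illustrative sketch of the two RAG stages. `retriever` and `generate_fn`
# are placeholders for the components built later in this article.
def rag_answer(query, retriever, generate_fn):
    # 1. Retrieval: find passages relevant to the query
    docs = retriever.get_relevant_documents(query)
    context = "\n".join(doc.page_content for doc in docs)

    # 2. Generation: condition the language model on the query plus context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate_fn(prompt)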

For example, EdTech companies can harness the Mamba model for enterprise content generation in a highly impactful way. Their platforms can use a RAG pipeline to create personalized learning materials for students. By retrieving relevant educational resources from vast content repositories, such as textbooks, academic journals, and online courses, and generating explanations, summaries, or practice questions, educators can provide students with highly targeted and engaging learning experiences.

Leveraging Mamba for the RAG pipeline can support content localization efforts by retrieving educational resources in multiple languages and generating translations, adaptations, or culturally relevant examples for diverse student populations. This capability enables EdTech platforms to expand their reach and deliver high-quality educational content to learners worldwide.

Let’s see how we can leverage the Mamba model for EdTech enterprise content generation.

Why Leverage E2E Networks’ Cloud GPUs?

Mamba, particularly when used in complex tasks like Retrieval Augmented Generation (RAG), can be computationally intensive. Cloud GPUs offer high-performance computing capabilities that can handle the computational demands of running Mamba efficiently. They enable parallel processing of tasks, which is crucial for speeding up the execution of Mamba, especially when dealing with large datasets or complex models. This parallelization can lead to faster inference times and more responsive systems.

In a Retrieval Augmented Generation pipeline, where the workload may vary over time or with different datasets, the ability to dynamically adjust GPU resources ensures optimal performance and cost-effectiveness. Mamba may require significant memory, especially when processing large datasets or models, which cloud GPU resources can easily provide. While cloud GPU services involve operational costs, they can be more cost-effective than purchasing and maintaining dedicated hardware, particularly for variable workloads. Users pay only for the resources they consume, which makes them a cost-efficient option for running Mamba-based RAG pipelines.

This is where E2E Networks comes into the picture. E2E Networks provides a variety of Cloud GPUs, which you can see in the Product list. The Cloud GPUs are affordable and highly advanced. To get started, create an account on E2E Networks’ My Account portal and log in. Then set up your SSH keys by visiting Settings.

After creating the SSH keys, visit Compute to create a node instance.

Open Visual Studio Code and install the Remote Explorer and Remote - SSH extensions. Then open a new terminal and connect to your node from your local machine with the following command:


ssh root@
        

With this, you’ll be logged in to your node.

RAG Pipeline Using Mamba

Install the dependencies needed to make a RAG pipeline using the Mamba model.


%pip install mamba-ssm causal-conv1d langchain langchain-community fastembed qdrant-client datasets transformers
        

As we are going to build a RAG pipeline for the EdTech industry, let’s use the OpenStax subset of the Cosmopedia dataset. Using the ‘datasets’ library, download the dataset and store it in a CSV file.


from datasets import load_dataset

# Download the OpenStax subset of Cosmopedia
data = load_dataset("HuggingFaceTB/cosmopedia", "openstax", split="train")

# Save it as a CSV file, then load the rows back as LangChain documents
df = data.to_pandas()
df.to_csv("openstax.csv")

from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path='./openstax.csv')
data = loader.load()
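
As a quick optional sanity check, you can confirm that the CSV loaded into LangChain documents as expected:

# Optional check: how many documents were loaded, and what do they look like?
print(f"Loaded {len(data)} documents")
print(data[0].page_content[:200])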

        

Then, using a text splitter, we will split the documents into chunks of 1,000 characters with an overlap of 200 characters.


from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(data)
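
Before embedding, it can help to verify how many chunks the splitter produced (the exact count depends on the dataset size):

# Optional check: number of chunks and a preview of the first one
print(f"Split into {len(texts)} chunks")
print(texts[0].page_content[:200])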

        

Now, we’ll create embeddings of the chunks that we got after text splitting. We will use FastEmbed, which supports several embedding models; here we use BAAI/bge-small-en-v1.5.


from langchain.embeddings import FastEmbedEmbeddings
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")
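
To confirm the embedding model loads correctly, you can embed a sample query; bge-small-en-v1.5 produces 384-dimensional vectors:

# Optional check: embed a sample query and inspect the vector dimensionality
sample_vector = embeddings.embed_query("What is global stratification?")
print(len(sample_vector))  # 384 for bge-small-en-v1.5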
        

With the embedding model ready, we need to store the embedded chunks in a vector database so that retrieval is easy. Here, we use the Qdrant vector database, which can keep the embeddings in in-memory storage. We name the collection ‘Edtech’.


from langchain.vectorstores import Qdrant
qdrant = Qdrant.from_documents(texts,
                               embeddings,
                               location=":memory:",
                               collection_name="Edtech")
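
Once the collection is built, a quick similarity search is an easy way to verify that retrieval returns sensible passages; the query below is just an example:

# Optional check: retrieve the top-3 chunks for a sample query
results = qdrant.similarity_search("Explain global stratification", k=3)
for doc in results:
    print(doc.page_content[:150], "\n---")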
                                       

Now, we need to prepare the model. But before we do that, let’s look at what Mamba is.

Mamba

Mamba is a state-of-the-art sequence modeling architecture designed to handle a wide range of tasks that require understanding and generating sequential data. Mamba is built on the foundation of selective state space models (SSMs), which enable it to selectively remember and use relevant information while discarding unnecessary details. Mamba excels at autoregressive language modeling, demonstrating performance competitive with established Transformer architectures.
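
To make the selection mechanism concrete, here is a heavily simplified, illustrative PyTorch sketch of a selective scan. It is not the fused CUDA kernel that mamba_ssm actually runs; the projection layers, shapes, and parameter names are placeholders chosen for readability.

# Simplified selective-scan sketch (illustrative only, not the mamba_ssm kernel).
# The state update h_t = exp(dt*A) * h_{t-1} + dt * B_t * x_t uses parameters
# (dt, B, C) projected from the current input, which is what makes the scan
# "selective".
import torch
import torch.nn as nn
import torch.nn.functional as F

def selective_scan(x, A, dt_proj, B_proj, C_proj):
    # x: (batch, seq_len, d_model); A: (d_model, d_state), kept negative for stability
    batch, seq_len, d_model = x.shape
    d_state = A.shape[-1]
    h = x.new_zeros(batch, d_model, d_state)       # hidden state
    ys = []
    for t in range(seq_len):
        xt = x[:, t]                               # (batch, d_model)
        dt = F.softplus(dt_proj(xt))               # input-dependent step size
        B = B_proj(xt)                             # input-dependent input matrix
        C = C_proj(xt)                             # input-dependent output matrix
        A_bar = torch.exp(dt.unsqueeze(-1) * A)    # discretized A
        h = A_bar * h + dt.unsqueeze(-1) * B.unsqueeze(1) * xt.unsqueeze(-1)
        ys.append((h * C.unsqueeze(1)).sum(-1))    # y_t = C_t . h_t
    return torch.stack(ys, dim=1)                  # (batch, seq_len, d_model)

# Tiny usage example with random weights
d_model, d_state = 16, 4
A = -torch.rand(d_model, d_state)
out = selective_scan(torch.randn(2, 8, d_model), A,
                     nn.Linear(d_model, d_model),
                     nn.Linear(d_model, d_state),
                     nn.Linear(d_model, d_state))
print(out.shape)  # torch.Size([2, 8, 16])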

Model Components

  1. Linear Projections: Mamba utilizes linear projections to transform input sequences into a suitable representation for processing.
  2. Selective Mechanism: The selective mechanism in Mamba, facilitated by the SSM architecture, allows the model to focus on relevant information and discard irrelevant details efficiently.
  3. Gated SSM Block: Instead of attention, Mamba combines a short causal convolution with a gated selective SSM block, capturing dependencies across the input sequence in linear time, which is particularly useful for tasks requiring context understanding.

Let’s prepare the model. We’ll use the pretrained Mamba checkpoint published under the state-spaces organization on Hugging Face.


from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
import torch
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b-slimpj", device="cuda", dtype=torch.float16)
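
As a quick check that the weights landed on the GPU, you can count the parameters; this checkpoint has roughly 2.8 billion:

# Optional check: parameter count and device placement
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters on {next(model.parameters()).device}")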

        

Then, we will use the GPT-NeoX-20B tokenizer, following the Mamba paper.


from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
tokenizer.pad_token = tokenizer.eos_token
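
A quick round trip through the tokenizer confirms it behaves as expected:

# Optional check: encode and decode a sample string
ids = tokenizer("Global stratification refers to", return_tensors="pt").input_ids
print(ids.shape)
print(tokenizer.decode(ids[0]))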
        

Then, we will define an async function using asyncio. We first initialize a retriever from the Qdrant vector store and use it to fetch documents similar to the query. Since the retriever returns a list of documents, we extract their texts via page_content and prepend them to the query as context. We then encode this augmented prompt, generate the output with the Mamba model, and decode the newly generated tokens with the tokenizer. We’ll pass a query to generate the content; let’s see the response:


import nest_asyncio
import asyncio
nest_asyncio.apply()

# Define the function
async def generate_answer(query):
    # Get a Retriever object from Qdrant
    retriever = qdrant.as_retriever()

    # Use the retriever to get similar documents
    similar_documents = await retriever.aget_relevant_documents(query)
    similar_texts = [doc.page_content for doc in similar_documents]

    # Augment the query with the retrieved context
    context = "\n".join(similar_texts)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Encode the augmented prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to("cuda")

    # Generate from the augmented prompt and decode only the new tokens
    generated_text = model.generate(input_ids, max_length=input_ids.shape[1] + 200)
    decoded_text = tokenizer.decode(generated_text[0][input_ids.shape[1]:],
                                    skip_special_tokens=True)

    return decoded_text

# Use the function
query = "Global Stratification"
answer = asyncio.run(generate_answer(query))
print(answer)

        

The response will look something like this:


The Global Stratification System (GSS) is a system of stratification that is used to classify the world's countries and territories into four groups: high-income, upper-middle-income, lower-middle-income, and low-income. The GSS is used to determine the level of development of a country or territory and to compare the level of development between countries and territories.
The GSS is based on the United Nations' Human Development Index (HDI), which is a composite index that measures a country's level of development based on three key indicators: life expectancy at birth, education, and per capita income. The GSS is updated every five years and is used to determine the level of development of countries and territories around the world.
The GSS is divided into four groups: high-income, upper-middle-income, lower-middle-income, and low-income.
        

Conclusion

As we saw, Mamba works slightly differently from the Transformer-based models we are used to. Because it replaces attention with a selective state space mechanism, it scales linearly with sequence length and can be more efficient than Transformers on long inputs. In this article, we demonstrated the steps to build a RAG pipeline using Mamba and a vector database. We expect to see more SSM-based architectures emerge in the near future.
