What Hardware Do You Need for RAG with GenAI?

I recently shared some thoughts about using retrieval-augmented generation (RAG) as an alternative to training GenAI models. While a lot of businesses and research teams are looking to use their data to get customized outputs from GenAI, they may not be able to spend the time and money it takes to train models. And they do not need to in many cases.

With RAG, they can use off-the-shelf models like Llama 2, Mixtral, DBRX, or other state-of-the-art LLMs and build a workflow to retrieve context information from a precomputed vector store, send it to the model, and generate a contextualized response. The response might be a text summary, a block of code, or even a video scene analysis. Besides lowering the barrier to entry to GenAI, RAG is also the only practical way to generate responses using the latest data without retraining, since the data lives outside the model itself.
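To make the "retrieve, then generate" flow concrete, here is a minimal sketch of the step where retrieved context is injected into the prompt sent to the model. The passages, question, and prompt template are all illustrative placeholders; the resulting string is what you would pass to Llama 2, Mixtral, or whichever LLM you choose.

```python
# Minimal sketch: injecting retrieved context into an LLM prompt.
# The passages below stand in for results returned from a vector store.
retrieved_passages = [
    "Passage 1: ...text returned by the vector search...",
    "Passage 2: ...another relevant chunk...",
]

user_question = "Summarize the key findings in these documents."

# A simple RAG prompt: ask the model to answer from the retrieved context only.
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n"
    + "\n".join(f"- {p}" for p in retrieved_passages)
    + f"\n\nQuestion: {user_question}\nAnswer:"
)

print(prompt)  # this string is what gets sent to the off-the-shelf LLM
```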

You don’t have to be an AI researcher to put GenAI and RAG to work

Three years ago, my team was only starting to experiment with RAG as part of a solution that could hop from one source of information to another. We competed in the WebQA NeurIPS’21 challenge, where our multimodal, multi-hop RAG solution made the winners’ list. Today, you can build a RAG workflow from open-source components on Hugging Face. Check out this simple cookbook from Hugging Face, or this slightly more advanced one. You don’t need to train or retrain a model, and you don’t need extensive data science expertise. You don’t even need to stick to text-based data. RAG is a great way to support multimodal data: video, audio, images, and text all together.

Start by defining the body of knowledge you want to build on. For example, a pharma company might want to use an LLM to explore an archive of its test results, or a university physics department might want to query every paper published by its faculty. In either case, you use open-source tools to ingest, embed, and create a vector index of the information. Once you have the data indexed, you can inject query retrieval results into an off-the-shelf model like Llama 2 to get personalized responses. Does this sound too good to be true? Check out this exact workflow, with code examples on the Intel Gaudi 2 AI accelerator, showing how to create a vector index of the Constitution of the United States and then use the Llama 2 model to answer user queries, with the LLM’s answers grounded in the passages of the Constitution.
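The sketch below wires that workflow together with common open-source pieces: sentence-transformers for embeddings, FAISS for the vector index, and a Hugging Face text-generation pipeline standing in for the LLM. The model names, placeholder documents, and chunking are illustrative assumptions, not the exact setup from the Gaudi 2 example linked above.

```python
# Hedged sketch of a minimal RAG pipeline: ingest -> embed -> index -> retrieve -> generate.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# 1) Ingest: placeholder documents and naive paragraph chunking.
#    Replace with your own loaders and a proper text splitter.
documents = [
    "First document...\n\nreplace with your own corpus, e.g. archived test results.",
    "Second document...\n\nfor instance, papers published by your faculty.",
]
chunks = [c.strip() for doc in documents for c in doc.split("\n\n") if c.strip()]

# 2) Embed and build the vector index -- this part runs happily on a CPU-only node.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
index.add(np.asarray(vectors, dtype="float32"))

# 3) Retrieve: embed the query and pull the most similar chunks.
query = "What do the documents say about stability?"
q_vec = embedder.encode([query], normalize_embeddings=True)
top_k = min(3, index.ntotal)
_, ids = index.search(np.asarray(q_vec, dtype="float32"), top_k)
context = "\n".join(chunks[i] for i in ids[0])

# 4) Generate: hand the retrieved context to an off-the-shelf LLM.
#    Llama 2 is gated on Hugging Face; swap in any instruct model you have access to.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
prompt = f"Use only this context to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
print(generator(prompt, max_new_tokens=200)[0]["generated_text"])
```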

What do you need to run RAG?

In our Cognitive AI team at Intel Labs, we tend to work on local hardware or private cloud resources. I’m lucky that my team and I have access to some of the most advanced AI clusters in the Intel Developer Cloud, powered by Intel Xeon CPUs and Intel Gaudi 2 AI accelerators. But I still think about the most efficient way to run GenAI, because the more efficient it is, the more people can use it.

Whether you’re running GenAI in your data center, on an edge server, or in the cloud, you can save yourself some money (both in terms of hardware cost and power consumption) by choosing the right hardware for each part of the RAG pipeline.

Let’s break the RAG workflow into three parts: data prep, data retrieval, and result generation. CPUs with vector extensions (like AVX-512 on Intel Xeon processors) excel at data prep and retrieval. You can do all your ingesting, embedding, and vectorization on a CPU-only node. When querying the model, the CPU can quickly fetch data from your vector database, ensuring you won’t have a bottleneck during the “retrieval” part of RAG.
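As a concrete illustration of the CPU-only half, the embedding and index-build step from the sketch above can be pinned explicitly to the CPU; the model name and sample chunks are again assumptions.

```python
# Data-prep / retrieval half of RAG pinned to a CPU-only node (illustrative).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Run the embedding model on CPU; AVX-512-capable Xeons handle this workload well.
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

chunks = ["chunk one of your corpus", "chunk two", "chunk three"]
vectors = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(np.asarray(vectors, dtype="float32"))

# At query time, the same CPU node embeds the question and searches the index.
q = embedder.encode(["your user question"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q, dtype="float32"), 2)
print(ids[0], scores[0])
```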

For result generation, you’ll run inference on your model of choice. An AI accelerator is a fantastic choice for this part of the process. In our lab, we make extensive use of AI clusters in the Intel Developer Cloud, where individual nodes comprise eight Intel Gaudi 2 AI accelerators with a dual-socket Intel Xeon as the host CPU.
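For the generation half, here is a hedged sketch that loads the model once and places it on whatever accelerator is available. The device handling is an assumption to adapt to your environment: on Gaudi nodes you would use Habana’s PyTorch integration and the "hpu" device rather than CUDA, and the model name is just an example.

```python
# Illustrative sketch: run the generation step of a RAG pipeline on an accelerator.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model; any causal LM works

# Assumed device selection: on Intel Gaudi you would instead import
# habana_frameworks.torch.core and target the "hpu" device.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# The prompt is the retrieved context plus the user question, as built earlier.
prompt = "Context:\n<retrieved passages go here>\n\nQuestion: <user question>\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```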

Effective and efficient RAG pipelines mean better and faster results

RAG workflows are a big part of our work at Intel Labs, and our Cognitive AI team is currently working on ways to make those pipelines more effective and efficient. There are many open research problems around RAG that my team is looking into. For example, our recent NAACL 2024 paper on Semi-Structured Chain-of-Thought investigates when a RAG system should rely on its parametric memory versus retrieved external knowledge. By imposing structure on the AI's reasoning chain, we show how information from parametric memory, retrieved context documents, and structured knowledge bases can be seamlessly integrated to improve performance on complex reasoning tasks, as shown in the diagram below. We are also investigating other research directions, such as how multimodal information in the form of video segments or images can be most effectively provided as external context to LLMs.

Framework for integrating multiple sources of knowledge into an LLM's reasoning chain. For more details, check out our NAACL 2024 paper on Semi-Structured Chain-of-Thought.

In the meantime, there is a lot of value to extract using simple workflows that deploy RAG to customize LLMs to your own data. To start today, please check out this code example that implements a RAG flow on the Intel Gaudi 2 AI accelerator. You can try out Intel Gaudi 2 using the Intel Developer Cloud. Our ultimate goal is an industry-standard stack for RAG implementations built around PyTorch and OpenVINO. With a solid foundation in place, you’ll start to see a lot more solutions from the ecosystem that make GenAI even more accessible to everyone.

