The Data Prep Kit and Open Source RAG

On November 6th, 2024 at 6pm, the AI Alliance NYC had an amazing meetup at Lightning AI's headquarters in Manhattan. The agenda was packed with great talks. I wanted to give you a quick summary of my topic and a look ahead.

Agenda

  • Meet the AI Alliance (5') – Dean Wampler, IBM
  • Generative AI for Materials Design (5') – Stefano Martiniani, NYU
  • Accelerating the AI Lifecycle with Lightning Studios (12') – Rob Levy, Lightning.ai
  • Evaluating LLM Applications with AutoEval (12') – Sarmad Qadri, LastMile.ai
  • Accelerating Triton Kernels (12') – Adnan Hoque, Research Engineer at IBM
  • Quick intro to the Data Prep Kit and Open Source RAG (12') – Timothy Spann, Milvus
  • Panel discussion (20')

Quick Intro to the Data Prep Kit and Open Source RAG

See: https://www.slideshare.net/slideshow/tspann06-nov-2024_ai-alliance_nyc_-intro-to-data-prep-kit-and-open-source-rag/273079590

In today's enterprise, the volume of unstructured data is growing at a steady pace, which makes it difficult to extract value from this source of potential knowledge. With the rise of deep learning and AI, the need for this data grows daily. Fortunately, these same models and others are part of the solution.

By using deep learning embedding models, we can convert our documents and other unstructured data into vectors, which are high-dimensional arrays of numbers, and store them in a vector database. This allows us to store, update, delete, query, and search these data sources and use them as a knowledge base that is ideal for Large Language Models and chat.
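To make the store, delete, and search idea concrete, here is a minimal sketch in plain Python. It is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `TinyVectorStore` is a hypothetical in-memory class, not the API of Milvus or any other vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store showing upsert, delete, and search."""

    def __init__(self) -> None:
        self._rows: dict[str, tuple[Counter, str]] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Insert a new document or overwrite an existing one.
        self._rows[doc_id] = (embed(text), text)

    def delete(self, doc_id: str) -> None:
        self._rows.pop(doc_id, None)

    def search(self, query: str, k: int = 2) -> list[tuple[str, float]]:
        # Score every stored vector against the query and keep the best k.
        q = embed(query)
        scored = [(doc_id, cosine(q, vec)) for doc_id, (vec, _) in self._rows.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore()
store.upsert("a", "milvus is a vector database")
store.upsert("b", "parquet is a columnar file format")
store.upsert("c", "ray scales python workloads")
top = store.search("which vector database should I use", k=1)
print(top[0][0])
```

A real embedding model produces dense vectors that capture meaning rather than exact word matches, and a real vector database uses an index instead of scanning every row, but the shape of the workflow is the same.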

Unstructured data, which has no predefined data model or schema, makes up a large portion of the world's data, and 90% of the data is never even analyzed. It includes text, images, video, and audio.

Common Tasks in Preparing Data for LLMs

Documents

  • De-duplication of docs
  • Extracting text from PDFs and other documents
  • Removing excessive markup
  • Tokenizing and Chunking
  • Assessing doc quality
  • PII detection
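A few of the document tasks above can be sketched with nothing but the standard library. This is a rough illustration, not how the Data Prep Kit does it: the function names `strip_markup`, `chunk_words`, and `find_emails` are hypothetical, chunking is done on words rather than model tokens, and the email regex is a deliberately simple stand-in for real PII detection.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def strip_markup(html: str) -> str:
    """Replace tags with spaces, then collapse runs of whitespace."""
    return " ".join(TAG_RE.sub(" ", html).split())


def chunk_words(text: str, size: int = 5, overlap: int = 2) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)  # avoid a trailing chunk that adds nothing new
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def find_emails(text: str) -> list[str]:
    """Very rough PII check: return anything that looks like an email address."""
    return EMAIL_RE.findall(text)


html = "<p>Contact <b>Ada</b> at ada@example.com for the data prep notes.</p>"
clean = strip_markup(html)
chunks = chunk_words(clean, size=5, overlap=2)
emails = find_emails(clean)
print(clean)
print(chunks)
print(emails)
```

Overlapping chunks keep a sentence that straddles a boundary visible in both neighbors, which tends to help retrieval later on.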

Code

  • Language detection
  • Malware detection
  • Code quality

The Data Prep Kit

https://github.com/sujee/data-prep-kit-examples/tree/main/dpk-intro

The Data Prep Kit is a set of open source, modular Python libraries created by IBM that scales from a laptop up to clusters. It helps with data preparation for both documents and code, with many ready-to-use modules available directly from the library.

With the Data Prep Kit you can take code, HTML, or PDF documents and quickly turn them into Parquet files in a single step. You can then add a deduplication step, validate document quality, and chunk and embed your documents into a vector database to use as part of RAG, fine-tuning, and instruction tuning. There are a lot of options you can add to your pipeline, with more coming, so please feel free to develop your own module. I have a few I just started on.
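The idea of chaining modular steps can be sketched in a few lines. To be clear, this is not the Data Prep Kit API: `dedup_exact`, `min_words`, and `run_pipeline` are hypothetical names, records are plain dictionaries instead of Parquet rows, and the word-count check is a crude stand-in for a real quality score.

```python
import hashlib
from typing import Callable

Record = dict[str, str]
Transform = Callable[[list[Record]], list[Record]]


def dedup_exact(records: list[Record]) -> list[Record]:
    """Keep the first record for each distinct text, ignoring case and spacing."""
    seen: set[str] = set()
    kept: list[Record] = []
    for rec in records:
        normalized = " ".join(rec["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept


def min_words(threshold: int) -> Transform:
    """Build a simple quality filter that drops very short documents."""
    def apply(records: list[Record]) -> list[Record]:
        return [r for r in records if len(r["text"].split()) >= threshold]
    return apply


def run_pipeline(records: list[Record], steps: list[Transform]) -> list[Record]:
    """Feed the output of each step into the next one."""
    for step in steps:
        records = step(records)
    return records


docs = [
    {"id": "1", "text": "Data Prep Kit scales from a laptop to a cluster."},
    {"id": "2", "text": "data prep kit  scales from a laptop to a cluster."},
    {"id": "3", "text": "Too short."},
]
result = run_pipeline(docs, [dedup_exact, min_words(4)])
print([r["id"] for r in result])
```

Because every step takes records in and hands records back, steps can be added, removed, or reordered without touching the others, which is the appeal of a modular pipeline.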

You can then deploy to Ray, Spark or Kubeflow for production enterprise workloads.

Open Source RAG

Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generative models. Grounding the model's answer in retrieved documents improves accuracy and reduces hallucinations in LLMs. It is also a great way to inject domain-specific knowledge into your prompts.

We can do all of this with open source software, models, and tools. For our Open Source RAG stack we use Docling, Data Prep Kit, Hugging Face, LangChain, LlamaIndex, Llama, IBM Granite, Milvus, and Python.

In the follow-up video we will walk you through how to build your own RAG application with these open source tools, and we are scheduling workshops around the country. Let's build amazing AI applications in open source.

Example RAG Flow
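
As a rough sketch of what a flow like this looks like in code, the steps are retrieve, augment the prompt, then generate. Everything here is an assumption for illustration: `retrieve` ranks by simple word overlap in place of a vector search, `build_prompt` uses a made-up template, and `generate` is a placeholder where a real flow would call an LLM.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by Jaccard word overlap (a stand-in for vector search)."""
    q = tokens(question)

    def score(chunk: str) -> float:
        c = tokens(chunk)
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(chunks, key=score, reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Put the retrieved chunks in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )


def generate(prompt: str) -> str:
    """Placeholder for an LLM call; a real flow would send the prompt to a model."""
    return f"[model output for a prompt of {len(prompt)} characters]"


chunks = [
    "Milvus stores embeddings and serves similarity search.",
    "Docling converts PDF documents into structured text.",
    "Parquet is a columnar storage format.",
]
question = "What does Milvus store?"
context = retrieve(question, chunks, k=1)
prompt = build_prompt(question, context)
print(prompt)
print(generate(prompt))
```

The key design point is that the model only sees what retrieval hands it, so the quality of the prepared and indexed data drives the quality of the answer.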


Keywords: AI Alliance, Open Source RAG, Unstructured Data, LLM Applications, Data Prep Kit, LlamaIndex, Milvus, Vector Database


Dean Wampler, AI Alliance, IBM



#AIAlliance Thanks to Dave Nielsen, Santosh Borse, Dean Wampler, Stefano Martiniani, IBM, Lightning.AI, Rob Levy, Sarmad Qadri, LastMile.AI, Adnan Hoque, Andrew Hoh, Phil Chang, The Community, and all attendees.
