
The Data Prep Kit and Open Source RAG

Tim Spann

Senior Solutions Engineer

Published: November 12, 2024

On November 6th, 2024 at 6pm, the AI Alliance NYC held an amazing meetup at Lightning AI's headquarters in Manhattan. The agenda was packed with great talks. Here is a quick summary of my topic and a look ahead.

Agenda

Meet the AI Alliance (5') – Dean Wampler, IBM
Generative AI for Materials Design (5') – Stefano Martiniani, NYU
Accelerating the AI Lifecycle with Lightning Studios (12') – Rob Levy, Lightning.ai
Evaluating LLM Applications with AutoEval (12') – Sarmad Qadri, LastMile.ai
Accelerating Triton Kernels (12') – Adnan Hoque, Research Engineer at IBM
Quick intro to the Data Prep Kit and Open Source RAG (12') – Timothy Spann, Milvus
Panel discussion (20')

Quick Intro to the Data Prep Kit and Open Source RAG

See: https://www.slideshare.net/slideshow/tspann06-nov-2024_ai-alliance_nyc_-intro-to-data-prep-kit-and-open-source-rag/273079590

In today's enterprise, the volume of unstructured data is growing at a steady pace, which makes it difficult to extract value from this source of potential knowledge. With the rise of deep learning and AI, the need for this data grows daily. Fortunately, these and other models are also part of the solution.

By using deep learning embedding models, we can take our documents and other unstructured data, convert them into vectors (high-dimensional arrays of numbers) and store them in a vector database. This allows us to store, update, delete, query and search these data sources and use them as a knowledge base, which is ideal for Large Language Models and chat.
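The embed, store and search flow can be illustrated with a small sketch. Everything here is an assumption made for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real neural embedding model, and `InMemoryVectorStore` is a hypothetical class standing in for a vector database collection, not the API of Milvus or any other product.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use far more dimensions


def toy_embed(text: str) -> list[float]:
    """Hash each word into one of DIM buckets, then L2-normalize the counts."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are unit length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database collection."""

    def __init__(self) -> None:
        self._rows = {}  # doc_id -> (vector, original text)

    def upsert(self, doc_id: str, text: str) -> None:
        """Insert a document, or replace it if the id already exists."""
        self._rows[doc_id] = (toy_embed(text), text)

    def delete(self, doc_id: str) -> None:
        self._rows.pop(doc_id, None)

    def search(self, query: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Return the top_k (doc_id, similarity) pairs, most similar first."""
        q = toy_embed(query)
        scored = [(doc_id, cosine(q, vec)) for doc_id, (vec, _) in self._rows.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


store = InMemoryVectorStore()
store.upsert("a", "milvus is a vector database")
store.upsert("b", "parquet is a columnar file format")
results = store.search("which vector database should I use", top_k=1)
```

A real deployment would swap `toy_embed` for an embedding model and the in-memory dictionary for a vector database with an approximate nearest neighbor index; the shape of the workflow stays the same.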

Unstructured data, which has no predefined data model or schema, makes up a large portion of the world's data, and 90% of the data is never even analyzed. It includes text, images, video and audio.

Common Tasks Preparing Data for LLMs

Documents

Deduplicating docs
Extracting text from PDFs and other documents
Removing excessive markup
Tokenizing and Chunking
Assessing doc quality
PII detection
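Two of the document tasks above, deduplication and chunking, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not how the Data Prep Kit implements them: duplicates are detected by exact hash of normalized text (fuzzy matching is out of scope), and chunks are counted in whitespace-separated words rather than model tokens. The function names are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash identically."""
    return " ".join(text.lower().split())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def chunk_words(text: str, size: int = 5, overlap: int = 2) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


docs = deduplicate(["Hello  World", "hello world", "a b c d e f g h"])
chunks = chunk_words(docs[-1], size=5, overlap=2)
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.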

Code

Language detection
Malware detection
Code quality
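For code, the simplest form of language detection is a lookup on the file extension. The sketch below assumes a small, hypothetical mapping chosen for illustration; production tools typically also inspect file contents, which is not shown here.

```python
from pathlib import PurePath

# Hypothetical, deliberately small mapping used only for this example.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".java": "Java",
    ".go": "Go",
    ".rs": "Rust",
    ".ts": "TypeScript",
}


def detect_language(path: str) -> str:
    """Guess the programming language from the file extension alone."""
    suffix = PurePath(path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix, "unknown")


languages = [detect_language(p) for p in ["src/app.py", "src/Main.JAVA", "README"]]
```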

The Data Prep Kit

https://github.com/sujee/data-prep-kit-examples/tree/main/dpk-intro

The Data Prep Kit is an open source, modular set of Python libraries created by IBM that scales from a laptop up to clusters. It helps with data preparation for both documents and code, with many ready-to-use modules available directly from the library.


With the Data Prep Kit you can take code, HTML or PDF documents and quickly turn them into Parquet files in a single step. You can then add a deduplication step, validate document quality, and chunk and embed your documents into a vector database for use in RAG, fine-tuning and instruction tuning. There are a lot of options you can add to your pipeline, with more coming, so please feel free to develop your own module. I have a few I just started on.

You can then deploy to Ray, Spark or Kubeflow for production enterprise workloads.
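The idea of a pipeline built from small, swappable steps can be sketched as a list of functions applied in order to a table of rows. This does not use or imitate the Data Prep Kit API: the step names, the `run_pipeline` helper and the in-memory list of dictionaries (standing in for rows of a Parquet file) are all assumptions made for illustration.

```python
from typing import Callable

Row = dict  # one row per document, e.g. {"id": 1, "text": "..."}
Step = Callable[[list[Row]], list[Row]]


def drop_exact_duplicates(rows: list[Row]) -> list[Row]:
    """Keep only the first row for each distinct text value."""
    seen = set()
    kept = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            kept.append(row)
    return kept


def drop_short_documents(rows: list[Row]) -> list[Row]:
    """Crude quality filter: drop rows with fewer than three words."""
    return [row for row in rows if len(row["text"].split()) >= 3]


def add_word_count(rows: list[Row]) -> list[Row]:
    """Annotate each row with a word_count column without mutating the input."""
    return [{**row, "word_count": len(row["text"].split())} for row in rows]


def run_pipeline(rows: list[Row], steps: list[Step]) -> list[Row]:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        rows = step(rows)
    return rows


rows = [
    {"id": 1, "text": "Milvus stores vectors for search"},
    {"id": 2, "text": "Milvus stores vectors for search"},
    {"id": 3, "text": "too short"},
]
result = run_pipeline(rows, [drop_exact_duplicates, drop_short_documents, add_word_count])
```

Because every step takes rows and returns rows, adding your own module amounts to writing one more function with the same shape and appending it to the list.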

Open Source RAG

Retrieval-Augmented Generation (RAG) is a technique that combines retrieval-based and generative models: relevant documents are retrieved from a knowledge base and supplied to the LLM along with the prompt. This improves accuracy and reduces hallucinations. It is a great way to inject domain-specific knowledge into your prompts.
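The retrieve-then-generate flow can be sketched without any external services. This is a toy under stated assumptions: retrieval ranks passages by word overlap rather than by vector similarity, the three-sentence knowledge base is made up for the example, and the final call to an LLM is left out, so the sketch stops at the augmented prompt. The function names are hypothetical.

```python
import re

# Tiny illustrative knowledge base; a real one would live in a vector database.
KNOWLEDGE_BASE = [
    "Milvus is an open source vector database.",
    "Parquet is a columnar file format.",
    "Ray is a framework for scaling Python workloads.",
]


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words of the text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by how many words they share with the question."""
    q = tokens(question)
    ranked = sorted(passages, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Inject the retrieved passages into the prompt as grounding context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


question = "What is Milvus?"
prompt = build_prompt(question, retrieve(question, KNOWLEDGE_BASE, top_k=1))
```

The resulting `prompt` string is what would be sent to the generative model, which is how domain-specific knowledge reaches the LLM without retraining it.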

We can do all of this with open source software, models and tools. For our Open Source RAG stack we use Docling, Data Prep Kit, Hugging Face, LangChain, LlamaIndex, Llama, IBM Granite, Milvus and Python.

In the follow-up video we will walk you through how to build your own using these open source tools, and we are scheduling workshops around the country. Let's build amazing AI applications in open source.

References

Keywords: AI Alliance, Open Source RAG, Unstructured Data, LLM Applications, Data Prep Kit, LlamaIndex, Milvus, Vector Database


#AIAlliance Thanks to Dave Nielsen, Santosh Borse, Dean Wampler, Stefano Martiniani, IBM, Lightning.AI, Rob Levy, Sarmad Qadri, LastMile.AI, Adnan Hoque, Andrew Hoh, Phil Chang, The Community, and all attendees.
