
ETL for LLMs

Soma Sundaram

Senior Database Administration - Advisor at CardConnect (a Fiserv Company)

Published: April 13, 2024

Thrilled to have completed the exciting short course "Preprocessing Unstructured Data for LLM Applications" this weekend.

Having worked on traditional data warehousing and data integration projects, I have historically found it a challenge to integrate unstructured data and normalize it for analysis. Thanks to the evolution of AI research techniques and LLMs, automatically detecting the data or document type and partitioning the information into small chunks [chunking, which is similar to partitioning in the RDBMS world for improved processing and maintenance] seems easier. Here unstructured.io comes to the rescue for preprocessing unstructured data, and it is aptly described as "ETL for LLMs".
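To make the chunking idea concrete, here is a minimal sketch in plain Python. It is not the unstructured library's API; the `Chunk` class, the `chunk_paragraphs` function, and the `max_chars` limit are hypothetical names chosen for illustration, and the greedy paragraph-packing strategy is just one simple assumption about how chunks could be formed.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int  # position of the chunk in the document
    text: str   # chunk content


def chunk_paragraphs(text: str, max_chars: int = 200) -> list[Chunk]:
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk
    rather than being split mid-sentence.
    """
    # Paragraphs are assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(Chunk(len(chunks), current))
            current = para
        else:
            current = candidate
    if current:
        chunks.append(Chunk(len(chunks), current))
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for chunk in chunk_paragraphs(sample, max_chars=40):
        print(chunk.index, repr(chunk.text))
```

Much like a partitioned table, each chunk can then be processed, indexed, or refreshed on its own instead of handling the whole document at once.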

About unstructured.io

The unstructured library is designed to help preprocess and structure unstructured text documents for use in downstream machine learning tasks. Examples of documents that can be processed using the unstructured library include PDFs, XML and HTML documents.
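As a rough illustration of what "structuring" a document can mean, the sketch below splits an HTML string into a flat list of typed elements using only Python's standard library. It does not call the unstructured library; the `ElementPartitioner` class, the `partition_html_sketch` function, and the category labels are illustrative assumptions, not the library's actual interface.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to element categories.
TAG_TO_CATEGORY = {
    "h1": "Title", "h2": "Title", "h3": "Title",
    "p": "NarrativeText",
    "li": "ListItem",
}


class ElementPartitioner(HTMLParser):
    """Collect the text of headings, paragraphs, and list items as elements."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[dict] = []
        self._category = None  # category of the element being read, if any
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_CATEGORY:
            self._category = TAG_TO_CATEGORY[tag]
            self._buffer = []

    def handle_data(self, data):
        # Keep text only while inside a tracked element (inline tags included).
        if self._category is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._category is not None and tag in TAG_TO_CATEGORY:
            # Normalize whitespace before storing the element text.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.elements.append({"category": self._category, "text": text})
            self._category = None
            self._buffer = []


def partition_html_sketch(html: str) -> list[dict]:
    parser = ElementPartitioner()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    page = "<h1>ETL for LLMs</h1><p>Chunking is <b>useful</b>.</p><ul><li>PDF</li></ul>"
    for element in partition_html_sketch(page):
        print(element)
```

The point of the normalized output is that downstream steps can treat every source format the same way: a list of elements, each with a category and its text.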

The unstructured library supports a myriad of file types, both through its hosted offering and through its API.

https://unstructured.io/blog/understanding-what-matters-for-llm-ingestion-and-preprocessing

I recommend reading more about its capabilities on their blog.

By the way, unstructured.io is one of the top AI companies on the AI 50 list published by Forbes last week: https://www.forbes.com/lists/ai50

Unstructured and RAG Model architecture - A perfect combo:

References:

OCR-free Document Understanding Transformer [link]
TableFormer: Table Structure Understanding with Transformers [link]
YOLOX: Exceeding YOLO Series in 2021 [link]
Unstructured examples [link]
