ETL for LLMs
Soma Sundaram
Senior Database Administration - Advisor at CardConnect (a Fiserv Company)
Thrilled to have completed the exciting short course "Preprocessing Unstructured Data for LLM Applications" this weekend.
Having worked on traditional data warehousing and data integration projects, I have long found it a challenge to integrate unstructured data and normalize it for analysis. Thanks to advances in AI research and LLMs, automatically detecting the document type and partitioning the information into small chunks now seems easier. (Chunking is similar to partitioning in the RDBMS world, where it improves processing and maintenance.) Here unstructured.io comes to the rescue for preprocessing unstructured data, and it is aptly described as "ETL for LLMs".
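To make the chunking idea concrete, here is a minimal sketch in plain Python. It is not the unstructured library's API: the function name chunk_paragraphs and the max_chars parameter are made up for illustration, and the strategy (grouping blank-line-separated paragraphs up to a size limit) is just one simple way chunking could be done.

```python
def chunk_paragraphs(text: str, max_chars: int = 200) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    Hypothetical helper for illustration only. A single paragraph longer
    than max_chars is kept whole as its own (oversized) chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty paragraphs
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    for i, chunk in enumerate(chunk_paragraphs(sample, max_chars=40)):
        print(i, repr(chunk))
```

Much like a partitioned table in an RDBMS, each chunk can then be processed, indexed, or refreshed independently of the others.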
About unstructured.io
The unstructured library is designed to help preprocess and structure unstructured text documents for use in downstream machine learning tasks. Examples of documents that can be processed using the unstructured library include PDFs, XML and HTML documents.
The library supports a wide range of file types, and it is available both as a hosted service and through an API.
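As a rough picture of what "partitioning" a document means, the toy sketch below splits an HTML string into a flat list of labeled text elements using only Python's standard library. It is a stand-in written for this post, not the unstructured library itself: the class ElementPartitioner, the function partition_html_toy, and the tag-to-label mapping are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class ElementPartitioner(HTMLParser):
    """Toy partitioner: collects text from a few block-level tags."""

    # Illustrative mapping from HTML tag to an element label (an assumption).
    CATEGORIES = {
        "h1": "Title",
        "h2": "Title",
        "h3": "Title",
        "p": "NarrativeText",
        "li": "ListItem",
    }

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[tuple[str, str]] = []  # (label, text) pairs
        self._tag: str | None = None  # block tag currently being read
        self._buf: list[str] = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.CATEGORIES:
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace and drop empty elements.
            text = " ".join("".join(self._buf).split())
            if text:
                self.elements.append((self.CATEGORIES[tag], text))
            self._tag = None


def partition_html_toy(html: str) -> list[tuple[str, str]]:
    """Return (label, text) pairs for the supported tags, in document order."""
    parser = ElementPartitioner()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    doc = "<h1>Report</h1><p>Sales rose in Q1.</p><ul><li>Item one</li></ul>"
    for label, text in partition_html_toy(doc):
        print(f"{label}: {text}")
```

The point of the normalized output is that downstream steps such as chunking or embedding no longer need to care whether the source was HTML, XML, or a PDF.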
I recommend reading more about its capabilities on their blog.
By the way, unstructured.io is one of the top AI companies on the AI 50 list published by Forbes last week: https://www.forbes.com/lists/ai50
Unstructured and RAG Model architecture - A perfect combo: