Unstructured Data for AI
Simon Ludwigs
Director at Alvarez and Marsal Southeast Asia and Australia | Digital and Technology Services | Private Equity Services | CIO Services
TL;DR The quality of data preparation is crucial for successful AI projects. Poor data quality leads to "Garbage In, Garbage Out" results. Traditional tools like OCR struggle with complex documents, tables, and charts such as financial reports and Excel workbooks. However, new technologies like Vision-Language Models (VLMs) are emerging to address these limitations. Projects and companies like Unstructured.io, LlamaParse, ColPali, and SpreadsheetLLM are pioneering data extraction solutions using VLMs and other advanced techniques.
Good Data = Good AI, or more precisely "Garbage In, Garbage Out", captures the pitfalls of poor data quality in every AI project. As a business consultant, I witness firsthand the importance of data collection and representation in every AI project. It is a fundamental aspect, and yet the biggest and most underestimated workload in any AI initiative.
In my own work, I frequently engage with company reports, pitch decks, information memorandums, and data rooms. These documents are treasure troves of information. However, they pose a significant challenge for my AI and Large Language Model (LLM) projects due to their complex nature. Extracting meaningful data from PDFs, PowerPoint presentations, and Excel sheets is a daunting task because these formats often contain mixed content, including charts and images, which consultants tend to overload with information. Despite the range of tools available for Extract, Transform, Load (ETL) processes aimed at structured and semi-structured data, a substantial portion of enterprise data remains untapped because traditional tools, such as Optical Character Recognition (OCR) technologies, fall short. They struggle with layout nuances and hit a wall when faced with tables and charts.
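To make the problem concrete, here is a tiny illustration (with made-up cell data) of why naive OCR-style extraction hurts: reading a table left to right produces an undifferentiated token stream, while a layout-aware representation, for example Markdown, keeps the row/column relationships an LLM needs.

```python
# Hypothetical table data for illustration only.
cells = [
    ["Metric", "FY22", "FY23"],
    ["Revenue", "10.2", "12.8"],
    ["EBITDA", "1.4", "2.1"],
]

# Naive extraction: one flat stream of tokens, structure is lost.
naive = " ".join(token for row in cells for token in row)

# Layout-aware extraction: keep the table structure (here as Markdown),
# so a model can still tell which value belongs to which fiscal year.
header, *body = cells
markdown = "\n".join(
    ["| " + " | ".join(header) + " |",
     "|" + "---|" * len(header)]
    + ["| " + " | ".join(row) + " |" for row in body]
)

print(naive)
print(markdown)
```

With the flat stream, "10.2" is just a number floating between words; with the structured form, it is unambiguously FY22 Revenue.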
But AI seems to solve its own problem with Vision-Language Models (VLMs). These models address the limitations of traditional OCR by considering the visual context of the data. VLMs are designed to understand and interpret complex layouts, ensuring that even the most intricately formatted documents are not beyond the reach of AI.
Several startups are pioneering this space, each with unique approaches to overcoming the data extraction challenge. By leveraging advanced VLM technology, they aim to unlock the full potential of enterprise data, enabling businesses to harness the power of AI more effectively.
LlamaParse [https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/ ] is a document parsing tool by LlamaIndex, designed to handle complex documents with tables, figures, and embedded objects. It integrates seamlessly with LlamaIndex's ingestion and retrieval services, enabling precise data extraction and query answering for RAG applications. LlamaParse supports a variety of document types, including PDFs, Word documents, and PowerPoint presentations. Its advanced capabilities allow for natural-language parsing instructions, significantly improving accuracy and efficiency. Available in public preview, it also offers a managed ingestion and retrieval API for enterprise use. Unfortunately, the source code and models used remain closed.
UNSTRUCTURED - That is why I am even more fond of unstructured.io, which has a very solid open-source community on GitHub besides its enterprise-ready API services. Their platform excels at transforming unstructured data into LLM-compatible formats, making it a valuable tool for enterprises seeking to optimize their data processing workflows. The platform supports over 30 built-in connectors, enabling seamless integration with various data sources and destinations. Additionally, features like customizable chunking, embedding strategies, and error handling ensure high performance and reliability.
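To illustrate what "customizable chunking" means in practice, here is a minimal plain-Python sketch in the spirit of title-aware chunking, as offered by platforms like unstructured.io. The function name, element format, and sample data are my own inventions for illustration; the library's real API differs.

```python
# Sketch of title-aware chunking: start a new chunk at each title element
# or when the character budget would be exceeded. Element tuples and the
# sample data below are hypothetical, not the unstructured.io API.
def chunk_by_title(elements, max_chars=200):
    chunks, current, size = [], [], 0
    for kind, text in elements:
        if current and (kind == "title" or size + len(text) > max_chars):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks

elements = [
    ("title", "1. Market Overview"),
    ("text", "Demand grew steadily across the region."),
    ("title", "2. Financials"),
    ("text", "Revenue increased 25% year over year."),
]
for chunk in chunk_by_title(elements):
    print(chunk, end="\n---\n")
```

Keeping each section heading together with its body text gives the retrieval step semantically coherent chunks, which is exactly why chunking strategy matters for RAG quality.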
ColPali goes even further, using only a VLM to extract information. Instead of first extracting the text from the document, ColPali embeds an image of the document page directly, preserving most of the information. To achieve good performance with this concept, it leverages modern Vision-Language Models that can read and understand text, tables, and figures in images. The model can be found here: https://huggingface.co/vidore/colpali
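ColPali scores pages with ColBERT-style late interaction: each query-token embedding is matched against its best page-patch embedding, and the per-token maxima are summed ("MaxSim"). The toy sketch below uses random vectors in place of real model embeddings, purely to show the scoring mechanics.

```python
import numpy as np

# Toy MaxSim scoring: embeddings are random stand-ins; in ColPali they
# come from a Vision-Language Model that embeds query tokens and page patches.
rng = np.random.default_rng(0)

def maxsim_score(query_tokens, page_patches):
    # Similarity matrix of shape (num_query_tokens, num_patches);
    # take the best patch per query token, then sum over tokens.
    sims = query_tokens @ page_patches.T
    return sims.max(axis=1).sum()

query = rng.normal(size=(5, 128))                        # 5 query-token embeddings
pages = [rng.normal(size=(32, 128)) for _ in range(3)]   # 3 pages, 32 patches each

scores = [maxsim_score(query, p) for p in pages]
best_page = int(np.argmax(scores))
print(f"best matching page: {best_page}")
```

Because each query token independently finds its best-matching region of the page image, the approach tolerates tables and figures that would defeat plain text extraction.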
As the industry evolves, new players are joining, such as the Y Combinator-backed Trellis AI: 'Trellis converts your unstructured data into SQL-compliant tables with a schema you define in natural language. With Trellis, you can now run SQL queries on complex data sources like financial documents, contracts, and emails. Our AI engine guarantees accurate schema and results.' https://www.ycombinator.com/companies/trellis
Omni (https://getomni.ai/), also backed by Y Combinator, provides an AI platform dedicated to handling unstructured data, offering advanced AI tools to extract and transform data from sources like documents, phone calls, and chat messages into database-like, searchable entries. By integrating industry-leading models such as Llama 3, Mistral, and Claude, OmniAI ensures efficient and accurate data processing within the customer's existing infrastructure.
Excel & LLMs - Excel is a very common source of information in my field. Unfortunately, current LLMs, including GPT-4, still face significant challenges when dealing with Excel data (especially when sheets are not formatted as a single table per sheet). However, recent research has made strides in this area. A notable example is SpreadsheetLLM, which introduces an efficient encoding method designed specifically for spreadsheets. Its framework, SheetCompressor, significantly enhances LLMs' ability to understand and process spreadsheets through structural-anchor-based compression, inverse index translation, and data-format-aware aggregation.
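One of those three ideas, inverse index translation, is easy to sketch: instead of serializing every cell, map each distinct value to the cells that contain it, which compresses sheets full of repeated values. The cell data below is made up, and this toy omits SheetCompressor's other stages (structural anchors, format-aware aggregation).

```python
# Toy inverse-index compression in the spirit of SpreadsheetLLM's
# SheetCompressor: value -> sorted list of cell addresses holding it.
def inverse_index(cells):
    """cells: dict mapping cell address (e.g. 'A1') to its value."""
    index = {}
    for addr, value in cells.items():
        index.setdefault(value, []).append(addr)
    return {value: sorted(addrs) for value, addrs in index.items()}

cells = {
    "A1": "Region", "B1": "Sales",
    "A2": "EU", "A3": "EU", "A4": "US",
    "B2": 100, "B3": 100, "B4": 250,
}
print(inverse_index(cells))
```

On a real sheet with thousands of repeated labels and formats, this kind of deduplication is what lets a large spreadsheet fit into an LLM's context window.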
In conclusion, the success of any AI project hinges on the quality of data preparation. While tools and technologies like unstructured.io, LlamaParse, ColPali, and SpreadsheetLLM are evolving, the fundamental principle remains unchanged: good AI outcomes require good data. As we continue to innovate and improve data extraction methods, the future of AI looks promising, with more accurate and insightful models driving business success. What other tools or ETL pipelines have you used for data extraction and LLM preparation?
Disclaimer: The views and opinions expressed in this article are solely those of the author and do not necessarily represent the official policy or position of any affiliated organization, including my employer and the publisher of this article.
Simon, thanks for sharing!
Founder @ Bridge2IT +32 471 26 11 22 | Business Analyst @ Carrefour Finance
4 months ago: "Unstructured Data for AI" explores the crucial role of unstructured data in artificial intelligence. From text and images to audio and video, unstructured data provides rich, diverse information that AI models can leverage for better insights and decision-making. Understanding and utilizing this data type is essential for advancing AI capabilities and achieving more nuanced, accurate results.
Founder & CEO | Digital M&A Thought Leader | Make M&A more efficient by applying online software (see how in 'about me')
4 months ago: Great article, Simon, and good to read from you. AI indeed excels at transforming unstructured data into structured formats (including garbage ;-). But in the digital age, why are we still dealing with documents, especially in M&A? We prioritize structured data, which also allows us to apply AI for superior data quality. If you have time, please DM me; I would like to learn more about your project and LlamaParse.
IoT | Power Platform | AWS & Azure certified | DX
4 months ago: Thanks for this article. Nice review of tools; I can see it working for streamlining internal processes for starters!
Director at Alvarez and Marsal Southeast Asia and Australia | Digital and Technology Services | Private Equity Services | CIO Services
4 months ago: Manuel Faysse thanks for your research on this topic