Unstructured Data for AI
Simon Ludwigs
Director at Alvarez and Marsal Southeast Asia and Australia | Digital and Technology Services | Private Equity Services | CIO Services
TL;DR The quality of data preparation is crucial for successful AI projects. Poor data quality leads to "Garbage In, Garbage Out" results. Traditional tools like OCR struggle with complex documents, tables, and charts such as financial reports and Excel workbooks. However, new technologies like Vision-Language Models (VLMs) are emerging to address these limitations. Projects and companies like Unstructured.io, LlamaParse, ColPali, and SpreadsheetLLM are pioneering data extraction solutions using VLMs and other advanced techniques.
Good Data = Good AI, or more precisely "Garbage In, Garbage Out", captures the pitfalls of poor data quality in every AI project. As a business consultant, I witness firsthand the importance of data collection and representation in every AI project. It is a fundamental aspect, and yet the biggest and most underestimated workload in any AI initiative.
In my own work, I frequently engage with company reports, pitch decks, information memorandums, and data rooms. These documents are treasure troves of information. However, they pose a significant challenge for my AI and Large Language Model (LLM) projects due to their complex nature. Extracting meaningful data from PDFs, PowerPoint presentations, and Excel sheets is a daunting task because these formats often contain mixed content, including charts and images, which consultants tend to overload with information. Despite the range of tools available for Extract, Transform, Load (ETL) processes aimed at structured and semi-structured data, a substantial portion of enterprise data remains untapped because traditional tools, such as Optical Character Recognition (OCR) technologies, fall short. They struggle with layout nuances and hit a wall when faced with tables and charts.
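To make the problem concrete, here is a tiny illustration (with made-up cell data) of why naive OCR-style extraction hurts: reading a table left to right produces an undifferentiated token stream, while a layout-aware representation, for example Markdown, keeps the row/column relationships an LLM needs.

```python
# Hypothetical table data for illustration only.
cells = [
    ["Metric", "FY22", "FY23"],
    ["Revenue", "10.2", "12.8"],
    ["EBITDA", "1.4", "2.1"],
]

# Naive extraction: one flat stream of tokens, structure is lost.
naive = " ".join(token for row in cells for token in row)

# Layout-aware extraction: keep the table structure (here as Markdown),
# so a model can still tell which value belongs to which fiscal year.
header, *body = cells
markdown = "\n".join(
    ["| " + " | ".join(header) + " |",
     "|" + "---|" * len(header)]
    + ["| " + " | ".join(row) + " |" for row in body]
)

print(naive)
print(markdown)
```

With the flat stream, "10.2" is just a number floating between words; with the structured form, it is unambiguously FY22 Revenue.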
But AI seems to solve its own problem with Vision-Language Models (VLMs). These models address the limitations of traditional OCR by considering the visual context of the data. VLMs are designed to understand and interpret complex layouts, ensuring that even the most intricately formatted documents are not beyond the reach of AI.
Several startups are pioneering this space, each with unique approaches to overcoming the data extraction challenge. By leveraging advanced VLM technology, they aim to unlock the full potential of enterprise data, enabling businesses to harness the power of AI more effectively.
LlamaParse [https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/ ] is a document parsing tool by LlamaIndex, designed to handle complex documents with tables, figures, and embedded objects. It integrates seamlessly with LlamaIndex's ingestion and retrieval services, enabling precise data extraction and query answering for RAG applications. LlamaParse supports a variety of document types, including PDFs, Word documents, and PowerPoint presentations. Its advanced capabilities allow for natural-language parsing instructions, significantly improving accuracy and efficiency. Available in public preview, it also offers a managed ingestion and retrieval API for enterprise use. Unfortunately, the source code and models used remain closed.
UNSTRUCTURED - That is why I am even more fond of unstructured.io, which has a very solid open-source community on GitHub besides its enterprise-ready API services. Their platform excels at transforming unstructured data into LLM-compatible formats, making it a valuable tool for enterprises seeking to optimize their data processing workflows. The platform supports over 30 built-in connectors, enabling seamless integration with various data sources and destinations. Additionally, features like customizable chunking, embedding strategies, and error handling ensure high performance and reliability.
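To illustrate what "customizable chunking" means in practice, here is a minimal plain-Python sketch in the spirit of title-aware chunking, as offered by platforms like unstructured.io. The function name, element format, and sample data are my own inventions for illustration; the library's real API differs.

```python
# Sketch of title-aware chunking: start a new chunk at each title element
# or when the character budget would be exceeded. Element tuples and the
# sample data below are hypothetical, not the unstructured.io API.
def chunk_by_title(elements, max_chars=200):
    chunks, current, size = [], [], 0
    for kind, text in elements:
        if current and (kind == "title" or size + len(text) > max_chars):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks

elements = [
    ("title", "1. Market Overview"),
    ("text", "Demand grew steadily across the region."),
    ("title", "2. Financials"),
    ("text", "Revenue increased 25% year over year."),
]
for chunk in chunk_by_title(elements):
    print(chunk, end="\n---\n")
```

Keeping each section heading together with its body text gives the retrieval step semantically coherent chunks, which is exactly why chunking strategy matters for RAG quality.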
ColPali goes even further, using only a VLM to extract information. Instead of first extracting the text from the document, ColPali embeds an image of the document page directly, preserving most of the information. To achieve good performance with this concept, it leverages modern Vision-Language Models that can read and understand text, tables, and figures in images. The model can be found here: https://huggingface.co/vidore/colpali
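ColPali scores pages with ColBERT-style late interaction: each query-token embedding is matched against its best page-patch embedding, and the per-token maxima are summed ("MaxSim"). The toy sketch below uses random vectors in place of real model embeddings, purely to show the scoring mechanics.

```python
import numpy as np

# Toy MaxSim scoring: embeddings are random stand-ins; in ColPali they
# come from a Vision-Language Model that embeds query tokens and page patches.
rng = np.random.default_rng(0)

def maxsim_score(query_tokens, page_patches):
    # Similarity matrix of shape (num_query_tokens, num_patches);
    # take the best patch per query token, then sum over tokens.
    sims = query_tokens @ page_patches.T
    return sims.max(axis=1).sum()

query = rng.normal(size=(5, 128))                        # 5 query-token embeddings
pages = [rng.normal(size=(32, 128)) for _ in range(3)]   # 3 pages, 32 patches each

scores = [maxsim_score(query, p) for p in pages]
best_page = int(np.argmax(scores))
print(f"best matching page: {best_page}")
```

Because each query token independently finds its best-matching region of the page image, the approach tolerates tables and figures that would defeat plain text extraction.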
As the industry evolves, new players are joining, such as the Y Combinator-backed Trellis AI: 'Trellis converts your unstructured data into SQL-compliant tables with a schema you define in natural language. With Trellis, you can now run SQL queries on complex data sources like financial documents, contracts, and emails. Our AI engine guarantees accurate schema and results.' https://www.ycombinator.com/companies/trellis
Omni (https://getomni.ai/), also backed by Y Combinator, provides an AI platform dedicated to handling unstructured data, offering advanced AI tools to extract and transform data from sources like documents, phone calls, and chat messages into database-like, searchable entries. By integrating industry-leading models such as Llama 3, Mistral, and Claude, OmniAI ensures efficient and accurate data processing within the customer's existing infrastructure.
Excel & LLMs - Excel is a very common source of information in my field. Unfortunately, current LLMs, including GPT-4, still face significant challenges when dealing with Excel data (especially when sheets are not formatted as a single table per sheet). However, recent research has made strides in this area. A notable example is SpreadsheetLLM, which introduces an efficient encoding method designed specifically for spreadsheets. Its framework, SheetCompressor, significantly enhances LLMs' ability to understand and process spreadsheets through structural-anchor-based compression, inverse index translation, and data-format-aware aggregation.
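One of those three ideas, inverse index translation, is easy to sketch: instead of serializing every cell, map each distinct value to the cells that contain it, which compresses sheets full of repeated values. The cell data below is made up, and this toy omits SheetCompressor's other stages (structural anchors, format-aware aggregation).

```python
# Toy inverse-index compression in the spirit of SpreadsheetLLM's
# SheetCompressor: value -> sorted list of cell addresses holding it.
def inverse_index(cells):
    """cells: dict mapping cell address (e.g. 'A1') to its value."""
    index = {}
    for addr, value in cells.items():
        index.setdefault(value, []).append(addr)
    return {value: sorted(addrs) for value, addrs in index.items()}

cells = {
    "A1": "Region", "B1": "Sales",
    "A2": "EU", "A3": "EU", "A4": "US",
    "B2": 100, "B3": 100, "B4": 250,
}
print(inverse_index(cells))
```

On a real sheet with thousands of repeated labels and formats, this kind of deduplication is what lets a large spreadsheet fit into an LLM's context window.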
In conclusion, the success of any AI project hinges on the quality of data preparation. While tools and technologies like unstructured.io, LlamaParse, ColPali, and SpreadsheetLLM are evolving, the fundamental principle remains unchanged: good AI outcomes require good data. As we continue to innovate and improve data extraction methods, the future of AI looks promising, with more accurate and insightful models driving business success. What other tools or ETL pipelines have you used for data extraction and LLM preparation?
Disclaimer: The views and opinions expressed in this article are solely those of the author and do not necessarily represent the official policy or position of any affiliated organization, including my employer and the publisher of this article.
Simon, thanks for sharing!
Founder @ Bridge2IT +32 471 26 11 22 | Business Analyst @ Carrefour Finance
4 months ago: "Unstructured Data for AI" explores the crucial role of unstructured data in artificial intelligence. From text and images to audio and video, unstructured data provides rich, diverse information that AI models can leverage for better insights and decision-making. Understanding and utilizing this data type is essential for advancing AI capabilities and achieving more nuanced, accurate results.
Founder & CEO | Digital M&A Thought Leader | Make M&A more efficient by applying online software (see how in 'about me')
4 months ago: Great article, Simon, and good to read from you. AI indeed excels at transforming unstructured data into structured formats (including garbage ;-). But in the digital age, why are we still dealing with documents, especially in M&A? We prioritize structured data, which also allows us to apply AI for superior data quality. If you have time, please DM me; I would like to learn more about your project and LlamaParse.
IoT | Power Platform | AWS & Azure certified | DX
4 months ago: Thanks for this article. Nice review of tools; I can see it working for streamlining internal processes for starters!
Director at Alvarez and Marsal Southeast Asia and Australia | Digital and Technology Services | Private Equity Services | CIO Services
4 months ago: Manuel Faysse thanks for your research on this topic