How do we evaluate multimodal models for key enterprise tasks?
Balaji Viswanathan Ph.D.
Building document AI at scale -- organizing, searching and summarizing enterprise data.
Many of the benchmarks used to evaluate LLMs come from toy problems that researchers create. While it is interesting to see whether GPT-4 scores well on biology questions from the MMLU benchmark, enterprises want to see models tested on problems closer to their own use cases.
I'm working on a benchmark for table cognition. The goal is to build a new dataset that the models have not seen, where the task involves pulling information from tables and answering questions about them.
The benchmark, TableCog, is a dataset of tables from a variety of industries, arranged in increasing order of difficulty. In this article I will show examples from each level.
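As one way to picture the dataset, here is a minimal sketch of what a benchmark item and an exact-match scorer could look like. The schema, field names, and the sample item are my assumptions for illustration, not the published TableCog format:

```python
from dataclasses import dataclass

@dataclass
class TableCogItem:
    level: int          # 1 (simple text) .. 7 (handwritten)
    industry: str       # e.g. "finance", "healthcare"
    image_path: str     # page image shown to the model (hypothetical path)
    question: str
    gold_answer: str

def exact_match(prediction: str, gold: str) -> bool:
    """Score a model answer: case- and whitespace-insensitive exact match."""
    return prediction.strip().lower() == gold.strip().lower()

# Hypothetical Level 2 item for illustration
item = TableCogItem(
    level=2,
    industry="finance",
    image_path="tables/finance_q2.png",
    question="What was Q2 revenue?",
    gold_answer="$4.3M",
)
print(exact_match(" $4.3m ", item.gold_answer))  # True
```

Exact match works for the lower levels; harder levels would likely need fuzzier scoring.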
Level 1: Simple Text
Here we have a simple text table from a manual in the tech industry. It can be easily extracted by most PDF reader tools such as PyPDF. The model should be able to answer questions such as:
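For this level, plain-text extraction is usually enough. The sketch below assumes a PDF reader has already returned the table as whitespace-separated text, then parses it into rows so a lookup question can be answered. The table contents and the question are made up for illustration:

```python
def parse_text_table(text: str) -> list[dict]:
    """Parse a whitespace-separated text table (as a PDF reader might
    return it) into a list of row dicts keyed by the header row."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]

def lookup(rows: list[dict], key_col: str, key: str, value_col: str) -> str:
    """Answer a simple 'what is X for Y' question against the parsed table."""
    for row in rows:
        if row.get(key_col) == key:
            return row[value_col]
    raise KeyError(key)

# Hypothetical fragment of a tech manual's parts table
text = """Part Voltage Current
PSU-100 12V 5A
PSU-200 24V 3A"""

rows = parse_text_table(text)
print(lookup(rows, "Part", "PSU-200", "Voltage"))  # 24V
```

This naive parser is exactly what breaks at the higher levels, where cells contain spaces or spill outside the grid.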
Level 2: Cluttered Table
Here we have a slightly more cluttered table from the finance industry, where the data spills into text beyond the grid. Regular PDF readers are likely to butcher this content.
Here the questions would be:
Level 3: Table surrounded by content
Here we have a healthcare report where the table is only a small part of the image (about 20%). There is a lot of surrounding content the model has to ignore; it must pull out only the necessary information and handle grid lines that sit very close to the text. The table might also be in image form.
The model has to answer questions such as:
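One way to pose such a question to a vision model is to embed the page image and explicitly instruct the model to ignore the surrounding content. The sketch below builds a message in the OpenAI chat-completions vision format; the question and image bytes are placeholders, and the prompt wording is only a suggestion:

```python
import base64

def build_table_prompt(image_bytes: bytes, question: str) -> list[dict]:
    """Build a chat-completions message that asks a vision model to
    answer only from the table, ignoring the surrounding report text."""
    b64 = base64.b64encode(image_bytes).decode()
    instruction = (
        "The image is a report; only part of it is a table. "
        "Ignore the surrounding text and answer the question using "
        "the table alone. Question: " + question
    )
    return [{"role": "user", "content": [
        {"type": "text", "text": instruction},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}]

# Placeholder bytes and a hypothetical healthcare question
messages = build_table_prompt(b"fake-png-bytes", "What is the patient's LDL level?")
print(messages[0]["content"][1]["type"])  # image_url
```

Whether such an instruction actually stops a model from leaking surrounding context into its answer is exactly what this level tests.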
Level 4: Large tables adjacent to each other
Here we have a much larger table, and multiple tables next to each other. This would trip up most multimodal models.
Level 5: A combination of tables
Here we have a low-res image of an invoice.
Key questions:
While this is a harder task than the previous ones, many models have been fine-tuned on invoice-related tasks and thus might score higher.
Level 6: Chart
Here the task is to extract the values out of the chart and populate the table as best as possible.
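Since values read off a chart are inherently approximate, exact-match scoring is too strict at this level. A relative-tolerance check is one option; the 5% threshold below is my assumption, not the benchmark's stated metric:

```python
def chart_value_score(pred: float, truth: float, tol: float = 0.05) -> bool:
    """Count a chart-read value as correct if within a relative tolerance.
    Falls back to an absolute check when the true value is zero."""
    if truth == 0:
        return abs(pred) <= tol
    return abs(pred - truth) / abs(truth) <= tol

# e.g., the model reads 4.1 off a bar whose true value is 4.0
print(chart_value_score(4.1, 4.0))  # True (2.5% error)
print(chart_value_score(5.0, 4.0))  # False (25% error)
```

A per-cell score like this can then be averaged over the whole reconstructed table.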
Level 7: Messy handwritten tables
In many developing countries, it is not rare to see handwritten tables in invoices, receipts, annotations, and notes.
Questions to answer from this table:
I'm pulling this content from a variety of industries across the world, in multiple languages, multiple domains, and multiple levels of difficulty.
I will release the final benchmark in two weeks. In the meanwhile, is there a type of table you want to see in this benchmark? Leave your answer in the comments.