How do we evaluate multimodal models for key enterprise tasks?

Many of the benchmarks used to evaluate LLMs come from toy problems that researchers create. While it is interesting to see whether GPT-4 scores well on biology questions from the MMLU benchmark, enterprises want to see models tested on problems closer to their actual use cases.

I'm working on a benchmark for table cognition. The goal is to build a new dataset that the models have not seen, where the task involves pulling information from tables and answering questions about them.

I'm building this new benchmark, TableCog, as a dataset of tables from a variety of industries, in increasing order of difficulty. In this article I will show examples from each level.
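Concretely, each item in the benchmark pairs a table image with questions and reference answers. Below is a minimal sketch of how one item could be structured in Python; the field names are my own placeholders, not the released schema.

  # A minimal sketch of one TableCog item; field names are placeholders.
  from dataclasses import dataclass, field

  @dataclass
  class TableCogItem:
      level: int                  # 1 (simple text) through 7 (handwritten)
      industry: str               # e.g. "finance", "healthcare", "tech"
      language: str               # tables come in multiple languages
      image_path: str             # rendered table (image or PDF page)
      questions: list[str] = field(default_factory=list)
      answers: list[str] = field(default_factory=list)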

Level 1: Simple Text

Here we have a simple text table from a manual in the tech industry. This can be easily extracted by most PDF reader tools such as PyPDF (a minimal extraction sketch follows the questions below). The model should be able to answer questions such as:

  1. What does TRB-009 mean?
  2. What is the solution when my printer shows TRB-001?
  3. What kinds of error codes are associated with the Installation stage?
  4. The Ref Code states it is TRB-008 and I have already waited 15 min. How much longer should I wait?
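For reference, here is a minimal extraction sketch using pypdf; "manual.pdf" is a placeholder filename for the document above.

  # Simple text tables usually survive plain text extraction intact.
  from pypdf import PdfReader

  reader = PdfReader("manual.pdf")  # placeholder filename
  text = "\n".join(page.extract_text() for page in reader.pages)
  print(text)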

Level 2: Cluttered Table

Here we have a slightly more cluttered table from the finance industry, where the data spills into text beyond the grid. Regular PDF readers are likely to butcher this content, so the table has to go to a multimodal model as an image; a sketch of that call follows the questions.

Here the questions would be:

  1. What is the Gross Carrying Amount for Jan 2023?
  2. What trend do you see in the Patents and licensed tech category?
  3. What is happening with the Amortization expenses?
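Here is a hedged sketch of posing the first question to a multimodal model via the OpenAI chat completions API; the model name and "table.png" are placeholders for whatever model and file you are testing.

  import base64
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  with open("table.png", "rb") as f:  # placeholder filename
      image_b64 = base64.b64encode(f.read()).decode("utf-8")

  response = client.chat.completions.create(
      model="gpt-4o",  # placeholder; any multimodal model under test
      messages=[{
          "role": "user",
          "content": [
              {"type": "text",
               "text": "What is the Gross Carrying Amount for Jan 2023?"},
              {"type": "image_url",
               "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
          ],
      }],
  )
  print(response.choices[0].message.content)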


Level 3: Table surrounded by content

Here we have a healthcare report where the table is only a small part of the image (about 20%). There is a lot of surrounding content that the model has to ignore, pulling out only the relevant values, and it must cope with grid lines that sit very close to the text. The table might also only be available as an image. One mitigation, sketched after the questions, is to crop the image down to the table region first.

The model has to answer questions such as:

  1. What key trends do you see in the table?
  2. How many seniors > 80 years wanted the Bring Water use case?
  3. What is the value difference in "Bring Medicine" between the two demographics?
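A simple cropping sketch with PIL; the filename and bounding box here are hypothetical, and in practice the box could come from a table-detection model or manual annotation.

  from PIL import Image

  report = Image.open("healthcare_report.png")  # placeholder filename
  table_box = (120, 840, 980, 1320)  # hypothetical (left, top, right, bottom)
  report.crop(table_box).save("table_only.png")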

Level 4: Large Tables adjacent to each other

Here we have a much larger table, as well as multiple tables placed next to each other. This would trip up most multimodal models. Sample questions (a cell-level comparison sketch follows the list):

  1. What is happening with the 1951/52 National Cup?
  2. What differences do you notice between the data from the Ground Truth Table and Model Prediction Table?
  3. What is the relationship among the three tables here?
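For question 2, comparing a predicted table against ground truth can be scored cell by cell. This is only an illustrative metric, not the benchmark's official one.

  def cell_accuracy(truth: list[list[str]], pred: list[list[str]]) -> float:
      # Every ground-truth cell counts toward the denominator; rows or
      # cells missing from the prediction simply score zero.
      total = sum(len(row) for row in truth)
      correct = 0
      for i, t_row in enumerate(truth):
          if i >= len(pred):
              continue
          for j, t_cell in enumerate(t_row):
              if j < len(pred[i]) and \
                 t_cell.strip().lower() == pred[i][j].strip().lower():
                  correct += 1
      return correct / total if total else 0.0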

Level 5: A combination of tables

Here we have a low-res image of an invoice: effectively several small tables (billing header, line items, totals) combined in one document.

Key questions:

  1. What is the Balance Due?
  2. When is the Due Date?
  3. How much of a discount did we get on this order?

While this is a harder task than the previous ones, many models have been fine-tuned on invoice-related tasks and thus might score higher.
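Grading numeric answers like the Balance Due needs some normalization, since models return strings like "$1,234.50". A sketch, with cleanup rules that are my own illustrative assumptions:

  import re

  def grade_numeric(pred: str, gold: float, tol: float = 0.01) -> bool:
      # Strip currency symbols, thousands separators, and whitespace
      # before comparing against the gold value.
      cleaned = re.sub(r"[^\d.\-]", "", pred)
      try:
          return abs(float(cleaned) - gold) <= tol
      except ValueError:
          return False

  print(grade_numeric("$1,234.50", 1234.50))  # True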

Level 6: Chart

Here the task is to extract the values out of the chart and populate a table as faithfully as possible.
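One way to frame this task is to prompt the model for a markdown table and parse it. The prompt wording and the parser below are my own assumptions about how the task could be posed, not the benchmark's protocol.

  PROMPT = ("Read the values from this chart and return them as a "
            "markdown table, one row per category. Use N/A for values "
            "you cannot read.")

  def parse_markdown_table(text: str) -> list[list[str]]:
      # Keep only lines that look like markdown table rows, dropping
      # the |---|---| separator row.
      rows = []
      for line in text.splitlines():
          stripped = line.strip()
          if not stripped.startswith("|"):
              continue
          cells = [c.strip() for c in stripped.strip("|").split("|")]
          if all(set(c).issubset("-: ") for c in cells):
              continue  # separator row
          rows.append(cells)
      return rows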

Level 7: Messy handwritten tables

In many developing countries it is not rare to see handwritten tables on invoices, receipts, annotations, and notes. A classic OCR baseline, sketched after the questions, typically fails here.

The model has to answer from this table:

  1. What was the cost of the soap stone?
  2. How many cutting nippers were bought?
  3. What was the total amount?
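As a point of comparison, here is the OCR baseline with pytesseract; "receipt.jpg" is a placeholder filename. Tesseract generally performs poorly on handwriting, which is exactly what makes this level hard for pipeline approaches.

  from PIL import Image
  import pytesseract

  # Classic OCR baseline; expect garbled output on messy handwriting.
  text = pytesseract.image_to_string(Image.open("receipt.jpg"))
  print(text)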


I'm pulling this content from a variety of industries across the world, in multiple languages, multiple domains, and multiple levels of difficulty.

I will release the final benchmark in 2 weeks. In the meantime, is there a type of table you would like to see in this benchmark? Leave your answer in the comments.


