How do we evaluate multimodal models for key enterprise tasks?

Many of the benchmarks used to evaluate LLMs come from toy problems that researchers create. While it is interesting to see whether GPT-4 scores well on biology questions from the MMLU benchmark, enterprises want to see models tested on problems closer to their actual use cases.

I'm working on a benchmark for table cognition. The goal is to build a new dataset that the models have not seen, where the task involves pulling information from tables and answering questions about them.

I'm building this new benchmark, TableCog, as a dataset of tables from a variety of industries, in increasing order of difficulty. In this article I will show examples from each level.
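Concretely, each item in the benchmark pairs a table image with questions and reference answers. Below is a minimal sketch of how one item could be structured in Python; the field names are my own placeholders, not the released schema.

  # A minimal sketch of one TableCog item; field names are placeholders.
  from dataclasses import dataclass, field

  @dataclass
  class TableCogItem:
      level: int                  # 1 (simple text) through 7 (handwritten)
      industry: str               # e.g. "finance", "healthcare", "tech"
      language: str               # tables come in multiple languages
      image_path: str             # rendered table (image or PDF page)
      questions: list[str] = field(default_factory=list)
      answers: list[str] = field(default_factory=list)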

Level 1: Simple Text

Here we have a simple text table from a manual in the tech industry. This can be easily extracted by most PDF reader tools such as PyPDF (a minimal extraction sketch follows the questions below). The model should be able to answer questions such as:

  1. What does TRB-009 mean?
  2. What is the solution when my printer shows TRB-001?
  3. What kinds of error codes are associated with the Installation stage?
  4. The Ref Code states it is TRB-008 and I have already waited 15 min. How much longer should I wait?
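For reference, here is a minimal extraction sketch using pypdf; "manual.pdf" is a placeholder filename for the document above.

  # Simple text tables usually survive plain text extraction intact.
  from pypdf import PdfReader

  reader = PdfReader("manual.pdf")  # placeholder filename
  text = "\n".join(page.extract_text() for page in reader.pages)
  print(text)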

Level 2: Cluttered Table

Here we have a slightly more cluttered table from the finance industry, where the data spills into text beyond the grid. Regular PDF readers are likely to butcher this content, so the table has to go to a multimodal model as an image; a sketch of that call follows the questions.

Here the questions would be:

  1. What is the Gross Carrying Amount for Jan 2023?
  2. What trend do you see in the Patents and licensed tech category?
  3. What is happening with the Amortization expenses?
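Here is a hedged sketch of posing the first question to a multimodal model via the OpenAI chat completions API; the model name and "table.png" are placeholders for whatever model and file you are testing.

  import base64
  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  with open("table.png", "rb") as f:  # placeholder filename
      image_b64 = base64.b64encode(f.read()).decode("utf-8")

  response = client.chat.completions.create(
      model="gpt-4o",  # placeholder; any multimodal model under test
      messages=[{
          "role": "user",
          "content": [
              {"type": "text",
               "text": "What is the Gross Carrying Amount for Jan 2023?"},
              {"type": "image_url",
               "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
          ],
      }],
  )
  print(response.choices[0].message.content)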


Level 3: Table surrounded by content

Here we have a healthcare report where the table is only a small part of the image (about 20%). There is a lot of surrounding content that the model has to ignore, pulling out only the relevant values, and it must cope with grid lines that sit very close to the text. The table might also only be available as an image. One mitigation, sketched after the questions, is to crop the image down to the table region first.

The model has to answer questions such as:

  1. What key trends do you see in the table?
  2. How many seniors > 80 years wanted the Bring Water use case?
  3. What is the value difference in "Bring Medicine" between the two demographics?
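A simple cropping sketch with PIL; the filename and bounding box here are hypothetical, and in practice the box could come from a table-detection model or manual annotation.

  from PIL import Image

  report = Image.open("healthcare_report.png")  # placeholder filename
  table_box = (120, 840, 980, 1320)  # hypothetical (left, top, right, bottom)
  report.crop(table_box).save("table_only.png")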

Level 4: Large Tables adjacent to each other

Here we have a much larger table, as well as multiple tables placed next to each other. This would trip up most multimodal models. Sample questions (a cell-level comparison sketch follows the list):

  1. What is happening with the 1951/52 National Cup?
  2. What differences do you notice between the data from the Ground Truth Table and Model Prediction Table?
  3. What is the relationship among the three tables here?
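For question 2, comparing a predicted table against ground truth can be scored cell by cell. This is only an illustrative metric, not the benchmark's official one.

  def cell_accuracy(truth: list[list[str]], pred: list[list[str]]) -> float:
      # Every ground-truth cell counts toward the denominator; rows or
      # cells missing from the prediction simply score zero.
      total = sum(len(row) for row in truth)
      correct = 0
      for i, t_row in enumerate(truth):
          if i >= len(pred):
              continue
          for j, t_cell in enumerate(t_row):
              if j < len(pred[i]) and \
                 t_cell.strip().lower() == pred[i][j].strip().lower():
                  correct += 1
      return correct / total if total else 0.0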

Level 5: A combination of tables

Here we have a low-res image of an invoice: effectively several small tables (billing header, line items, totals) combined in one document.

Key questions:

  1. What is the Balance Due?
  2. When is the Due Date?
  3. How much of a discount did we get on this order?

While this is a harder task than the previous ones, many models have been fine-tuned on invoice-related tasks and thus might score higher.
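Grading numeric answers like the Balance Due needs some normalization, since models return strings like "$1,234.50". A sketch, with cleanup rules that are my own illustrative assumptions:

  import re

  def grade_numeric(pred: str, gold: float, tol: float = 0.01) -> bool:
      # Strip currency symbols, thousands separators, and whitespace
      # before comparing against the gold value.
      cleaned = re.sub(r"[^\d.\-]", "", pred)
      try:
          return abs(float(cleaned) - gold) <= tol
      except ValueError:
          return False

  print(grade_numeric("$1,234.50", 1234.50))  # True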

Level 6: Chart

Here the task is to extract the values out of the chart and populate a table as faithfully as possible.
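One way to frame this task is to prompt the model for a markdown table and parse it. The prompt wording and the parser below are my own assumptions about how the task could be posed, not the benchmark's protocol.

  PROMPT = ("Read the values from this chart and return them as a "
            "markdown table, one row per category. Use N/A for values "
            "you cannot read.")

  def parse_markdown_table(text: str) -> list[list[str]]:
      # Keep only lines that look like markdown table rows, dropping
      # the |---|---| separator row.
      rows = []
      for line in text.splitlines():
          stripped = line.strip()
          if not stripped.startswith("|"):
              continue
          cells = [c.strip() for c in stripped.strip("|").split("|")]
          if all(set(c).issubset("-: ") for c in cells):
              continue  # separator row
          rows.append(cells)
      return rows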

Level 7: Messy handwritten tables

In many developing countries it is not rare to see handwritten tables on invoices, receipts, annotations, and notes. A classic OCR baseline, sketched after the questions, typically fails here.

The model has to answer from this table:

  1. What was the cost of the soap stone?
  2. How many cutting nippers were bought?
  3. What was the total amount?
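As a point of comparison, here is the OCR baseline with pytesseract; "receipt.jpg" is a placeholder filename. Tesseract generally performs poorly on handwriting, which is exactly what makes this level hard for pipeline approaches.

  from PIL import Image
  import pytesseract

  # Classic OCR baseline; expect garbled output on messy handwriting.
  text = pytesseract.image_to_string(Image.open("receipt.jpg"))
  print(text)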


I'm pulling this content from a variety of industries across the world, in multiple languages, multiple domains, and multiple levels of difficulty.

I will release the final benchmark in 2 weeks. In the meantime, is there a type of table you would like to see in this benchmark? Leave your answer in the comments.


