OmniAI's posts

OmniAI reposted this

Tyler Maran

CEO OmniAI (YC W24) | best code slinger this side of the Mississippi

We know LLMs can read a PDF, but are they any good at it? Turns out you need a lot of PDFs to answer that question!

Let's talk document extraction. Everyone's building some variant of this (ourselves included), but how do you know that it actually works? Is the LLM 70% accurate, 90% accurate, better than a human?

Turns out it's a pretty tough thing to benchmark. Mostly because you need a LOT of correctly annotated documents, and those aren't easy to find. Especially when it comes to the types of real world document problems that LLMs are supposed to be solving.

Traditional OCR benchmarks are looking at text similarity on things like textbooks or receipts. But that falls short in a couple of ways.

1. The data sets aren't representative of real world data.
2. Those benchmarks are more focused on character recognition than structured extraction. And just having all the letters off the page doesn't get you very far.

So let's build a better benchmark!

Next week we'll launch our open source VML benchmark. This will include a full validation dataset consisting of:

- Document images
- Markdown
- JSON Schema
- Validated JSON response

As I mentioned above, our main goal is LLM-based document extraction. To benchmark this, we'll be validating the two most common patterns.

The first is the most common workflow: you run OCR on a document, and pass the resulting text to an LLM along with the JSON schema for extraction. We will only be evaluating providers on their OCR accuracy, and using GPT-4o structured output for the extraction.

The second is direct extraction: for multimodal LLM providers, we will run a separate test of direct extraction without the OCR step (e.g. GPT 4o & Anthropic PDF).

So far we've added the following providers:

- OmniAI (of course!)
- Azure Document Intelligence
- AWS Textract
- Google Document AI
- Unstructured
- GPT 4o Vision
- Claude Sonnet 3.5
- Llama 3.3 Vision
- Deepseek R1

Let me know in the comments if you want someone else added to the list.
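
As a rough illustration of how an extracted JSON response might be scored against the validated JSON response, here is a minimal Python sketch. The post does not say how the benchmark computes accuracy, so the exact-match, per-field metric below is an assumption. The names `flatten` and `field_accuracy` and the sample invoice data are hypothetical.

```python
from typing import Any, Dict


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into a {path: leaf_value} mapping."""
    if isinstance(value, dict):
        items = [(f"{prefix}.{key}" if prefix else str(key), child)
                 for key, child in value.items()]
    elif isinstance(value, list):
        items = [(f"{prefix}[{index}]", child)
                 for index, child in enumerate(value)]
    else:
        # Leaf value: a string, number, boolean, or None.
        return {prefix: value}

    flat: Dict[str, Any] = {}
    for path, child in items:
        flat.update(flatten(child, path))
    return flat


def field_accuracy(expected: Any, predicted: Any) -> float:
    """Share of leaf fields in `expected` that `predicted` matches exactly.

    Assumed metric for illustration only; a real benchmark may tolerate
    formatting differences or weight fields differently.
    """
    truth = flatten(expected)
    guess = flatten(predicted)
    if not truth:
        return 1.0
    correct = sum(
        1 for path, value in truth.items()
        if path in guess and guess[path] == value
    )
    return correct / len(truth)


# Hypothetical validated JSON and model output for one document.
expected = {
    "invoice_number": "INV-001",
    "total": 120.5,
    "line_items": [{"sku": "A1", "qty": 2}],
}
predicted = {
    "invoice_number": "INV-001",
    "total": 125.0,  # wrong value
    "line_items": [{"sku": "A1", "qty": 2}],
}

score = field_accuracy(expected, predicted)  # 3 of 4 leaf fields match: 0.75
```

The same scoring function could be applied to both patterns described in the post (OCR followed by LLM extraction, and direct extraction), since both end in a JSON response that is compared with the validated one.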

Tyler Maran

CEO OmniAI (YC W24) | best code slinger this side of the Mississippi

3 weeks ago

Sneak peek: Deepseek R1 is not super impressive at document extraction. About the same accuracy as 4o. And if you want to read a bit about the process you can check it out here: https://getomni.ai/blog/infinite-pdf-generator

Daniel Emaasit

Building The AI Assistant for Global Supply Chain

3 weeks ago

PaddleOCR

Alex Cardell

Director, Strategic Finance at Synovus | Divisional Finance Leader w/ Power BI, GenAI, & Data Analytics toolset

3 weeks ago

Qwen VL 2.5 and the colpali & colqwen approaches.

Victor Zhang

Document processing | Data & AI

3 weeks ago

Super relevant as PDFs are one of the most prominent document formats (if not the most prominent). For any LLMs to get the workflow right, they have to be able to extract the information from PDFs right. Let alone talking about AI agents to be built on top of them. Looking forward to the benchmark!

Simonas J.

AI / MLOps Engineering

3 weeks ago

Can you add Docling and Gemini models to the evaluation?

Brendan Ashworth

Co-Founder @ Bunting Labs | MIT Physics + AI

3 weeks ago

But don't you train on the infinite pdf generator data? Isn't that just testing on your training set? Seems like not a fair comparison considering none of textract/4o/claude train on it

Igor Akimov

AI Solutions Product Manager | Help businesses grow with AI

2 weeks ago
Ben Cheng

Oursky, FormX.ai, Authgear.com

3 weeks ago

That is interesting. We at FormX.ai do it every 2 - 3 months as well, but mainly with the datasets we collected or with permissions from clients. Wondering how you create the dataset?

We also found that results vary greatly depending on:

- type of documents
- whether various optimizations are applied to the OCR result text or not

Maybe there is something we could work on together :)

John Kalfayan

Building AI & Energy Products | Data Plumber | Startup Masochist | EnergyBytes Podcast Host | Weekly posts about Startups, Data, LLMs, and Energy

5 days ago

LlamaParse, ColPali and Qdrant VLM - https://github.com/qdrant/demo-colpali-optimized, Gemini Flash 2
