The Foundation Models of Document Understanding

I read a very interesting article in The Economist's June 11th, 2022 issue dedicated to AI. The article is called "The World that Bert Built", and I highly recommend it. It describes the development of foundation models.

Definition of a foundation model (from Wikipedia):

"A?foundation model?is a large?artificial intelligence?model trained on a vast quantity of unlabeled data at scale (usually by?self-supervised learning) resulting in a model that can be adapted to a wide range of downstream tasks. Foundation models are behind a major transformation in how AI systems are built since their introduction in 2018. Early examples of foundation models were large pre-trained language models including?BERT?and?GPT-3. Subsequently, several multimodal foundation models have been produced including?DALL-E, Flamingo,?and Florence.?The Stanford Institute for Human-Centered Artificial Intelligence's (HAI) Center for Research on Foundation Models (CRFM) popularized the term." (BTW Stanford HAI gave this name to these model types.)

These are models that require immense resources and hyper-fast computers. And what do they basically do? They build something like the "common sense" of AI. The idea behind them is to train AI models with more and more parameters. A few of the best-known experiments with these super-large models are Microsoft's Florence model and GPT-3, a model made by OpenAI.

Unfortunately, these models also bring ethical challenges with them; you can read about these in the link at the end of this article.

Another important development related to these foundation models is that they are becoming multimodal. Multimodal machine learning is a multi-disciplinary research field which addresses some of the original goals of artificial intelligence by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. This is exactly how humans reason: by using multiple communication channels.
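For readers who like to see the idea in code, here is a minimal, illustrative sketch of "late fusion" of two modalities in PyTorch: two off-the-shelf encoders are assumed to produce fixed-size text and image embeddings, which are projected into a shared space and classified together. The class name, dimensions, and random inputs are all made up for illustration; this is not how any particular foundation model is built.

import torch
import torch.nn as nn

class SimpleMultimodalClassifier(nn.Module):
    """Toy late-fusion classifier over precomputed text and image embeddings."""

    def __init__(self, text_dim=768, image_dim=512, num_classes=10):
        super().__init__()
        # Project both modalities into a shared 256-dimensional space,
        # then classify the concatenated representation.
        self.text_proj = nn.Linear(text_dim, 256)
        self.image_proj = nn.Linear(image_dim, 256)
        self.classifier = nn.Linear(256 * 2, num_classes)

    def forward(self, text_emb, image_emb):
        t = torch.relu(self.text_proj(text_emb))
        v = torch.relu(self.image_proj(image_emb))
        fused = torch.cat([t, v], dim=-1)  # simple concatenation fusion
        return self.classifier(fused)

# Toy usage with random tensors standing in for real encoder outputs.
model = SimpleMultimodalClassifier()
text_emb = torch.randn(4, 768)   # e.g. pooled text-encoder embeddings
image_emb = torch.randn(4, 512)  # e.g. pooled vision-encoder features
logits = model(text_emb, image_emb)
print(logits.shape)  # torch.Size([4, 10])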

I am sure this concept will be applied in the Document Understanding space. Some companies are already doing similar things at a smaller scale.

Companies will start ingesting every document type possible, reading repositories, archives, etc., and building super-large models. In semantic processing, an earlier attempt of this kind was DBpedia, a semantic database obtained by analyzing Wikipedia. Building these models is extremely expensive: you need access to supercomputers and to an immense quantity of data. In the current setup this will not be accessible to every company, due to the high cost of these resources.

[Slide from the Stanford HAI Spring Conference showing applications of foundation models]

On the slide above, from the HAI Spring Conference, you can see the broad range of applications these models can be used for.

Such a DU foundation model would be able to classify any document on earth, written in any language. The more documents you process, the better your model becomes. Ideally, a consortium of companies would create an open-source DU foundation model that allows every company to tap into this "DU common sense" and focus on specific documents at a granularity the foundation model did not cover. In other words, every company that builds a model will inherit the common sense of document processing, as the sketch after this paragraph illustrates.
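As a concrete (and deliberately simplified) illustration of "inherit the common sense, then specialize", the sketch below fine-tunes a pretrained checkpoint on a handful of invented document labels using the Hugging Face transformers library. Note that "bert-base-uncased" is only a stand-in for a hypothetical open-source DU foundation model, the label set is made up, and a real document model would also consume layout and image features, not just text.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Invented label set; a real deployment would use the company's own taxonomy.
labels = ["invoice", "purchase_order", "receipt", "contract"]

# "bert-base-uncased" stands in for a hypothetical DU foundation model checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# One training step on a single toy example; a real setup would iterate
# over a labeled dataset with a DataLoader for several epochs.
text = "Invoice #1234. Total due: $560.00 by July 31."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
target = torch.tensor([labels.index("invoice")])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
outputs = model(**inputs, labels=target)
outputs.loss.backward()
optimizer.step()
print("training loss:", float(outputs.loss))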

There are already companies that have done this in the space of unstructured document processing. Others already provide universal models for any fixed-form document, but you still need to map the extraction results to the fields you care about in a document (these field sets are usually called taxonomies); a toy example of such a mapping follows below.
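To make the taxonomy-mapping step concrete, here is a toy Python illustration. Every field name on both sides is invented; real universal extractors and company taxonomies will look different.

# Output of a hypothetical universal extractor (field names are made up).
universal_extraction = {
    "doc_type": "invoice",
    "total_amount": "560.00",
    "issue_date": "2022-07-01",
    "vendor_name": "Acme Corp",
}

# Company-specific taxonomy: which internal field each generic field feeds.
taxonomy_mapping = {
    "total_amount": "AP.InvoiceTotal",
    "issue_date": "AP.InvoiceDate",
    "vendor_name": "AP.SupplierName",
}

# Keep only the fields the taxonomy cares about, renamed to internal names.
mapped = {
    taxonomy_mapping[field]: value
    for field, value in universal_extraction.items()
    if field in taxonomy_mapping
}
print(mapped)
# {'AP.InvoiceTotal': '560.00', 'AP.InvoiceDate': '2022-07-01', 'AP.SupplierName': 'Acme Corp'}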


UiPath just introduced Forms AI, which is a small step in this direction.

So the future of DU is definitely bright, and this domain will become a commodity in the not-too-distant future.


Some links for this subject:

https://venturebeat.com/2021/08/18/foundation-models-risk-exacerbating-mls-ethical-challenges/
