3. We have multilingual documents, so why not design a model that attends to the language features during fine-tuning instead of pre-training?

Most existing models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding - Official Paper

1. The problem of language in Document Understanding

  • Around us, we have documents of many kinds, such as forms, receipts, and magazines, and in many languages, such as English, Hindi, and others.
  • For any document, the fundamental blocks that make it up are the layout (the location and orientation of the words on the page), the semantics/textual content (the meaning of the words and sentences), and the image.
  • So, while devising an algorithm, we need to consider this set of features to model the document-related task (for example, Document Classification, Token Classification, etc.). In the current scenario, we first pre-train the model on a large collection of unlabelled documents with tasks such as Masked Language/Image Modeling, so that the model focuses on understanding and fusing these features properly. Then, we fine-tune the model by removing the last layer and attaching a task-specific linear layer (a minimal sketch of this recipe is shown right after this list).
  • Here comes an issue: when the language (i.e. the semantic features) is part of pre-training, a language bias is carried over into fine-tuning. The model performs better on fine-tuning tasks in the specific language it was pre-trained on, but that bias hurts when the same model is extended to other languages.
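
To make this pre-train-then-fine-tune recipe concrete, here is a minimal PyTorch sketch. The DocumentEncoder backbone, the dimensions, and the label count are illustrative placeholders rather than the architecture of any particular model; the point is only that pre-training attaches a masked-language-modeling head, while fine-tuning swaps it for a task-specific linear layer.

```python
import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    """Stand-in for a layout-aware transformer backbone (text + 2D box features)."""
    def __init__(self, vocab_size=30522, hidden=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.box_emb = nn.Linear(4, hidden)  # normalized (x0, y0, x1, y1) per token
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, input_ids, bboxes):
        h = self.word_emb(input_ids) + self.box_emb(bboxes)
        return self.encoder(h)  # (batch, seq_len, hidden)

backbone = DocumentEncoder()

# Pre-training: a masked-language-modeling head on top of the backbone.
mlm_head = nn.Linear(768, 30522)

# Fine-tuning: drop the MLM head and attach a task-specific linear layer,
# e.g. token classification over 7 hypothetical entity labels.
token_cls_head = nn.Linear(768, 7)

input_ids = torch.randint(0, 30522, (1, 16))
bboxes = torch.rand(1, 16, 4)
logits = token_cls_head(backbone(input_ids, bboxes))  # (1, 16, 7)
```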

Almost all of them only focus on pre-training and fine-tuning the documents in a single language, typically English. This is extremely limited for other languages, especially in the case of lacking pre-training structured document data - Official Paper

2. The problem has been formulated, so how do we proceed?

[Figure: Attentions of LiLT]

  • We have identified the problem: we want to extend the model to multilingual documents, but pre-training ties the model to a specific language. So, let us try to solve this. The intuition is to focus on the layout during pre-training and keep that part of the model language-independent, so that it learns the layout features well, and then reuse those weights for fine-tuning, where the semantics of the document are taken into account as well.

When the layout structure remains unchanged, the substitution of language does not make obvious unnaturalness. It fully motivates us to decouple and reuse the layout invariance among different languages. - Official Paper

  • This is the idea adopted in the paper "LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding". Most of the model remains the same as the original Transformer from "Attention Is All You Need"; the novelty lies in the attention mechanism.
  • As can be seen from the figure above, during pre-training the net attention for the layout stream is the layout attention plus the detached textual attention, while the net attention for the textual stream is the sum of the two individual attentions (layout and textual).

We adopt the detached version while calculating the layout attention so that the textual stream will not be affected by the gradient of the non-textual one during pre-training and its overall consistency can be preserved. - Official Paper

  • The intuition is that during pre-training we do not want the layout weights to be modified by the language features, but during fine-tuning we do want the textual features to contribute to the layout weights as well (see the sketch below).
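
Below is a minimal sketch of how I read this score sharing (my own illustration based on the description above, not the official implementation): each stream computes its own scaled dot-product attention scores, the layout stream adds the textual scores (detached during pre-training), and the textual stream adds the layout scores.

```python
import math
import torch

def biacm_scores(q_t, k_t, q_l, k_l, pretraining=True):
    """Toy version of the shared attention scores for the textual (t) and layout (l) streams."""
    d_t, d_l = q_t.size(-1), q_l.size(-1)
    alpha_t = q_t @ k_t.transpose(-1, -2) / math.sqrt(d_t)  # textual attention scores
    alpha_l = q_l @ k_l.transpose(-1, -2) / math.sqrt(d_l)  # layout attention scores

    # During pre-training the textual scores are detached before being added to the
    # layout stream, so layout gradients do not flow back into the textual stream.
    shared_t = alpha_t.detach() if pretraining else alpha_t
    layout_scores = alpha_l + shared_t   # layout stream: layout + (detached) textual
    text_scores = alpha_t + alpha_l      # textual stream: textual + layout
    return text_scores.softmax(-1), layout_scores.softmax(-1)

# Toy usage: one head, 16 tokens, textual dim 64 and layout dim 24 (arbitrary numbers).
q_t, k_t = torch.rand(1, 16, 64), torch.rand(1, 16, 64)
q_l, k_l = torch.rand(1, 16, 24), torch.rand(1, 16, 24)
attn_text, attn_layout = biacm_scores(q_t, k_t, q_l, k_l, pretraining=True)
```

The detach() call is exactly the "detached version" mentioned in the quote above: it stops gradients originating in the layout stream from reaching the textual parameters during pre-training.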

3. How about the pre-training tasks?

  • The authors propose three self-supervised pre-training tasks to guide the model to autonomously learn joint representations with cross-modal cooperation. More on this can be found in the paper mentioned in the references.
  • One of them is Masked Visual-Language Modeling, which masks some of the text tokens and asks the model to predict them, with the loss computed via cross-entropy (a minimal masking sketch follows this list). Another is Key Point Location, wherein the whole document image is divided into a set of regions and the model is required to predict which region the key points of each box belong to. The last one is Alignment Identification, wherein the model is asked whether the token/masked-box pairs are aligned with the given document image or not.
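
As a rough illustration of the masked-modeling objective, here is a minimal masking sketch in PyTorch. The masking probability, mask token id, and vocabulary size are arbitrary placeholders, and the paper's exact masking strategy may differ; the point is that masked positions are predicted and scored with cross-entropy while unmasked positions are ignored.

```python
import torch
import torch.nn.functional as F

def mask_tokens(input_ids, mask_token_id=103, mask_prob=0.15):
    """Randomly replace tokens with a [MASK] id; unmasked positions get label -100."""
    labels = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~mask] = -100                      # ignored by cross_entropy below
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id
    return masked_ids, labels

input_ids = torch.randint(0, 30522, (2, 16))  # toy batch of token ids
masked_ids, labels = mask_tokens(input_ids)

# After the encoder and a vocabulary head produce logits of shape (batch, seq, vocab):
logits = torch.rand(2, 16, 30522)
loss = F.cross_entropy(logits.view(-1, 30522), labels.view(-1), ignore_index=-100)
```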

4. My Experiments and attention visualization:

[Figure: Original document image]
[Figure: Extracted words and bounding boxes]
[Figure: Textual Attention]
[Figure: Layout Attention]


  • I implemented the whole model and, without pre-training, used it for training and testing on a sample of the RVL-CDIP dataset. The whole set of implementations and notebooks can be found in my GitHub repo mentioned in the references. In the first two figures, we can see the original document image and the extracted words and bounding boxes corresponding to it.
  • While we won't discuss the whole training procedure and the results, I would like to show the attention of the first layer of the transformer for the sample image described in the figures above, along with its OCR. Since there are 12 heads, we will look at the attention of just one head for each feature.
  • The third figure shows the textual attention and the fourth figure shows the layout attention. As we can see in the textual attention, the later parts of a word (refer to the Y-axis of the third figure) have a strong association with the earlier parts of the word (refer to the X-axis of the third figure). Similarly, the later bounding boxes (corresponding to later words, refer to the Y-axis of the fourth figure) have a strong association (lighter color) with the earlier bounding boxes. The same interpretation can be carried out for other heads in different hidden layers; a plotting sketch is shown below.
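
For readers who want to reproduce such plots, here is a minimal sketch of how one attention head can be visualized, assuming the model returns per-layer attention tensors of shape (batch, heads, seq, seq). The attribute names in the usage comments are hypothetical rather than the exact API of my repository; the notebook linked in the references contains the actual code.

```python
import matplotlib.pyplot as plt

def plot_head(attentions, layer=0, head=0, title="Textual Attention"):
    """attentions: list of tensors, one per layer, each of shape (batch, heads, seq, seq)."""
    attn = attentions[layer][0, head].detach().cpu().numpy()
    plt.imshow(attn, cmap="viridis")           # lighter color = stronger association
    plt.title(f"{title} - layer {layer}, head {head}")
    plt.xlabel("key position")
    plt.ylabel("query position")
    plt.colorbar()
    plt.show()

# Hypothetical usage, assuming the model exposes the two attention streams:
# outputs = model(input_ids, bboxes, output_attentions=True)
# plot_head(outputs.text_attentions, layer=0, head=0, title="Textual Attention")
# plot_head(outputs.layout_attentions, layer=0, head=0, title="Layout Attention")
```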

5. Conclusion:

  • In this article, we discussed the problem of pre-training on a single language and then discussed the paper "LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding", which lets SDU (Structured Document Understanding) tasks enjoy language-independent benefits from pre-training on the document layout structure.
  • The paper tackles the problem of Document Understanding with a novel attention mechanism, which is really helpful for extending the model to multilingual documents.

6. References:

  1. LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding: 2022.acl-long.534.pdf (aclanthology.org)
  2. My Implementation: uakarsh/LiLT: My Implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (github.com)
  3. Attention Visualization: LiLT/LiLT_5. Visualizing the attentions.ipynb at main · uakarsh/LiLT (github.com)
  4. Image Credits: Google and other sources

COMMENTS AND FEEDBACK ARE MOST WELCOME. THANKS FOR READING TILL HERE.
