3. We have multilingual documents, so why not design a model that attends to the language features during fine-tuning instead of pre-training?

Most existing models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding - Official Paper

1. The problem of language in Document Understanding

  • Around us, we have documents of many kinds, such as forms, receipts, and magazines, and in many languages, such as English, Hindi, and others.
  • For any document, the fundamental blocks that make it up are the layout (the location and orientation of the words on the page), the semantics/textual content (the meaning of the words and sentences), and the image.
  • So, while devising an algorithm, we need to consider this set of features to model the document-related task (for example, Document Classification, Token Classification, etc.). In the current scenario, we first pre-train the model on a large collection of unlabelled documents with tasks such as Masked Language/Image Modeling, so that the model focuses on understanding and fusing these features properly. Then, we fine-tune the model by removing the last layer and attaching a task-specific linear layer (a minimal sketch of this recipe is shown right after this list).
  • Here comes an issue: when the language (i.e. the semantic features) is part of pre-training, a language bias is carried over into fine-tuning. The model performs better on fine-tuning tasks in the specific language it was pre-trained on, but that bias hurts when the same model is extended to other languages.
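
To make this pre-train-then-fine-tune recipe concrete, here is a minimal PyTorch sketch. The DocumentEncoder backbone, the dimensions, and the label count are illustrative placeholders rather than the architecture of any particular model; the point is only that pre-training attaches a masked-language-modeling head, while fine-tuning swaps it for a task-specific linear layer.

```python
import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    """Stand-in for a layout-aware transformer backbone (text + 2D box features)."""
    def __init__(self, vocab_size=30522, hidden=768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.box_emb = nn.Linear(4, hidden)  # normalized (x0, y0, x1, y1) per token
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, input_ids, bboxes):
        h = self.word_emb(input_ids) + self.box_emb(bboxes)
        return self.encoder(h)  # (batch, seq_len, hidden)

backbone = DocumentEncoder()

# Pre-training: a masked-language-modeling head on top of the backbone.
mlm_head = nn.Linear(768, 30522)

# Fine-tuning: drop the MLM head and attach a task-specific linear layer,
# e.g. token classification over 7 hypothetical entity labels.
token_cls_head = nn.Linear(768, 7)

input_ids = torch.randint(0, 30522, (1, 16))
bboxes = torch.rand(1, 16, 4)
logits = token_cls_head(backbone(input_ids, bboxes))  # (1, 16, 7)
```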

Almost all of them only focus on pre-training and fine-tuning the documents in a single language, typically English. This is extremely limited for other languages, especially in the case of lacking pre-training structured document data - Official Paper

2. The problem has been formulated, so how do we proceed?

[Figure: Attentions of LiLT]

  • We have identified the problem: we want to extend the model to multilingual documents, but pre-training ties the model to a specific language. So, let us try to solve this. The intuition is to focus on the layout during pre-training and keep that part of the model language-independent, so that it learns the layout features well, and then reuse those weights for fine-tuning, where the semantics of the document are taken into account as well.

When the layout structure remains unchanged, the substitution of language does not make obvious unnaturalness. It fully motivates us to decouple and reuse the layout invariance among different languages. - Official Paper

  • This is the idea adopted in the paper "LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding". Most of the model remains the same as the original Transformer from "Attention Is All You Need"; the novelty lies in the attention mechanism.
  • As can be seen from the figure above, during pre-training the net attention for the layout stream is the layout attention plus the detached textual attention, while the net attention for the textual stream is the sum of the two individual attentions (layout and textual).

We adopt the detached version while calculating the layout attention so that the textual stream will not be affected by the gradient of the non-textual one during pre-training and its overall consistency can be preserved. - Official Paper

  • The intuition is that during pre-training we do not want the layout weights to be modified by the language features, but during fine-tuning we do want the textual features to contribute to the layout weights as well (see the sketch below).
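
Below is a minimal sketch of how I read this score sharing (my own illustration based on the description above, not the official implementation): each stream computes its own scaled dot-product attention scores, the layout stream adds the textual scores (detached during pre-training), and the textual stream adds the layout scores.

```python
import math
import torch

def biacm_scores(q_t, k_t, q_l, k_l, pretraining=True):
    """Toy version of the shared attention scores for the textual (t) and layout (l) streams."""
    d_t, d_l = q_t.size(-1), q_l.size(-1)
    alpha_t = q_t @ k_t.transpose(-1, -2) / math.sqrt(d_t)  # textual attention scores
    alpha_l = q_l @ k_l.transpose(-1, -2) / math.sqrt(d_l)  # layout attention scores

    # During pre-training the textual scores are detached before being added to the
    # layout stream, so layout gradients do not flow back into the textual stream.
    shared_t = alpha_t.detach() if pretraining else alpha_t
    layout_scores = alpha_l + shared_t   # layout stream: layout + (detached) textual
    text_scores = alpha_t + alpha_l      # textual stream: textual + layout
    return text_scores.softmax(-1), layout_scores.softmax(-1)

# Toy usage: one head, 16 tokens, textual dim 64 and layout dim 24 (arbitrary numbers).
q_t, k_t = torch.rand(1, 16, 64), torch.rand(1, 16, 64)
q_l, k_l = torch.rand(1, 16, 24), torch.rand(1, 16, 24)
attn_text, attn_layout = biacm_scores(q_t, k_t, q_l, k_l, pretraining=True)
```

The detach() call is exactly the "detached version" mentioned in the quote above: it stops gradients originating in the layout stream from reaching the textual parameters during pre-training.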

3. How about the pre-training tasks?

  • The authors propose three self-supervised pre-training tasks to guide the model to autonomously learn joint representations with cross-modal cooperation. More on this can be found in the paper mentioned in the references.
  • One of them is Masked Visual-Language Modeling, which masks some of the text tokens and asks the model to predict them, with the loss computed via cross-entropy (a minimal masking sketch follows this list). Another is Key Point Location, wherein the whole document image is divided into a set of regions and the model is required to predict which region the key points of each box belong to. The last one is Alignment Identification, wherein the model is asked whether the token/masked-box pairs are aligned with the given document image or not.
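
As a rough illustration of the masked-modeling objective, here is a minimal masking sketch in PyTorch. The masking probability, mask token id, and vocabulary size are arbitrary placeholders, and the paper's exact masking strategy may differ; the point is that masked positions are predicted and scored with cross-entropy while unmasked positions are ignored.

```python
import torch
import torch.nn.functional as F

def mask_tokens(input_ids, mask_token_id=103, mask_prob=0.15):
    """Randomly replace tokens with a [MASK] id; unmasked positions get label -100."""
    labels = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~mask] = -100                      # ignored by cross_entropy below
    masked_ids = input_ids.clone()
    masked_ids[mask] = mask_token_id
    return masked_ids, labels

input_ids = torch.randint(0, 30522, (2, 16))  # toy batch of token ids
masked_ids, labels = mask_tokens(input_ids)

# After the encoder and a vocabulary head produce logits of shape (batch, seq, vocab):
logits = torch.rand(2, 16, 30522)
loss = F.cross_entropy(logits.view(-1, 30522), labels.view(-1), ignore_index=-100)
```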

4. My Experiments and attention visualization:

[Figure: Original document image]
[Figure: Extracted words and bounding boxes]
[Figure: Textual Attention]
[Figure: Layout Attention]


  • I implemented the whole model and, without pre-training, used it for training and testing on a sample of the RVL-CDIP dataset. The whole set of implementations and notebooks can be found in my GitHub repo mentioned in the references. In the first two figures, we can see the original document image and the extracted words and bounding boxes corresponding to it.
  • While we won't discuss the whole training procedure and the results, I would like to show the attention of the first layer of the transformer for the sample image described in the figures above, along with its OCR. Since there are 12 heads, we will look at the attention of just one head for each feature.
  • The third figure shows the textual attention and the fourth figure shows the layout attention. As we can see in the textual attention, the later parts of a word (refer to the Y-axis of the third figure) have a strong association with the earlier parts of the word (refer to the X-axis of the third figure). Similarly, the later bounding boxes (corresponding to later words, refer to the Y-axis of the fourth figure) have a strong association (lighter color) with the earlier bounding boxes. The same interpretation can be carried out for other heads in different hidden layers; a plotting sketch is shown below.
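
For readers who want to reproduce such plots, here is a minimal sketch of how one attention head can be visualized, assuming the model returns per-layer attention tensors of shape (batch, heads, seq, seq). The attribute names in the usage comments are hypothetical rather than the exact API of my repository; the notebook linked in the references contains the actual code.

```python
import matplotlib.pyplot as plt

def plot_head(attentions, layer=0, head=0, title="Textual Attention"):
    """attentions: list of tensors, one per layer, each of shape (batch, heads, seq, seq)."""
    attn = attentions[layer][0, head].detach().cpu().numpy()
    plt.imshow(attn, cmap="viridis")           # lighter color = stronger association
    plt.title(f"{title} - layer {layer}, head {head}")
    plt.xlabel("key position")
    plt.ylabel("query position")
    plt.colorbar()
    plt.show()

# Hypothetical usage, assuming the model exposes the two attention streams:
# outputs = model(input_ids, bboxes, output_attentions=True)
# plot_head(outputs.text_attentions, layer=0, head=0, title="Textual Attention")
# plot_head(outputs.layout_attentions, layer=0, head=0, title="Layout Attention")
```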

5. Conclusion:

  • In this article, we discussed the problem of pre-training on a single language and then discussed the paper "LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding", which lets SDU (Structured Document Understanding) tasks enjoy language-independent benefits from pre-training on the document layout structure.
  • The paper tackles the problem of Document Understanding with a novel attention mechanism, which is really helpful for extending the model to multilingual documents.

6. References:

  1. LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding: 2022.acl-long.534.pdf (aclanthology.org)
  2. My Implementation: uakarsh/LiLT: My Implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (github.com)
  3. Attention Visualization: LiLT/LiLT_5. Visualizing the attentions.ipynb at main · uakarsh/LiLT (github.com)
  4. Image Credits: Google and other sources

COMMENTS AND FEEDBACK ARE MOST WELCOME. THANKS FOR READING TILL HERE.
