Introduction to Transformers - English version
As part of my thesis work, I recently delved into the transformer architecture. As I read the first papers in the field (Attention Is All You Need, ViT), I struggled to process the concrete ideas on which transformers are based into abstract ones.
As a result, I dug deeper into the subject, analysed code from a number of sources on the internet alongside reading the papers, and finally held a Q&A round with ChatGPT to sharpen my understanding.
After a week and a half of intensive research, I concluded that I wanted to share the knowledge I had accumulated with the world. I therefore decided to write a series of posts through which I can express my understanding, and perhaps shorten the process for those who cannot invest the time I did in this learning. As part of the series, I will cover the history of the field, the network's structure, its advantages and disadvantages, and how to train it.
In the second stage, I will cover computer vision papers that apply the network to tasks such as segmentation, image matting, body gait transfer, classification, etc.
To do this, I approached Michael Erlihson for scientific editing, and together we embarked on this journey.
Historical Overview
Transformers were initially proposed as a solution to the problem of text analysis. The first paper in the field (Attention Is All You Need) presented English-to-French translation and opened the gate to the NLP revolution we are seeing today.
Transformers introduced several new ideas to language analysis. The two main ones are intertwined in a single architecture and constitute the conceptual building blocks of the network (in addition to the other innovations the paper presented). The first is parallel processing of the input, which made training the model computationally efficient compared with previous models (RNN/LSTM/GRU) and, for the first time, broke the barrier of step-by-step learning over sequential input: dependencies in a sequence, both short- and long-range, can be learned in parallel. Note that transformers are still limited in how much input they can process in parallel, but this limitation depends on computational resources such as available memory and processing units (currently the maximum number of tokens is 1024 [3]).
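To make the parallelism point concrete, here is a minimal NumPy sketch, not an actual transformer layer: all dimensions and weight matrices are illustrative assumptions. It contrasts an RNN-style recurrence, which must walk through the sequence one step at a time, with the single matrix product that lets a transformer relate every pair of positions at once.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.standard_normal((seq_len, d))          # embeddings of a 6-token toy sequence

# RNN-style recurrence: step t needs the hidden state of step t-1,
# so this loop cannot be parallelized across time steps.
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Transformer-style: all pairwise token interactions (short- and long-range)
# fall out of one matrix product, computed for the whole sequence at once.
pairwise_scores = x @ x.T                      # shape (seq_len, seq_len), in one shot
```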
In addition, the network introduced two attention mechanisms. The first is self-attention, which lets the model focus selectively on the most important information. The second is cross-attention, which gives the model access to different parts of the source input for each output token (see the sketch below). These capabilities are essential in tasks such as translation, question answering, and text summarization, which require a deliberate selection of the most important parts of the input and output.
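The sketch below shows the common core of both mechanisms, scaled dot-product attention, in plain NumPy. It is not the paper's implementation; the function names, dimensions, and weight matrices are illustrative assumptions. The only difference between the two uses is where the queries and the keys/values come from.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_seq, context_seq, W_q, W_k, W_v):
    """Scaled dot-product attention.

    Self-attention:  query_seq and context_seq are the same sequence.
    Cross-attention: query_seq comes from the decoder, context_seq from the encoder,
                     so each output token can attend to different parts of the source.
    """
    Q = query_seq @ W_q                          # (n_q, d)
    K = context_seq @ W_k                        # (n_kv, d)
    V = context_seq @ W_v                        # (n_kv, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_q, n_kv) attention weights
    return weights @ V                           # (n_q, d) weighted mix of the context

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

source = rng.standard_normal((5, d))   # e.g. encoder states of the input sentence
target = rng.standard_normal((3, d))   # e.g. decoder states of the output so far

self_attn  = attention(source, source, W_q, W_k, W_v)   # sequence attends to itself
cross_attn = attention(target, source, W_q, W_k, W_v)   # decoder attends to the encoder
```

Notice that nothing in `attention` cares whether the query and context sequences are the same object; self- and cross-attention are one mechanism applied to two different pairs of inputs.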
These ideas excited me and made me want to understand what led to their development. To do so, I began reviewing the architectures that preceded transformers, how they worked, and why they did not succeed at the tasks that transformers did.