What is a transformer?
馬Antony 裕杰
One reason ChatGPT is so powerful is that it uses a neural network architecture called the Transformer, proposed in the 2017 paper "Attention Is All You Need" by Google researchers. Two features make Transformers powerful: first, they are designed to process sequential data, such as the words of a sentence, in parallel, which accelerates training. Second, the paper introduced an attention mechanism that captures the importance and context of words, rather than just their order in the sequence.
While I only know the basics of the Transformer architecture, I will attempt to explain the essential elements without delving into too much detail.
In my last post about text embedding, I used an English dictionary to explain the concept of text embedding. We can represent words and sentences as multi-dimensional vectors. The true value of this representation lies in the fact that each word exists in relation to other words, and it is those relationships that carry meaning. Publishers arrange words alphabetically in a dictionary to make them easy for readers to find. However, this alphabetical order does not hold any semantic value. Starting with A and ending with Z is simply an efficient organization and search method for humans.
On the other hand, a dictionary for a computer neural network does not need to adhere to alphabetical order. If we were to ask a neural network to create a dictionary using only the text from the Bible, this dictionary would represent all the relationships between words in the Bible.
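To make this idea of a vector "dictionary" concrete, here is a minimal sketch in Python. The tiny 3-dimensional vectors and four-word vocabulary are invented for illustration only; a real model learns vectors with hundreds or thousands of dimensions from its training text, but the lookup principle is the same: words are found by closeness in meaning, not by alphabetical order.

```python
import numpy as np

# Toy "dictionary": each word is a point in a shared vector space.
# These numbers are made up for illustration; a real model learns them.
vocab = {
    "jesus":     np.array([0.90, 0.80, 0.10]),
    "messiah":   np.array([0.88, 0.82, 0.12]),
    "baptist":   np.array([0.40, 0.70, 0.30]),
    "bethlehem": np.array([0.10, 0.20, 0.95]),
}

def cosine_similarity(a, b):
    """Words whose vectors point in similar directions are related."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(word, k=2):
    """Look a word up by closeness in meaning rather than by spelling."""
    query = vocab[word]
    scores = {w: cosine_similarity(query, v) for w, v in vocab.items() if w != word}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(nearest("jesus"))   # "messiah" should rank above "baptist"
```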
A Transformer can be likened to a dictionary assistant, capable of searching this new dictionary using multi-dimensional vectors. To handle user queries, a Transformer consists of an encoder and a decoder.
For instance, if you were to use this dictionary to find the answer to the question, "Where was the Son of God born?", the computer must understand that "Son of God" should be treated as one unit and is equivalent to "Jesus the Messiah" rather than "John the Baptist." The encoder provides the context, importance, and relationships of words according to the biblical text. One of the techniques the encoder relies on for this is called Positional Encoding.
"Positional Encoding: In natural language processing, the order of words in a sentence is crucial for determining the sentence’s meaning. However, traditional machine learning models, such as neural networks, do not inherently understand the order of inputs. To address this challenge, positional encoding can be used to encode the position of each word in the input sequence as a set of numbers. These numbers can be fed into the Transformer model, along with the input embeddings. By incorporating positional encoding into the Transformer architecture, GPT can more effectively understand the order of words in a sentence and generate grammatically correct and semantically meaningful output."
The encoder generates a vector that stores all of these contextual values, which is then passed on to the decoder. The decoder processes these contextual values to determine the appropriate output. For instance, the words "where" and "born" point to a location. Both "Bethlehem" and "a stable" are possible answers, and the decoder has to decide which is more relevant to the input. It is crucial to note that the output does not rely solely on the sequence of words but also takes their context into account, allowing for a more accurate and nuanced understanding of the text.
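This weighing of context is done by the attention mechanism mentioned at the start. Below is a minimal NumPy sketch of scaled dot-product attention, the core formula from the 2017 paper. The query, key, and value matrices here are random stand-ins rather than anything learned from the Bible example; the point is only to show how each word's output becomes a context-weighted mix of the other words.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Each query scores every key; the scores become weights over the values,
    so the output for a word is a context-weighted mix of all the words.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # relevance of every word to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V, weights

# Random stand-ins: 5 tokens ("where", "was", "son", "god", "born"), 4-dim vectors.
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # each row sums to 1: how much each word attends to the others
```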
The GPT-3 model behind ChatGPT stacks 96 Transformer layers (GPT-style models use only the decoder side of the architecture). This layered attention mechanism, together with the ability to process sequential data in parallel, contributes to the exceptional performance of models like ChatGPT, making the Transformer a vital topic of discussion among IT professionals.
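To give a sense of what "layers of encoders and decoders" means in code, here is a small sketch using PyTorch's built-in nn.Transformer. The sizes below are the original paper's configuration, not GPT-3's; GPT-3 stacks 96 much wider decoder-style layers, so this is only meant to show how layers are stacked, not to reproduce GPT-3.

```python
import torch
import torch.nn as nn

# The original paper's configuration: 6 encoder layers and 6 decoder layers,
# 512-dimensional embeddings, 8 attention heads. GPT-3 scales this idea up.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
)

# Dummy inputs: a source sentence of 10 tokens and a target of 7 tokens,
# batch size 1, already turned into 512-dimensional embeddings.
src = torch.randn(10, 1, 512)   # (source length, batch, d_model)
tgt = torch.randn(7, 1, 512)    # (target length, batch, d_model)
out = model(src, tgt)
print(out.shape)                # torch.Size([7, 1, 512])
```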
PS:
If you are too busy to keep up with what is happening in the AI industry globally, below are some links I found both insightful and likely to have long-term impact.
2. 10 Reasons to Ignore AI Safety
Antony, great job in making things simple. Your explanation of text embedding brought to mind the controversy surrounding the slow publication of the Dead Sea Scrolls. Two scholars reconstructed texts of the unpublished scrolls using information from published concordances (listings of every word in a text and their immediate context). Up till then, all these cross-references and indexes were manually collated; I believe this was the first instance a computer program was used to "recover" the text from the references. That was in 1991, and computers have progressed by leaps and bounds since.