If "Attention is All You Need" Then "Recognition is Your First Want"

("Attention Is All You Need" is the title of a 2017 machine learning paper by Vaswani et al.)

Introduction

While researching on the academic front, I wondered what caused the sudden explosion in #GenerativeAI over the last few years. I had heard many stories about it, but one research paper with an odd title caught my attention: "Attention Is All You Need", which captured the revolutionary #GenAI moment of 2017 (Vaswani et al., 2017). Soon after, we saw BERT, OpenAI's GPT models, T5 and others take wing in the world of generative content, spanning language and text processing, image classification and more. Until then, progress had been a slow march from AI to ML to deep learning, but this paper triggered an exponential leap into GenAI, with LLMs creating what did not exist before in text, voice, pictures and video.

Source: Murgia, 2023

So when and how did this Attention paper change the dynamics? That became a serious subject of my research, one I wanted to explain in plain English, and my curiosity was sparked by this FT article last year (Murgia, 2023).

Hello World – Baby AI responds

I smiled at a crying baby during take-off, and it reacted with a smile followed by laughter in response to my seemingly happy gesture. I wondered which cognitive intelligence was at work, and whether the best man-made AI machines could someday respond like this to the stimulus of human emotional sensitivity. How did a small baby, within a few months of its birth, decipher a smile or a frowning face and respond so firmly? Which training data, algorithm or fundamental deep learning model could rightly predict such human behavioral patterns? Throughout the flight, the baby repeatedly learned to smile and look around for new expressions, much like a machine learning model (today's LLMs). It had no labeling of its parents' seating pattern, no attention context for a stranger, yet it slowly figured things out and responded with wisdom. It was time for me to compare, contrast, learn and unlearn accordingly, as I was interested in this subject beyond the normal song and dance of that little baby.

Discovering the Hello World of AI

While our biological neural networks rely on brain functions to recognize patterns, how can an AI-led neural network mimic them to recognize handwritten digits, or distinguish between a cat, a tiger and a lion? The human brain gets signals from a remarkably smart visual cortex that recognizes such distinctions easily. Using layers of interconnected neurons, the sensory processes of our central nervous system finally decide that the pixel data represents a particular object. Now imagine these artificial neural networks being inspired by, or replicating, the brain function of that baby to create the basic structure and algorithm of AI. Did you realize that in the last few lines we wrote a simple Hello World AI program, with no code at all?

Understanding Technical Fundamentals

Before appreciating the context of the 2017 Attention paper, a few concepts need to be appreciated, as they will widen our introspection and invigorate our curiosity on this subject.

The Neural Network Character

Let's first conceptualize neural networks as inspired by such biological neurons, using interconnected layers of filters to facilitate pattern recognition and emulate human cognition. Examples include the ability of neural networks to recognize digits from low-resolution images, or to process the image of a smile as sensory pixel data.


One critical introduction is to think of the non-biological neurons in the network as activators, each holding a real number between 0 and 1 that reflects its activation level based on the captured input.



This mathematical input serves as a foundation for how neural networks interpret data. For every pixel that an artificial baby AI program sees, there is a recognizable activator that applies a mathematical function at every layer based on the input, and bear in mind that there are thousands or even millions of them in one shot. A mathematical function assigns an output to each, with multiple filter levels and hidden layers holding context until we get an outcome that matches the expected result.
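
To make this concrete, here is a minimal sketch (in plain Python with NumPy, not any particular framework's API) of a single artificial neuron turning a flattened grayscale image into an activation between 0 and 1. The image size, random weights and sigmoid squashing are my own illustrative assumptions, not details from the Attention paper.

```python
import numpy as np

# A single artificial neuron: weighted sum of pixel inputs plus a bias,
# squashed into (0, 1). Sizes and weights here are illustrative only.
rng = np.random.default_rng(0)

pixels = rng.random(28 * 28)            # hypothetical 28x28 grayscale image, values in [0, 1]
weights = rng.normal(0, 0.05, 28 * 28)  # one weight per input pixel (randomly initialised)
bias = 0.0

weighted_sum = np.dot(weights, pixels) + bias
activation = 1.0 / (1.0 + np.exp(-weighted_sum))   # sigmoid keeps the value between 0 and 1

print(f"activation of this one neuron: {activation:.3f}")
```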

Weigh with bias as you process!

During this process, the network breaks the luminous neurons and their patterns into layers of abstraction, or parses speech to text by taking raw audio and picking out distinct sounds and characters across high, medium and low frequencies. What determines the character of the speech as a growl, noise, annoyance or shouting from tiny words broken into characters or syllables, and then reassembles it into the well-orchestrated thought as spoken or written earlier, without any loss of intent in transmission? An old saying tells us to weigh our words carefully before speaking, and in the same manner the layers weight the luminosity function before processing it. Just as we give more weight to a recognized face than to an unknown one, a sharp facial feature is processed differently and recognized as a particular pattern.

Essentially, at the micro level, every pixel of a character, picture or voice note gets a weight based on various parameters, and the mathematical models process it according to the weight attached. A deep learning neural network becomes a multi-layered transformation network, sending inputs to outputs via weighted connections, breaking patterns into sub-elements and recombining them like loops or systematically arranged plates. The layered architecture of neural networks allows for complex transformations of inputs into outputs, but adds a bias at every step to get exact outcomes, akin to how humans process information in stages and relearn through cognitive bias adjustments (Jacobs, 2003).
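
As a rough illustration of that layered weighting, here is a hedged sketch of a forward pass through a few fully connected layers. The layer sizes, the ReLU nonlinearity and the random initial weights are illustrative assumptions rather than anything prescribed in the article.

```python
import numpy as np

# A forward pass through a small stack of fully connected layers: each layer
# applies its weights and bias, then a nonlinearity, and hands the result on.
# Layer sizes and random weights are illustrative, not taken from the article.
rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

layer_sizes = [784, 16, 16, 10]   # e.g. pixel inputs -> two hidden layers -> 10 digit scores
weights = [rng.normal(0, 0.1, size=(m, n))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)        # weighted connections plus bias, then activation
    return a

scores = forward(rng.random(784))  # a fake flattened image
print(scores.shape)                # (10,)
```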

Inside the Mathematical Models

Representing weights and biases as matrix operations simplifies the description of neuron connections, making neural networks easier to implement and optimize in programming environments through simple nonlinear activation functions. The network is, in effect, looking at its loss function and trying to minimize the error of every outcome based on its training or test data. The functional code and mathematical representation evolved with activation functions that simplify these complex processes; modern networks often favor the rectified linear unit, ReLU (Talathi & Vartak, 2015), over the sigmoid function (Han & Moraga, 1995) for efficiency.

To explain, the sigmoid function is simply a logistic curve that pushes very negative inputs close to 0 and positive inputs close to 1. The activation of a neuron therefore measures the positivity of the relevant weighted sum (over potentially millions of weights) after this "squishification", with biases (at least in the thousands) shifting where the squeeze happens. The shift from sigmoid to ReLU activations in modern networks exemplifies the continual advancement of training techniques, enhancing the efficiency and performance of deeper architectures. The real genius lies in continuously resetting and reworking the weights and biases until the right algorithmic model comes out.
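
For the curious, a tiny sketch comparing the two activations; the sample inputs are arbitrary, and the point is only the shape of the squashing.

```python
import numpy as np

# Sigmoid versus ReLU: sigmoid squashes any weighted sum into (0, 1),
# while ReLU simply zeroes out negatives and is cheaper to compute.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])    # arbitrary sample weighted sums
print("sigmoid:", np.round(sigmoid(z), 3))   # [0.007 0.269 0.5   0.731 0.993]
print("relu:   ", relu(z))                   # [0. 0. 0. 1. 5.]
```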

Remember, the model is still susceptible to biased outcomes and can go horribly wrong, for example identifying your domesticated cat as a small leopard at home (Jones & Steinhardt, 2022). Efficiency is thus determined by the parameters, functions and measurements, and ultimately by the performance of the overall model, with each layer of neurons deciding what to drop and what to carry forward to the next layer.

Calculus is the Poetry of AI!

Nevertheless, I kept thinking about the efficiency and performance of the model. What is the least effort for the baby to recognize a smile and respond back with a smile? I found that it comes down to a cost function that measures the efficiency of the network. It adds another layer of mathematics on top of the weights and biases: it takes them as inputs and spits out a single number, and its negative gradient at each layer of every neural transmission tells us which nudge to any or all of these weights and biases causes the fastest change in the value of the cost function. Put simply, it tells us which changes, and which weights, matter most for the perfect outcome.

Technically, this means a process of repeatedly nudging (training) the inputs of a function by some multiple of the negative gradient, in what we call gradient descent. It is a way to converge towards some local minimum of a cost function, with the relative magnitudes of the gradient's components highlighting which changes matter more. The steepest descent along the gradient minimizes the error fastest; imagine the least costly, most optimized route down a mountain valley. Gradient descent is an optimization algorithm used to minimize the cost function in machine learning models: it iteratively adjusts the model parameters (weights and biases) to find the minimum value of the cost function.


A virtual representation of a gradient, created with DALL·E

The main idea is to move in the direction of steepest descent, which is given by the negative gradient of the cost function. The cost function, also known as the loss function, measures the error between the predicted output and the actual output and quantifies how well the model is performing. The goal of training a machine learning model is to minimize this cost function.
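
A toy sketch of gradient descent on a one-parameter quadratic cost, just to make the "steepest descent" intuition concrete; the cost function, starting point and learning rate here are illustrative choices, not a real network's loss.

```python
# Gradient descent on a toy one-parameter cost, C(w) = (w - 3)^2.
# The cost, starting point and learning rate are illustrative only.
def cost(w):
    return (w - 3.0) ** 2          # minimum sits at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # derivative of the cost with respect to w

w = 0.0                            # initial guess
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * grad(w)   # step against the gradient (downhill)

print(round(w, 4))                 # converges towards 3.0, the minimum
```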

Sharpening Backpropagation

Each neuron's influence is traced through multiple paths in a multilayer network, and understanding the algorithm that does this tracing is key to grasping how neural networks learn. Ample labeled training data is crucial for effective learning, and derivatives help minimize the cost through iterative adjustments. The specificity and sensitivity of these models in real life rest on backpropagation, the key algorithm that enables neural networks to learn by adjusting weights and biases based on training data, ultimately minimizing the cost function. The multiple layers of a neural network are handled via chain-rule expressions whose derivatives determine each component of the gradient, repeatedly stepping downhill towards a minimum. Backpropagation acts like a comb in the matted hair of the artificial baby brain: throughout the network it computes the steepest gradients to optimize weights and biases by feeding the results backwards, using the chain rule to calculate sensitivities across the network.

If datasets were labeled and structured, this worked well; for everything else it was a nightmare. Static backpropagation is used in static neural networks, such as spam filters, where data moves from input to output layers using static content like letter recognition or email addresses; dynamic or recurrent neural networks, which have loops or recursive connections, handle tasks like sentiment analysis or time-series predictions such as stock prices. Imagine learning the difference in pronouncing 'Timo' and 'Teemu' when talking about the letter T in the English alphabet. So when training a model to distinguish a cat from a leopard, a 3 from an 8, or a typed E from an F, we can imagine those calculus-driven cost functions being repeatedly fed with training data until the model memorizes the patterns and gives the correct outcome. It looks prehistoric, circa 1980, but that is where the hindrance to GenAI sat for decades, and it was a costly practice.
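
Here is a hand-rolled, hedged sketch of that repeated adjustment for a single sigmoid neuron, showing the chain rule at work. The input, target and learning rate are made up for illustration; real backpropagation repeats this across many layers and millions of weights.

```python
import numpy as np

# Hand-rolled backpropagation for one sigmoid neuron, repeated over many
# passes of the same training example.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.8])            # hypothetical input features
target = 1.0                        # hypothetical label
w = np.array([0.1, -0.2])
b = 0.0
learning_rate = 0.5

for _ in range(1000):
    z = np.dot(w, x) + b            # forward pass
    a = sigmoid(z)                  # prediction

    # backward pass (chain rule): dC/dw = dC/da * da/dz * dz/dw
    dC_da = 2.0 * (a - target)      # derivative of squared-error cost
    da_dz = a * (1.0 - a)           # derivative of sigmoid
    w -= learning_rate * dC_da * da_dz * x     # dz/dw = x
    b -= learning_rate * dC_da * da_dz         # dz/db = 1

print(round(float(sigmoid(np.dot(w, x) + b)), 3))   # prediction has moved close to the target 1.0
```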

Reflecting on Pre-Generative AI

Remember that while radios, transistors, computers and CPUs were being researched, a parallel stream of academia was already working on AI. Taking a leap back through history, we saw the emergence of ANNs, or artificial neural networks, with unsupervised, supervised and reinforcement learning conceptualized as early as the 1950s (Anderson & Mcneill, 1992; Katz et al., 1992), followed by the times of ELIZA's development. By the 1980s we had entered the age of CNNs, or convolutional neural networks, whose multiple filter layers reduced the number of parameters but suffered losses in sequential processing (O'Shea, 2015); they remain ever relevant thanks to the rapid growth in annotated data and the great improvements in the strength of GPUs (Gu et al., 2018). Recurrent neural networks then emerged, referencing and retaining information from previous steps through an internal hidden state, making them more efficient (Schmidt, 2019), but they had the vanishing gradient problem related to the gradient descent we studied earlier, limiting their use in long and complex processing (Grossberg, 2013).

A decade of upgrades followed, including the concept of backpropagation through time (BPTT), proposed by Paul Werbos in the 1970s and used for training RNNs to address their shortcomings in self-learning (Werbos, 1990). The foundational work by David Rumelhart and others in 1986 further developed the learning procedures for recurrent neural networks in their basic theory (D. Rumelhart et al., 1996). RNNs have since evolved, especially with the introduction of Long Short-Term Memory (LSTM) networks in the 1990s, which addressed some of the limitations of traditional RNNs (Hochreiter, 1997). In between came n-grams and Word2Vec, which catapulted the field's thinking but had limited reach towards objective outcomes.

Birth of Generative AI

Significantly, in 2017 the 'Attention' paper proposed parallel rather than sequential processing, introducing the Transformer's attention mechanism on top of the weighting mechanism to capture the significance of different parts of the input data. It was a revolution: unlike the serial, sequential processing of RNNs and LSTMs, it processed entire sequences simultaneously like a super brain, and this art of parallel processing made the Transformer super-efficient at appreciating complex relationships in data, for example keeping context and grammar intact in AI-led translation or image-processing tasks. It took a leaf out of human imagination and cognition: focus on what is relevant and needs attention.

When this process is consumed at industrial scale, where pre-training on large datasets has already been done, encoding and decoding of outcomes becomes quick. Thus LLMs gathered pace, and as Transformer-based models matured they arrived one after another, helping with sentiment analysis, text to speech and vice versa, image creation, object recognition and analysis, and more. It took what was essentially thought of as data science to altogether new dimensions.

Structurally, Transformers have two key components: self-attention and positional encoding. The former lets every token (say, a word) be contextualized and relate to all the others, while the latter gives the model a sense of the order of words or elements in the sequence. In this fashion, encoding and decoding at scale is established to bring a better plan to the table (W, 2024). As their paper reads:

“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”
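
To ground the quote, here is a bare-bones sketch of the scaled dot-product attention the paper describes, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, written in plain NumPy. The toy sequence length, embedding size and random projection matrices are illustrative stand-ins for what a trained model would have learned.

```python
import numpy as np

# Scaled dot-product attention, the core of the Transformer:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
rng = np.random.default_rng(2)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 5, 16                      # 5 tokens, 16-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))       # stand-in token embeddings

W_q = rng.normal(0, 0.1, (d_model, d_model))  # "learned" projections (random here)
W_k = rng.normal(0, 0.1, (d_model, d_model))
W_v = rng.normal(0, 0.1, (d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)           # how strongly each token attends to every other
attention_weights = softmax(scores, axis=-1)  # each row sums to 1
context = attention_weights @ V               # context-aware representation of each token

print(attention_weights.shape, context.shape) # (5, 5) (5, 16)
```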

They trained their model in just 3.5 days on eight NVIDIA GPUs, a small fraction of the time and cost of training prior models on billions of word pairs. Surveying the models that developed soon after, all of which owe billions to these academic researchers (Goel, 2023):

1. 2018: The Transformer era began with BERT (Bidirectional Encoder Representations from Transformers) from Google which, as the name suggests, reads text in both directions rather than sequentially like a book, enhancing the attention span and delivering systematic efficiency gains.

2. 2019: T5, or "Text-to-Text Transfer Transformer", was able to contextualize text sentiment along with translation and rapid question answering without losing context.

3. 2020: GPT-3, the Generative Pre-trained Transformer developed by OpenAI, arrived with 175 billion parameters to bring a tectonic shake-up to the industry.

4. 2021: Google's Switch Transformer and DeepMind's Gopher LLMs had 1.6 trillion and 280 billion parameters, respectively.

Since 2022: Continued innovation with models like Anthropic's Claude, Google's PaLM and Meta's OPT-175B, with subtle discourses on releases of Google Bard (later Gemini), Microsoft Copilot, IBM watsonx.ai and Meta's open-source Llama 2 large language model. Take a pause here to remember once again that these foundation models are still generalists: such a model might know a lot about a lot, but often cannot generate specific types of output with the desired accuracy or customization (Loss, 2024).

What started as a research paper at the 2017 NeurIPS conference has created an industry worth billions of dollars. Part two of this article will delve into another fundamental aspect: how GPU chipset infrastructure and architecture enabled this growth at exactly the right moment, without which we would not be where we are today.


References

1. Anderson, D., & Mcneill, G. (1992). Artificial neural networks technology: A DACS state-of-the-art report.

2. Goel, S. (2023). Evolution of transformers — part 1. https://sanchman21.medium.com/evolution-of-transformers-part-1-faac3f19d780

3. Grossberg, S. (2013). Recurrent neural networks. Scholarpedia, 8(2), 1888.

4. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., & Cai, J. (2018). Recent advances in convolutional neural networks. Pattern Recognition, 77, 354–377.

5. Han, J., & Moraga, C. (1995). The influence of the sigmoid function parameters on the speed of backpropagation learning. International Workshop on Artificial Neural Networks, 195–201.

6. Hochreiter, S. (1997). Long short-term memory. Neural Computation, MIT Press.

7. Jacobs, L. F. (2003). The evolution of the cognitive map. Brain, Behavior and Evolution, 62(2), 128–139. https://doi.org/10.1159/000072443

8. Jones, E., & Steinhardt, J. (2022). Capturing failures of large language models via human cognitive biases. Advances in Neural Information Processing Systems, 35, 11785–11799.

9. Katz, W. T., Snell, J. W., & Merickel, M. B. (1992). Artificial neural networks. In Methods in Enzymology (Vol. 210, pp. 610–636). Elsevier.

10. Loss, A. (2024). From neural networks to transformers: The evolution of machine learning. https://www.dataversity.net/from-neural-networks-to-transformers-the-evolution-of-machine-learning/

11. Murgia, M. (2023). Transformers: The Google scientists who pioneered an AI revolution. Financial Times. https://www.ft.com/content/37bb01af-ee46-4483-982f-ef3921436a50

12. O'Shea, K. (2015). An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.

13. Rumelhart, D., Durbin, R., Golden, R., & Chauvin, Y. (1996). Backpropagation. Institute for Cognitive Science, University of California, San Diego, La Jolla, CA 92093.

14. Rumelhart, D. E., Durbin, R., Golden, R., & Chauvin, Y. (n.d.). The basic theory.

15. Schmidt, R. M. (2019). Recurrent neural networks (RNNs): A gentle introduction and overview. arXiv preprint arXiv:1912.05911.

16. Talathi, S. S., & Vartak, A. (2015). Improving performance of recurrent neural network with ReLU nonlinearity. arXiv preprint arXiv:1511.03771.

17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

18. W, T. J. (2024). Transformers in AI: The attention timeline, from the 1990s to present. https://pub.towardsai.net/transformers-in-ai-the-attention-timeline-from-the-1990s-to-present-3702e53de184

19. Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. https://doi.org/10.1109/5.58337

