How Does AI Pay Attention?

The animal didn't cross the street because it was too tired.

What does ‘it’ refer to in this sentence? You don’t have to think twice to know the answer is ‘the animal’, and not ‘the street’.

Now, what about this sentence?

The animal didn't cross the street because it was too wide.

Here, you also don’t have to think twice to know ‘it’ refers to ‘the street’ and not ‘the animal’.

Properly processing these sentences requires your mind to consider the context and meaning of the words - no matter how far apart they are in the sentence(s). Thus, it should be clear that the longer the sentence(s), the more difficult this becomes.

Interestingly, I asked Microsoft Copilot to generate a longer example sentence but, despite multiple attempts and strong feedback from me, it couldn’t do it correctly. You might want to try it with your favorite chatbot. If you get a good example, please post it in the comments.

However, the point is that making these word connections is not as obvious to a computer as it is to you, which is why AI models need an attention mechanism.

Attention Is All You Need, a 2017 paper by six Google researchers and two academic researchers who were working at Google Brain and Google Research at the time, is recognized as one of the most important AI papers published in recent years. Its abstract begins with:

“The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.”

Before discussing attention, we need to discuss neural networks. They are modelled loosely on the human brain, in which some 86 billion neurons each have, on average, 7,000 synaptic connections to other neurons. An artificial neural network (ANN) consists of layers of artificial neurons (nodes) - an input layer, one or more hidden layers, and an output layer. Each node connects to others and has an associated weight determined by the training. It may also have an activation function that is applied to its output, allowing the network to learn and represent complex patterns in the data. We will not delve into the details of these functions.

To illustrate this architecture, we will use an ANN you can interact with from But what is a neural network?, a set of six videos from 3Blue1Brown, Grant Sanderson’s great site about discovery and creativity in math, with an emphasis on visualizations. I can’t recommend it highly enough!

The network was trained to recognize handwritten digits from 0 to 9 using MNIST , a database of 60,000 examples of handwritten digits, each labelled with the number it represents, along with a labelled test set of 10,000 examples.

As shown here, the network’s input layer consists of 784 neurons in a 28x28 array, each representing a ‘pixel’ of the input image where digits are hand drawn.

The input layer receives the hand-drawn digit as the grey scale values (0 to 1) of its 784 pixels.

Each neuron in this simple model holds a number from 0 to 1 (its activation) representing the grey scale value of the pixel it represents (0.58 in this example) - 0 for black to 1 for white. There are 10 neurons in the output layer, one for each of the base 10 digits 0 to 9.

The two hidden layers, shown in the next graphic, constitute the ‘black box’ where each neuron transforms the inputs from the previous layer and passes the result to the next layer. We can’t see exactly how that is done, but the strength of the connections between the neurons is based on their weights that were established in training with the MNIST dataset. In that training the model’s output is compared to the ground truth and the weights are adjusted until the output is as close to the ground truth as possible.

During training, neurons also establish biases – constants added to each neuron’s weighted input. Biases allow a neuron to activate even when the weighted sum of its inputs would not otherwise exceed its activation threshold, providing a level of adaptability that helps the network learn and make predictions effectively.
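To make this concrete, here is a minimal sketch in pure Python of what one layer of such a network computes: a weighted sum of the inputs, plus a bias, passed through an activation function (a sigmoid here). The weights below are random rather than trained, so this illustrates only the mechanics, not actual digit recognition.

```python
import math
import random

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """One layer of the network: each neuron computes a weighted sum
    of all inputs, adds its bias, and applies the activation function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(z))
    return outputs

random.seed(0)
pixels = [random.random() for _ in range(784)]  # one flattened 28x28 image
# Random (untrained) weights and biases for a 16-neuron hidden layer
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(784)] for _ in range(16)]
b_hidden = [0.0] * 16
hidden = layer_forward(pixels, w_hidden, b_hidden)  # 16 activations, each in (0, 1)
```

In the trained network this computation is simply repeated: once more for the second 16-neuron hidden layer and once for the 10-neuron output layer, with weights and biases set by training rather than at random.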

This layered architecture allows the network to learn hierarchical representations, with each hidden layer learning to recognize increasingly complex features. We will see more about this when we discuss a chest X-ray model developed here in Australia at CSIRO’s Australian e-Health Research Centre, where I’m based while here.

In the graphic, the network correctly recognizes the hand-drawn ‘9’. It is confident, since the output neuron is white, representing a value close to 1.

The sample model consists of the 784-neuron input layer, two hidden layers, each arbitrarily with 16 neurons, and a 10-neuron output layer.

You can try this yourself using this interactive graphic located near the top of the page I’ve linked to. I used it to create the next graphic by drawing the number 8 in the 784-pixel matrix. If you try it, I recommend maximizing the interactive graphic to make things as clear as possible.

The interactive graphic provides a live demonstration of the ANN. It’s from Chapter 3, Analyzing our neural network, from the 3Blue1Brown series.

So, we now have the necessary background to discuss the attention mechanism, which focuses the model on the most important tokens in the input sequence by altering the token embeddings.

First, what are tokens and embeddings?

Tokens are the elements of the prompt – the user’s input to the model. For text prompts, tokens are common sequences of characters found in text. They may be entire words or parts of words. OpenAI provides a simple tool to illustrate this. Using it, here are the tokens in our opening sentence.
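To get a feel for how text becomes tokens, here is a toy sketch in pure Python: it greedily matches the longest piece found in a small hand-made vocabulary, loosely in the spirit of subword tokenizers. Real tokenizers like OpenAI’s learn their vocabularies from data; the vocabulary below is invented purely for illustration.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization against a fixed vocabulary.
    Falls back to single characters when nothing in the vocabulary matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, shrinking until a match
        for length in range(min(len(text) - i, 10), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

# A made-up vocabulary covering our opening sentence
vocab = {"The ", "animal", " didn't ", "cross", " the ", "street"}
tokens = toy_tokenize("The animal didn't cross the street", vocab)
```

Note that tokens can include leading or trailing spaces, and a word the vocabulary lacks would be split into smaller pieces, which is exactly the behavior the OpenAI tool displays.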

OpenAI’s Tokenizer tool illustrates how tokens work to represent our opening sentence.

We discussed embeddings in detail in an earlier post. In it, I used an online embedding demonstration from Dr. Dave Touretzky of the Department of Computer Science at Carnegie Mellon University (CMU) to illustrate how embeddings create a mathematical representation of context and meaning that is at the core of generative AI. You may wish to read or re-read it before proceeding.

The key point is that the calculations that take place in the network between its input and output layers rest on the embeddings established during model training.

Now back to attention. It’s mathematical, so we will mostly discuss what it does rather than how it does it. For those of you who want a bit more, an attention model involves three main components: queries, keys, and values. It is often analogized to a retrieval system that looks for the best match to what you want. For example, when you search for videos on YouTube, the search engine maps your query (the text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in its database, then presents you the best-matched videos (values).
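For readers who want to see the query/key/value idea in code, here is the standard scaled dot-product form sketched in pure Python: each key is scored against the query (a dot product divided by the square root of the vector length), the scores are softmaxed into weights, and the values are averaged by those weights. The tiny 2-D vectors are made up for illustration.

```python
import math

def softmax(scores):
    # Turn raw scores into weights that are positive and sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query: score each key
    against the query, softmax the scores, and return the weighted
    average of the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Toy 2-D vectors: the query resembles key 0 far more than key 1,
# so the result is pulled toward value 0.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention(query, keys, values)
```

Because the weights always sum to 1, the output is a blend of the values, tilted toward whichever keys best match the query - the ‘retrieval’ in the analogy above.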

There are typically key words in any sentence that establish its meaning. For example, in ‘I am a boy’, ‘am’ is not as important to the meaning as ‘boy’. In natural language processing, the attention mechanism focuses the model on the important words in a sentence to improve the quality of the results. This is illustrated here for the two sentences we began with via a graphic from Transformer: A Novel Neural Network Architecture for Language Understanding, a web page discussing Attention Is All You Need by Jakob Uszkoreit, one of its authors.

The attention mechanism has calculated connections between words; the darker the connection, the stronger the relationship. It correctly connects ‘it’ to ‘animal’ in the first sentence and to ‘street’ in the second.

An illustration of the output of an attention mechanism. The darker line illustrates that ‘it’ refers to ‘animal’ in the opening sentence but to ‘street’ in the second one.

Keep in mind that the job of a generative AI model is to predict the next token in a sequence, with tokens representing whatever the sequence consists of – and that might not be words.
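That prediction step can be sketched in a few lines of pure Python: the model ends each step with a score (a logit) per candidate token, softmax turns the scores into probabilities, and here we greedily pick the most probable token. The candidate tokens and their scores below are invented purely for illustration.

```python
import math

def next_token(logits):
    """Convert per-token scores into probabilities via softmax and
    greedily return the most probable token along with the distribution."""
    exps = {tok: math.exp(score) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical scores for what might follow "The animal didn't cross the"
logits = {"street": 4.2, "road": 3.1, "banana": -2.0}
token, probs = next_token(logits)
```

Real models sample from the distribution rather than always taking the maximum, which is one reason the same prompt can yield different completions.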

To illustrate that, we now turn to work done by Aaron Nicolson, Jason Dowling, and Bevan Koopman here at CSIRO’s Australian e-Health Research Centre. So far, we have looked at the role of attention in natural language processing (NLP), but it is equally important in image processing. Here attention focuses the model on the parts of the input image that are most relevant to the task at hand. Our human visual attention mechanism does something similar when it focuses us on some aspects of our visual field while ignoring others.

The CSIRO researchers have developed CvT2DistilGPT2, an encoder-to-decoder model developed for chest X-ray report generation and trained on 270,790 images from the MIMIC Chest X-ray (MIMIC-CXR) Database, which contains both de-identified Digital Imaging and Communications in Medicine (DICOM) formatted image files and the corresponding free-text radiology reports from studies performed at the Beth Israel Deaconess Medical Center in Boston, MA.

They explain the need for a new model because “Automatically generating a report from a patient’s Chest X-rays (CXRs) is a promising solution to reducing clinical workload and improving patient care. However, current CXR report generators—which are predominantly encoder-to-decoder models—lack the diagnostic accuracy to be deployed in a clinical setting.”

They go on to explain that “To improve CXR report generation, we investigate warm starting the encoder and decoder with recent open-source computer vision and natural language processing checkpoints, such as the Vision Transformer (ViT) and PubMedBERT [now called MSR BiomedBERT].” Warm starting is initializing a model’s training with pre-trained weights rather than starting from scratch. We won’t dwell on the model itself but rather on what it can do and how attention plays a role in that.

First, we need to introduce cross attention, which merges two different embedding sequences that can be from different modalities, such as text and images. In this case, the words are a radiology report, and the image is the chest X-ray it describes.
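Cross attention can be sketched the same way as ordinary attention, except the queries come from one sequence (the report tokens) while the keys and values come from the other (the image patches). Below is a toy pure-Python sketch; all the vectors are made-up 2-D stand-ins for real embeddings, and the ‘patches’ are just two entries rather than a full image grid.

```python
import math

def cross_attention(text_queries, patch_keys, patch_values):
    """For each text-token query, score every image-patch key, softmax
    the scores, and blend the patch values by those weights."""
    d = len(patch_keys[0])
    blended = []
    for q in text_queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in patch_keys]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        blended.append([sum(w * v[i] for w, v in zip(weights, patch_values))
                        for i in range(len(patch_values[0]))])
    return blended

# One made-up query for a report token (say, 'CAB') and two image patches;
# patch 0 is constructed to resemble the query, as the clips region would.
queries = [[1.0, 0.0]]
patch_keys = [[0.9, 0.1], [0.0, 1.0]]
patch_values = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention(queries, patch_keys, patch_values)
```

The attention weights computed inside this loop are exactly what the colorized images below visualize: high weight on a patch means the model is ‘looking at’ that region while generating the token.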

The graphic shows the chest X-ray on the right and, at the upper left, the radiologist’s interpretation - the ground truth – and, at the lower left, an alternative interpretation generated by the model. In the generated report, tokens that are parts of words are separated by the | character.

This chest X-ray has been interpreted by a radiologist (top text) and by CvT2DistilGPT2 (bottom text). Generated tokens that are word parts are separated by |. Interesting clinical findings are in color.

We will focus on the first of these: CAB, indicating previous Coronary Artery Bypass Grafting (CABG) surgery, an interesting clinical finding highlighted in color.

As you can see in the images that follow, the X-ray is segmented into ‘patches’ by the model, and the embeddings roughly correspond to the patches.

This series of images is from the six layers of the model, with the first layer at the left. Progressing to the right, the layers are higher-level representations built on the lower layers, which typically model more primitive relationships.

The cross-attention mechanism colorizes the areas of interest: yellow for the highest, red for moderate, and no added color for the lowest amount of attention when generating the CAB token.

You can see that, as the processing progresses from left to right, the focus is increasingly on the clips left behind in the coronary arteries by the coronary artery bypass surgeon. The small metallic clips secure the arteries, prevent bleeding, and mark the surgical site. The model learned about them and related them to CAB through its training on paired images and reports.

A high-resolution version of the right-most image, from attention head 12 in layer 6. Attention heads allow the model to focus on different parts of the input sequence simultaneously. The colored pixels (yellow for highest, red for moderate attention) show that the model has focused on the clips in the coronary arteries.

Cross-attention is used to decide which parts of the visual features (regions in the image) are relevant to the current token (word in the report). For example, if more attention is given to visual features associated with CABG, the model is more likely to generate a token related to CABG (e.g., the sub-word CAB, or ‘status’, or ‘post’, etc.).

Wrapping things up: earlier I said, with respect to understanding the two opening sentences, that properly processing them “requires your mind to consider the context and meaning of the words”. If you are reading a document, your focus shifts from one word to another, depending on the context. Attention mechanisms mimic this, allowing models to selectively concentrate on specific elements of the input data while ignoring others.

The earlier Recurrent Neural Networks (RNNs), mentioned in the Attention Is All You Need abstract, processed text sequentially in a left-to-right or right-to-left fashion, reading one word at a time. This forces them to perform multiple steps to make decisions that depend on words far away from each other. The further apart the words are, the greater the likelihood that the model won’t make the right decision.

Jakob Uszkoreit’s web page, quoted earlier, explains that “in contrast, the Transformer only performs a small, constant number of steps (chosen empirically). In each step, it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position.”

All along the way the embeddings provide meaning. As the web page states “Word embeddings are numerical representations of words, designed to capture semantic relationships between words. The idea is to map each word to a high-dimensional vector, where similar words are closer in the vector space.”
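“Similar words are closer in the vector space” is usually measured with cosine similarity, the cosine of the angle between two embedding vectors. Here is a minimal pure-Python sketch; the three 3-D ‘embeddings’ are made up solely to illustrate the idea, since real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means nearly
    the same direction, near 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-D 'embeddings' purely for illustration:
# 'cat' and 'kitten' point in nearly the same direction; 'car' does not.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.9]
```

With these toy vectors, cosine_similarity(cat, kitten) comes out much higher than cosine_similarity(cat, car), which is the sense in which similar words are ‘closer’.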

Finally, returning to Attention is All You Need, the authors (in 2017) conclude in part by saying:

“We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.”

Clearly, subsequent events proved them to be prescient!


Thanks to my CSIRO colleagues, Aaron Nicolson and Jason Dowling , for their help in drafting and proofing this article.
