Chapter 2: Transformer architecture simplified: Neural Networks.
Continuing on from my first article:
I'll now try to explain the second motor of the transformer architecture: the neural network. The neural network tackles a huge challenge in our goal to create fluent and frictionless interaction between humans and computers: "human mess".
I think we can all agree that we are a generally messy species: we don't really operate with our surroundings in a structured way, and we certainly don't have anything resembling the logic gates or pathways of a circuit printed on silicon. A computer, on the other hand, is built on a solid foundation of mathematics, machine code and programming languages.
For a computer to interact successfully with humans, it needs to understand the subtleties of human language, and it needs to find structure and patterns in our "out of context noise".
Here's a wonderful series of films on how a computer works, if you want to dig into how differently a computer is structured. Highly recommended, if you have a month or two to kill:
For humans, it's a lot more complicated. Unfortunately, you might have to reserve a lifetime of study, and still be fine with dying without finding any answers. For people who enjoy the journey more than the destination, though, here is a great series of philosophers and scientists talking about the concepts of complexity and consciousness.
Whether it's by accident or design, we humans just don't "do" structure very well, and this is exactly where neural networks shine: they are amazing tools for dealing with the incredible amounts of unstructured data humans produce.
Side note before we continue: for anyone who thinks a "neural" network actually works like a human brain: no, they do not, and to my knowledge no one working in AI or neurology thinks they do.
Let's start with an everyday challenge to show how a neural network untangles our mess.
The challenge: Tagging family members in an online photo album #mum and #dad.
(Facial recognition by any other name).
This is a really handy feature for organizing all those thousands of photos we take every day. Let's kick things off.
Here’s a photo of my favorite pensioners:
So: this picture was taken on a new iPhone. It's high-resolution, it's HDR, there are a lot of objects in it (Australian national parks aren't very structured places), there are millions of pixels and colors, and to make life even more difficult for our neural network, there are two people it has never seen before.
The network starts the journey by asking for some human help. It asks you, the user, to identify and tag a set of photos of faces that it thinks line up with the people in the photo.
Once you have done that, the neural network has an input (the photo) and an outcome (photos + #mum and #dad). Now all it has to do is try to learn the rules and representations that made that outcome possible.
If you are technical, here are some great videos on this subject:
https://youtu.be/HGwBXDKFk9I?si=KC6dZE75wCwRWpSu (Intro maths of a CNN)
https://youtu.be/N_W4EYtsa10?si=yfpK_peYXb6148mt (Python face recognition walkthrough)
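And if you'd like a taste of what that tagging flow can look like in code, here's a minimal sketch using the open-source `face_recognition` Python library. To be clear, this is an assumption for illustration: the file names are made-up placeholders, and a real photo app's pipeline is far more involved.

```python
# A minimal sketch of the tagging idea, using the open-source
# face_recognition library (pip install face_recognition).
# File names here are hypothetical placeholders.
import face_recognition

# Step 1: you tag a known photo of each person ("human help").
mum_image = face_recognition.load_image_file("mum_tagged.jpg")
dad_image = face_recognition.load_image_file("dad_tagged.jpg")
mum_encoding = face_recognition.face_encodings(mum_image)[0]
dad_encoding = face_recognition.face_encodings(dad_image)[0]

# Step 2: the network turns each face in a new photo into a pattern of
# numbers (an encoding) and compares it to the tagged examples.
new_photo = face_recognition.load_image_file("national_park.jpg")
for encoding in face_recognition.face_encodings(new_photo):
    matches = face_recognition.compare_faces(
        [mum_encoding, dad_encoding], encoding)
    for name, matched in zip(["#mum", "#dad"], matches):
        if matched:
            print(f"Tag this face: {name}")
```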
Before we continue, we need to define the basics, otherwise the next steps won't make any sense at all. There are three layers of abstraction that we need to go through before we get to neural networks.
François Chollet's fantastic book "Deep Learning with Python" gives us this diagram as a starting point:
Level 1: Artificial Intelligence
We'll start in 1956, with John McCarthy using the term "AI" at the Dartmouth conference.
I always think of it as a term from an American 1950s sci-fi novel or film. That was also the general vibe in the second half of the '50s: AI embodies a combination of techno-optimism + we won the war + space flight + aliens from Hollywood. Now it's "an umbrella term for computer software that mimics human cognition in order to perform complex tasks and learn from them".
This is just my humble opinion, but intelligence is a concept we barely understand; humans being able to fluently interact with a computer is revolutionary enough.
Level 2: Machine Learning
The second step in the puzzle is machine learning. The first key characteristic of machine learning is that it's all about data; the second is that it isn't programming, it's training. And what do we train? We train a model to meaningfully transform our data, to become adaptable by learning rules, patterns and representations from our data. A tiny, hedged sketch of that "training, not programming" idea follows below.
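Here's that sketch, assuming scikit-learn is installed. The point: we never write the rule ourselves; the model learns it from labeled examples. All the data here is made up.

```python
# "Training, not programming" in miniature, assuming scikit-learn.
# We never write the rule ourselves; the model learns it from
# labeled examples. All data here is invented for illustration.
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [6], [7], [8]]  # input: hours of sunshine
y = [0, 0, 0, 1, 1, 1]              # outcome: beach day? (0=no, 1=yes)

model = LogisticRegression().fit(X, y)  # training: find the pattern
print(model.predict([[5]]))             # apply the learned rule to new data
```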
So what's a representation? It's just a way of looking at data.
For example, a personal budget for the month of January could be represented as a table in an Excel sheet, it could be represented graphically as a pie chart, or it could be represented in audio form.
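Here's what two of those representations of the exact same data might look like in code, assuming pandas and matplotlib are installed. The budget numbers are invented.

```python
# The same made-up January budget in two representations.
# Assumes pandas and matplotlib are installed.
import pandas as pd
import matplotlib.pyplot as plt

budget = {"Rent": 900, "Food": 350, "Transport": 120, "Fun": 130}

# Representation 1: a table
print(pd.DataFrame(list(budget.items()), columns=["Category", "EUR"]))

# Representation 2: a pie chart of exactly the same data
plt.pie(budget.values(), labels=budget.keys())
plt.title("January budget")
plt.show()
```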
Machine learning is the part of the puzzle where we see patterns in data: correlations, probabilities. If you've worked with a data team or with large amounts of data, for work, study or just for kicks, chances are you've used machine learning. It's also great for sorting through huge chunks of "human mess" and finding patterns that would be impossible for us to discover ourselves.
Level 3: Deep Learning (Deep Neural Networks)
If you want to look at the basic math, these YouTube films are legendary.
If you don't, well, let's get started :) This is how a deep neural network is drawn in thousands of textbooks:
First up, the diagram has three kinds of layers (an input layer, hidden layers and an output layer), built from two basic parts:
Neurons: the dots. These are the little processing units of our network; they make the calculations and help decide whether a piece of data moves forward into the next layer.
Weights: the lines. These determine the strength and direction of the influence one neuron has on another.
The word "deep" refers to the many layers that information has to pass through on its way across the network.
And the word "learning" is a combination of two processes: a forward pass, where the network makes a prediction, and a backward pass, where it adjusts its weights based on how wrong that prediction was.
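Here are those two processes in miniature: a single neuron with one weight, in plain Python. This is deliberately oversimplified, a sketch of the idea rather than a real network.

```python
# "Learning" in miniature: a forward pass and a backward pass on a
# single neuron. Deliberately oversimplified, plain Python.
x, target = 0.5, 1.0        # one input and the outcome we want
weight, lr = 0.1, 0.5       # a starting weight and a learning rate

for step in range(50):
    prediction = weight * x        # forward pass: make a guess
    error = prediction - target    # how wrong were we?
    gradient = error * x           # backward pass: trace the blame
    weight -= lr * gradient        # tweak the weight a little

print(round(weight * x, 3))        # after 50 tweaks the guess is ~1.0
```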
Got it? Okay, now let's go back to our photo and try to use the network to recognize the faces.
Layer 1: we start by flattening every pixel around the faces in the photo into a series of numbers on a grid. (I'm going to skip bounding boxes and CNNs, convolutional neural networks, not the news channel; this is a non-technical, high-level example.)
Ever seen one of those pixel-art coloring books that managers use to avoid burnout? That's kind of how the network wants to see the image: we go from a photo, to pixels, to numbers.
What are those numbers based on? It could be a lot of things, but let's say for now that it's a shade of grey, somewhere between black and white:
0 is white, or nothing
0.1 is light grey
0.2 is grey
0.3 is darker grey
…and so on, all the way up to…
1 is black
Once we have the numbers, we stack them vertically and we officially have our first layer.
(For example: MNIST. It's not our use case, but it gives you an idea of that first step.)
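And here's what that first flattening step might look like in code, assuming NumPy. The 3x3 "photo" is made up; a real face crop would be thousands of pixels.

```python
# Layer 1 in code: a made-up 3x3 patch of grey values (0 = white,
# 1 = black), stacked into one column of numbers. Assumes NumPy.
import numpy as np

patch = np.array([
    [0.0, 0.9, 0.0],
    [0.1, 1.0, 0.2],
    [0.0, 0.8, 0.0],
])

input_layer = patch.flatten()  # the grid becomes our first layer
print(input_layer)             # 9 numbers, ready for the network
```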
Layer 2: looks for combinations of dark pixels (high numbers) next to light pixels (low numbers), positive and negative space for the art students out there, which we can assume make up the edge of an object.
Once we have the edges of objects, we start to see the beginning of a pattern in all those numbers.
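To make "dark next to light" concrete, here's a tiny hand-written edge detector, assuming NumPy. A real network learns its own filters during training; this one is hard-coded just to show the idea.

```python
# What "dark pixels next to light pixels" can mean in code: a tiny
# hand-written edge detector. A real network learns its own filters;
# this one is hard-coded for illustration. Assumes NumPy.
import numpy as np

row = np.array([0.0, 0.0, 0.1, 0.9, 1.0, 1.0])  # white -> black
edges = np.abs(np.diff(row))  # big jumps between neighbours = an edge
print(edges)                  # the 0.8 marks where the edge sits
```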
Layer 3: looks for shapes on the grid, a line or a curve. Think of this layer as a game of Battleship: A5 to A10 on the grid all have the same or very similar color values, so it seems to be a line! (You sunk my cruiser!)
Now we don't just have the edges and outlines of objects; we also have shapes, lines and curves.
Layer 4: goes a step further, and those pixels, edges and lines all get combined into more complicated representations of a nose, eyes, a mouth and so on. These more advanced combinations of numbers on a grid get close to the expected combination, and so, after four steps, we are ready to test our network against the output layer.
The output layer has a photo that has been tagged by us, so it knows the result already. (This is one of the key characteristics of machine learning and deep learning: we give the model the input and the output; we train it, we don't program it.) It then compares the numbered patterns of that known output with what the network said the input photo was. How close did it get? 30%? 60%? That's not good enough. We click thumbs down, curse AI's limitations, go back and tweak some of those weights (and some biases), and see if we can improve the score. (And then do it again, and again.)
When people talk about training a neural network, it's this back and forth of tweaking and human feedback that they mean.
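Here's the whole back-and-forth in one sketch: a small layered network in Keras, the library behind Chollet's book. The face crops and labels are random stand-ins, assumptions for illustration only; a real tagger would train on actual labeled face crops.

```python
# The training loop in one place: a small layered network in Keras.
# The "face crops" and labels below are random stand-ins.
import numpy as np
from tensorflow import keras

# Pretend data: 100 tiny 28x28 grey face crops, labeled 0=#mum, 1=#dad
faces = np.random.rand(100, 28, 28)
labels = np.random.randint(0, 2, size=100)

model = keras.Sequential([
    keras.Input(shape=(28, 28)),
    keras.layers.Flatten(),                       # layer 1: pixels to numbers
    keras.layers.Dense(64, activation="relu"),    # hidden layers: edges,
    keras.layers.Dense(64, activation="relu"),    # shapes, face parts...
    keras.layers.Dense(1, activation="sigmoid"),  # output: #mum or #dad?
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# "Training" = forward pass, score the error, nudge the weights, repeat
model.fit(faces, labels, epochs=5)
```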
I know the neural network of my photo app is working when it can look at photos of my folks it’s never seen before and successfully tag them.
Hopefully you now have a basic idea of a simple neural network and how information moves forward through its layers and then back again, adjusting until it finally gets it right. This layered architecture works very well with human mess: each step lets the network get closer and closer to the structure it needs to operate effectively and computationally.
Now, what about that feed-forward network that works inside all those Matryoshka-like black boxes of the transformer architecture?
Have a guess… that's right: it doesn't move backwards. It's a pre-trained network, so it doesn't have to go back and forth to get it right. It picks up the embedding and pushes it through pathways that were worked out during the training of the model.
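In code, that forward-only behaviour looks something like this, assuming NumPy. The toy weights stand in for what training already worked out; the shapes and numbers are invented for illustration.

```python
# Inference only: a pre-trained network just pushes numbers forward.
# These toy weights stand in for what training already worked out.
import numpy as np

W1 = np.array([[0.2, -0.4], [0.7, 0.1]])  # "frozen" learned weights
W2 = np.array([0.5, -0.3])

embedding = np.array([1.0, 2.0])           # the token embedding coming in
hidden = np.maximum(0, embedding @ W1)     # one layer + ReLU
output = hidden @ W2                       # and out the other side
print(output)                              # no backward pass, no tweaking
```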
Which pathways those are and why is something I'll explain in the next chapter :-)
Hope this was clear, and if you have any questions or remarks, let me know!