Forever Artificial Intelligence Summer: a not-too-technical explanation of GenAI
Figure 1

“You shall know a word by the company it keeps” – John Rupert Firth
“Do submarines swim?” – Noam Chomsky

A summer project.

It was a long journey before the computer made real progress on the Turing Test, if not outright passed it. The world was exposed to that progress in 2022 with the public release of ChatGPT. Or was it more fully realized in 2023 with GPT-4? Or maybe not quite yet (closer, but not quite).

We are a long way from the 1956 meeting at Dartmouth College, where participants thought AI could be significantly advanced in two months. In 1966, Marvin Minsky, one of the participants, believed computer vision would make a summer project for an undergraduate. Winters came and summers came, until the current colossal boom. Finally, AI is ready for prime time. Finally, it is going to be a forever summer. It is in the hands of the consumer. Job losses, upended industries, and all, it is the best time to be a Data Scientist.

AI, or Artificial Intelligence, is the field whose goal is to emulate intelligence in a machine. This definition makes a significant leap in assuming that the intelligence humans have possessed since their last jump up the evolutionary ladder does not already reside in a machine. Some philosophers and physicists aside, a prevalent opinion in the Computer Science community is that the brain is a computer. Be that as it may, the field of artificial intelligence, starting at that conference, took it upon itself to figure out how intelligence can be created in a computer.

From Neuron to Computation

Early attempts took their inspiration from the brain, and this is how the artificial neuron was invented. In essence, and after some refinements, a "perceptron" is a simulation of a biological neuron. It is a function that maps an input to an output (or a sequence of numbers to a number). It takes several numbers as inputs, which we call the features, or x (from 1 to n), multiplies each one by another number, which we call the weights, or w (from 1 to n), and then sums the results together, adding an extra number known as the bias (empirically shown to be useful). It then passes the result (a single number) through an activation function. There are many suitable such functions; the rectified linear unit (ReLU) is a recent and popular one, mapping a positive number to itself while zeroing out any negative number. Activation functions are necessary to create non-linearities in a network built from multiple neurons; without them, a network of artificial neurons is mathematically equivalent to a single neuron. The final result is a number that is the output of the perceptron.
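A minimal sketch of a single perceptron in Python with NumPy, with features, weights, and bias made up purely to show the mechanics:

```python
import numpy as np

def relu(z):
    # ReLU: positive values pass through unchanged, negatives become zero
    return np.maximum(0.0, z)

def perceptron(x, w, b):
    """One artificial neuron: weighted sum of inputs plus bias, then activation."""
    return relu(np.dot(w, x) + b)

# Hypothetical numbers, chosen only for illustration
x = np.array([0.5, -1.2, 3.0])   # features x1..xn
w = np.array([0.8, 0.1, -0.4])   # weights w1..wn
b = 0.2                          # bias
print(perceptron(x, w, b))       # a single output number
```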

Figure 2

In their book "Perceptrons," Marvin Minsky and Seymour Papert proved that a single perceptron cannot compute the XOR function. XOR is a particular type of logical gate that returns "True" only if its two binary inputs differ (one True and one False). This ushered in a winter for artificial neurons. A few researchers kept poking at how to simulate the brain, but the rest moved on to methods based on logic and determinism, which sought to recreate intelligence by programming rules into a computer.

Figure 3

Today, we realize the extent of the second group's grand failure to achieve practical artificial intelligence. Their research did produce some valuable technologies for modern-day programmers, such as object-oriented and functional programming, and there were successes, most notably when IBM's Deep Blue beat Garry Kasparov at chess and, later, a hybrid of traditional and machine learning techniques won at Jeopardy!. They ultimately failed, however, because our brains perform many intelligence tasks almost subconsciously.

In his book "Thinking, Fast and Slow," the cognitive psychologist Daniel Kahneman divides our thinking into two categories. System 1 is the pattern-recognition type of thinking: recognizing a face, a smell, or a cliché. System 2 involves deliberate thinking, logical reasoning, and reflection.

We do not have to think much about how we recognize a face when we see one – it is System 1. This also means we cannot faithfully describe a face in a computer language. When a person describes a face to a sketch artist, both already understand what a face looks like; they use some basic shared knowledge about a face's features and then customize it to draw the specific face. Having a computer draw a face, by contrast, requires starting with no shared knowledge at all. What are today termed old-fashioned AI techniques would be very lengthy and descriptive in computer-language terms and would ultimately reach a much lower accuracy than today's neural network models.

Show versus tell.

The story would unfold because of the few pioneers who stuck with what used to be known as connectionism (neural networks, or Deep Learning, today): people like Geoff Hinton, Yann LeCun, Yoshua Bengio, and others. These mavericks kept refining and advancing neural network theory. In the late eighties and early nineties, a pivotal theorem was proved. The Universal Approximation Theorem states that a network of perceptrons with a single hidden layer can approximate any continuous function. This meant a neural network with only three layers (the hidden layer being the middle one) and enough neurons could, loosely speaking, be Turing Complete (i.e., do everything a computer program can).

Figure 4

Yet how do you get a neural network to do anything useful? This is where the technique popularized by Geoff Hinton brought him fame (he, Yann LeCun, and Yoshua Bengio would later share a Turing Award). A fully connected network of multiple neurons takes numbers as inputs and outputs another set of numbers. We need, however, to adjust the weights of this network so that it computes the function we expect. Manual or random adjustment can only get us so far and becomes impossible for large networks. This is where backpropagation comes in: it "trains" the network by adjusting the weights intelligently so it learns to classify data correctly.

Picking the XOR function, say you want to train the network to return 0 (representing False) when its two inputs are equal (both 1, representing True, or both 0), and to return 1 when the two inputs differ. First, you need to craft the shape of the network and its connections – also known as modeling the network. Model design is an art and a science, and digital-logic intuitions work well for the XOR network. We create a network of three layers as follows. The first layer has two neurons for the two binary inputs. The second, middle layer (or hidden layer) also has two neurons with an activation function; we will use the sigmoid activation function (a function that takes any input and squishes it into a range between 0 and 1). These connect to an output neuron that also uses the sigmoid activation function.
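As a sketch of that design (the layer sizes and sigmoid as described above, with weights still random and therefore meaningless), the forward pass fits in a few lines of Python:

```python
import numpy as np

def sigmoid(z):
    # Squashes any input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# 2 inputs -> 2 hidden neurons -> 1 output; all weights are random for now
W_hidden = rng.normal(size=(2, 2)); b_hidden = rng.normal(size=2)
W_out = rng.normal(size=2);         b_out = rng.normal()

def xor_net(x):
    h = sigmoid(x @ W_hidden + b_hidden)   # hidden layer
    return sigmoid(h @ W_out + b_out)      # output neuron

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # arbitrary outputs until trained
```

Until the weights are trained, these outputs are just noise between 0 and 1; the next sections show how training fixes that.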

Figure 5

And now we get to training it. While traditional AI relies on human programmers to tell the algorithm precisely what to do, machine learning algorithms are taught what to do by data. In what is called Supervised Learning, the data comes in the form of many feature–label pairs: the features are the inputs, and the label is the output. For logical XOR, these are the rows of its truth table. We first initialize the weights to random values (an empirical trick). If we then feed two values into the network, it will output some arbitrary value. We want to teach it to return the correct output for a given input, and the training needs a way to guide it toward that correct output.

This is where the loss function comes in. It is a way to compare the actual output to the expected output. We will use the Mean Squared Error: take the difference between each observed output and its expected output, square it, and average over all the examples in the training set. The loss is zero exactly when every output matches its expected output.
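Written out for N training examples, with network outputs ŷᵢ and expected outputs yᵢ, that is:

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2$$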

We can then use the chain rule of calculus to compute the derivative of the loss with respect to every weight and adjust each one gradually. Backpropagation with gradient descent works out the contribution of every weight in the network to the error measured by the loss function and nudges it slightly on each iteration of the algorithm. The algorithm runs for many iterations until the loss function reaches its minimum. Gradient descent, it should be mentioned, is a method that follows the derivatives downhill to a minimum of a function (in this case, the loss function); it is a numerical relative of the optimization techniques you might have met in an introductory high school or university calculus course.
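A compact NumPy sketch of the whole procedure – random initialization, forward pass, mean squared error, backpropagation via the chain rule, and a gradient-descent step. The learning rate, seed, and iteration count are arbitrary choices and may need tweaking before the loss actually reaches its minimum:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR truth table: features X and labels y
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random initialization (the empirical trick mentioned above)
W1 = rng.normal(size=(2, 2)); b1 = np.zeros((1, 2))   # hidden layer
W2 = rng.normal(size=(2, 1)); b2 = np.zeros((1, 1))   # output neuron

lr = 2.0
for step in range(20000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    loss = np.mean((out - y) ** 2)        # mean squared error

    # Backward pass: chain rule, layer by layer
    d_out = 2 * (out - y) / len(X) * out * (1 - out)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    d_h = (d_out @ W2.T) * h * (1 - h)
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0, keepdims=True)

    # Gradient descent: nudge every weight against its gradient
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(out.round(2))   # should approach [0, 1, 1, 0]
```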

Figure 6

We have trained a basic network to compute XOR, and we did it by reaching the point that minimizes the loss function. In a more general setting, where, for example, we want to classify images of cats and dogs, we do not want to reach the minimum and fit the training data perfectly. If we did, the algorithm would not generalize to new images. A good analogy is studying for a test by memorizing all the answers to previous exam questions instead of learning the subject's basic concepts.

Figure 7

Machine learning algorithms need to be good pupils. As such, they use techniques known as regularization that help them learn the underlying distribution of the data rather than memorize the answer for every point. Not surprisingly, one of the most successful regularization techniques for Deep Learning, dropout, was developed by Geoff Hinton. It is not fully understood why it works (like much in Deep Learning). A good general way to grasp these techniques is that they add some penalty to the loss function or some smoothness to the learned function; in doing so, they filter out the noise in the data points to arrive at the underlying function of the data distribution.
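As an illustration only (this is the "inverted dropout" variant commonly taught, not necessarily Hinton's original formulation), dropout can be sketched in a few lines:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Randomly zero a fraction of activations during training; do nothing at inference."""
    if not training:
        return activations
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop
    # Rescale the survivors so the expected activation stays the same
    return activations * mask / (1.0 - p_drop)

h = np.array([0.7, 1.3, 0.2, 2.1])      # hypothetical hidden-layer activations
print(dropout(h, rng=np.random.default_rng(0)))
```

By randomly silencing neurons, no single neuron can memorize a single data point, which pushes the network toward the underlying pattern.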

Models and architectures

We touched upon modeling a neural network. As we know from the Universal Approximation Theorem, a sufficiently large, fully connected deep network can approximate any continuous function. So why do we develop particular models for neural networks? The answer is that by modeling for the problem we are solving, we create architectural designs that optimize the learning process and make it computationally feasible. Rather than using a generic deep network for all tasks, specialized architectures – like convolutional neural networks (CNNs) for image recognition or recurrent neural networks (RNNs) for sequential data – have been developed to exploit the patterns and structures inherent to specific data types.

Figure 8

The earliest was the convolutional neural network (CNN), pioneered by Yann LeCun. This network organizes neurons into sheets resembling an image or a screen. Each sheet captures (or is activated by) the pixels of the image through operations called convolutions. The sheets are stacked one after the other, getting narrower and narrower, gradually extracting the features that activate each layer, until the network classifies the image into the proper category. CNNs had a lot of success in the nineties for optical character recognition. However, a business decision years later by the CEO and co-founder of Nvidia, Jensen Huang, would bring CNNs into the mainstream and usher in the age of Deep Learning.
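A convolution itself is a simple operation: slide a small filter over the image and record one activation per position. A bare-bones sketch, where the toy image and the vertical-edge filter are illustrative choices rather than anything from a real CNN:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image; each position produces one activation."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).random((6, 6))   # toy 6x6 grayscale "image"
edge_filter = np.array([[1., 0., -1.]] * 3)       # responds to vertical edges
print(convolve2d(image, edge_filter).shape)       # (4, 4) sheet of activations
```

In a real CNN, the filters are not hand-picked; they are weights learned by backpropagation, and many of them are stacked layer after layer.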

In 2006, Huang allowed developers to access and program his graphics processing units (GPUs) through a language extension to C/C++ known as CUDA. It was a minor event at the time. However, GPUs' unique usefulness for linear algebra operations would prove decisive in 2012 during a competition known as ImageNet.

AI pioneer Fei-Fei Li created a database of 14 million labeled images and launched a competition in 2010 to advance the state of image recognition. In 2012, a team from the University of Toronto, spearheaded by Geoff Hinton, won the competition, cutting the error rate by roughly 10 percentage points compared with the second-place entrants, who used more traditional techniques. The Canadian team used two Nvidia GPUs in what was essentially a gaming PC, and their architecture was a convolutional neural network. This brought neural networks to the attention of the technology giants. Geoff Hinton would sell his company to Google and join it, and a number of Deep Learning and neural network experts would leave academia to work for Big Tech.

Technology companies have since consistently applied Deep Learning, demonstrated its advantages, and integrated it into their products. Still, the average consumer could be forgiven for holding a naïve view of mere incrementalism. All this changed with a tweet on November 30, 2022, from a group called OpenAI. It read, "Try talking with ChatGPT, our new AI system optimized for dialogue. Your feedback will help us improve it." It changed the whole landscape. Within days, it reached 1 million users. Even though it was an early version, it showcased a chatbot that could, for the first time, pass the Turing Test, and not in a trivial way.

Figure 9

The Transformer rolls out

To understand the revolution of 2022, we must go back five years to a paper titled "Attention Is All You Need" by Vaswani et al. from Google. The paper introduced a new neural network architecture dubbed the Transformer. Until then, the community's consensus was that various recurrent network architectures were the right tool for Natural Language Processing (NLP). Vaswani and company came up with a new network architecture that is highly parallelizable, which meant you could run it on massive cloud computers and scale horizontally.

A Transformer comprises three major components: the embedding, attention (or rather self-attention), and fully connected layers. It brings many capabilities together. It can understand a word relative to other words – for instance, that "woman" and "lady" are closer in meaning to each other than either is to "man" or "guy". It can tell which words in a sentence (or a whole passage) are closer in meaning to one another and which are farther apart. It can understand a word in its proper context: if you tell it, "The trophy did not fit in the case because it was smaller," it will understand that the case was smaller than the trophy. It can also generate new words, classify the sentences it has seen, and perform various other NLP tasks. So, let's break the Transformer apart.

Figure 10

All you need is to pay attention.

We start with the embedding layer. Word embeddings have been around for a while. The method converts words into series of numbers (also called vectors) whose values represent semantic meanings and relationships between words. You can think of it as projecting words into a multi-dimensional space: words with similar meanings sit close to one another, and directions in that space carry meaning. With these embeddings, you can do math on meaning, which is the key to unlocking language. A famous example is an SAT-type analogy: king is to man what queen is to woman. Expressed in terms of vector embeddings, we get "king – man + woman = queen." Word2Vec and GloVe are popular word embedding techniques. A side but important note: a positional embedding is added to each word embedding to indicate where the word is located in the sentence, so the Transformer keeps the order in mind even though words are processed in parallel.
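The arithmetic can be demonstrated with toy vectors. These three-dimensional numbers are invented purely for illustration; real Word2Vec or GloVe embeddings have hundreds of dimensions learned from text:

```python
import numpy as np

def cosine(a, b):
    # Similarity of direction between two word vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

emb = {   # made-up toy embeddings
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

analogy = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], analogy))
print(best)   # "queen" with these toy vectors
```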

Figure 11

The attention mechanism is the pivotal component of the Transformer. It determines which parts of the input (e.g., a sentence) should be focused on at any given time. When we humans read a long paragraph, our brain naturally pays more "attention" to certain essential words or phrases; the attention layer does the same inside the network. The mechanism allows the model to weigh the importance of the words surrounding a given word, which lets the Transformer capture long-distance dependencies and relationships in the data – something that was challenging for older models – and helps pin down the meaning of a word in context. Consider again the example: "The trophy did not fit into the suitcase because it was too big." Here, the attention between the pronoun "it", the adjective "big", and the word "trophy" will be the highest.

If word embeddings gave meaning to individual words, attention now gives contextual meaning to sequences and their interrelations. All this is possible because we can do math on words once they are projected as vectors into the high-dimensional embedding space. The attention mechanism itself is a mathematical operation that takes a context word together with all the surrounding words. We usually run multiple attention heads in parallel – multi-headed attention – so the model can pay attention to various parts of the sentence at once.
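To make that "math on words" concrete, here is a minimal NumPy sketch of single-head, scaled dot-product self-attention with random weights, leaving out masking, positional details, and the multi-head machinery:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how much each word attends to every other word
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # each word becomes a weighted mix of the others

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                   # 5 "words", 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8): same shape, now contextual
```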

Figure 12

Finally, there is the fully connected feed-forward network (FFN). After passing through the embedding and attention layers, the transformed data goes through the FFN. This is the basic neural network we encountered at the start of the essay: multiple neurons connected to one another across hidden layers. It performs further transformations to produce the desired output. We also note that numerous Transformer blocks are stacked together to refine the representation and produce better outputs.

Figure 13

The revolutionary aspect of the transformer architecture is that it can handle a vast amount of data in parallel. This parallel processing capability made it feasible to train much larger models, leading to significantly improved performance on various tasks.

Following the introduction of the Transformer, models like BERT, GPT (Generative Pre-trained Transformers), and their subsequent iterations emerged, becoming state-of-the-art for Natural Language Processing (NLP). These models were pre-trained on enormous datasets and then fine-tuned for specific tasks, leading to unparalleled performance improvements.

Tuning your instrument

That was in 2017. What went on during the next five years? Part of the answer is the cloud and bigger, more refined models. But the other part lies in every data scientist's bane and raison d'être: fine-tuning. When the Transformer was first invented at Google, the task was translating between two languages. There was an oddity: it could also be used to continue a piece of text. I could start writing this essay, feed it into a transformer model, and it would continue it, undirected, taking it wherever its training dictated. When popular media articles say that ChatGPT is only predicting the next word, that is what is happening – only it has been fine-tuned to predict the next word in a way that answers a question. This was achieved through a four-stage process, as described by Andrej Karpathy in his talk at Microsoft Build 2023.

Figure 14

In the first stage, pre-training, the GPT network is trained on a dataset as big as the internet, supplemented with books and other data. The main idea is language modeling: the training process hides words in sentences and forces the network to predict them (for GPT, the next word given the words so far). We can view this as a Supervised Learning (classification) problem where the labels are the words to predict and the features are the surrounding words. By doing this, the Transformer learns the word distribution of the data it is fed, so it can continue sentences – but its output will be unpredictable. We note the economical nature of this approach: the data can be pre-processed algorithmically and does not need to be labeled manually.
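The economy comes from turning raw text into training pairs automatically. A toy sketch of how a sentence becomes (features, label) examples, simplified to whole words and a tiny context window rather than the tokens and long contexts real models use:

```python
text = "the trophy did not fit into the suitcase because it was too big".split()

context_size = 3   # arbitrary, tiny context window for illustration
pairs = [(text[i:i + context_size], text[i + context_size])
         for i in range(len(text) - context_size)]

for features, label in pairs:
    print(features, "->", label)
# ['the', 'trophy', 'did'] -> 'not'
# ['trophy', 'did', 'not'] -> 'fit'   ... and so on, with no human labeling required
```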

An early discovery was that if you provide such a model with a question, it will sometimes answer it, perceiving the answer as a continuation of the question. Other times, it will respond with another question. So the next stage is to fine-tune the model on a set of high-quality questions paired with good answers, curated and refined from sites like Quora. This stage increases the likelihood of the model responding to a question with an answer.

But we are far from done. The model might still spew nonsense or harmful content. To understand this better, we take a detour back to the world of word embeddings. Sexism was discovered in early embedding models: if you prompted "man is to doctor as woman is to…", you would get "nurse". Intuitively, this makes sense, since in early internet texts and books a doctor was most likely a man and a nurse a woman. Techniques based on linear algebra and other methods were developed to de-bias those embeddings, and what applies to embeddings applies equally to the high-level contextual text returned by the Transformer.

Ring my bell

The next fine-tuning step uses a technique known as Reinforcement Learning from Human Feedback (RLHF). Reinforcement Learning is another artificial intelligence technique, inspired by behaviorism and experiments such as Pavlov's dog. Its most recent incarnation, which, among other things, won the game of Go, uses a neural network and updates its weights by following a reward function. A good example is the video game Super Mario: a reinforcement learning algorithm learning to play Super Mario updates its weights, during training, for every action that increases the score, and after thousands or millions of iterations of self-play it becomes an expert Mario player.
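A toy sketch of the reward-following idea: a made-up two-action "game" in which only one action scores, with a simple policy-gradient-style update. Nothing here is the actual Mario or Go setup; it only shows how rewards shift the policy:

```python
import numpy as np

rng = np.random.default_rng(0)
prefs = np.zeros(2)               # the agent's learnable preference for each action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for episode in range(2000):
    probs = softmax(prefs)
    action = rng.choice(2, p=probs)
    reward = 1.0 if action == 1 else 0.0      # only action 1 "raises the score"
    grad = -probs                              # gradient of the log-probability...
    grad[action] += 1.0                        # ...of the chosen action
    prefs += 0.1 * reward * grad               # reinforce rewarded actions

print(softmax(prefs))   # probability mass shifts toward the rewarding action
```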

Figure 15

RLHF, as applied to GPT, trains a separate neural network on the inputs and outputs of the pre-trained transformer model. Human labelers then teach this network to distinguish high-score from low-score outputs: outputs like conspiracy theories and hate speech get low scores, whereas more factual and acceptable outputs get higher scores. Reinforcement learning is then used to further train the Transformer with this scoring network as the reward function. The bad gets weeded out; the good gets reinforced. In this way, the Transformer was fine-tuned to answer questions the way OpenAI expected. A good way to think about this is going to a tailor, choosing a suit, and having it adjusted to your body.

One final technical detail we note about ChatGPT is how it achieves continuous conversation. The Transformer predicts and generates one word at a time (at least in simpler models). That word is then fed back in, together with all the preceding words, to generate the next word, until an "end token" emerges (you can think of tokens as stand-ins for words and other inputs and outputs). For conversationality, the ChatGPT backend (a programmatic layer, not the network itself) feeds the previous conversation, up to a token limit, back into the Transformer, which then responds by generating new words, and so on.
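A purely hypothetical sketch of that loop – the real ChatGPT backend is not public, and the `predict_next` method, the token limit, the end token, and the canned stand-in model are all assumptions made for illustration:

```python
MAX_CONTEXT = 4096          # assumed token limit
END_TOKEN = "<|end|>"       # assumed end-of-reply token

def generate_reply(model, conversation_tokens):
    """Feed the conversation back in and grow the reply one token at a time."""
    context = conversation_tokens[-MAX_CONTEXT:]    # drop the oldest tokens past the limit
    reply = []
    while len(context) + len(reply) < MAX_CONTEXT:
        next_token = model.predict_next(context + reply)   # hypothetical method
        if next_token == END_TOKEN:
            break
        reply.append(next_token)
    return reply

class CannedModel:
    """Stand-in so the sketch runs; a real model would be the fine-tuned Transformer."""
    replies = ["It", "was", "the", "suitcase", ".", END_TOKEN]
    def predict_next(self, tokens):
        return self.replies[len(tokens) - 3]   # 3 = length of the prompt below

print(generate_reply(CannedModel(), ["Which", "was", "smaller?"]))
```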

The blind will see

Transformers have not been limited to natural language and sequence processing. They have also proven effective in vision, image recognition, and image generation tasks. Indeed, contrary to common belief, the previously mentioned convolutional neural networks (CNNs) can be viewed as a special case of attention-based architectures. In the early days, around 2015, image generation was dominated by Generative Adversarial Networks (GANs), an innovation by Ian Goodfellow. GANs pit a generator against a discriminator, a setup underpinned by game theory, to learn the primary features of faces or objects and subsequently produce countless images.

However, GANs lost popularity due to their instability, training challenges, and scaling issues. A new model type, the diffusion model, emerged as the frontrunner. This approach involves adding noise to an image during training and generating images from noise during inference. These models, among others, work by uncovering a distribution's latent variables. One can draw a parallel with Plato's allegory of the cave to understand latent variables.
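The "adding noise" half of the process is simple enough to sketch. The linear blending schedule below is made up for illustration and is not the schedule used by real diffusion models such as DDPM, and the hard part – the learned denoising network that reverses the process – is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, t, num_steps=1000):
    """Forward diffusion sketch: blend the image toward pure Gaussian noise as t grows."""
    keep = 1.0 - t / num_steps                 # naive linear schedule (illustrative only)
    noise = rng.normal(size=image.shape)
    return np.sqrt(keep) * image + np.sqrt(1.0 - keep) * noise

image = rng.random((8, 8))                     # toy 8x8 "image"
slightly_noisy = add_noise(image, t=50)
almost_pure_noise = add_noise(image, t=950)
# Training teaches a network to predict the added noise, so generation can run in reverse:
# start from pure noise and denoise step by step into a brand-new image.
```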

In The Republic, Plato tells a story of prisoners who have been chained inside a cave their entire lives, facing a blank wall. They cannot turn their heads and only see the shadows cast on the wall from objects behind them, which are illuminated by the light of a fire. These prisoners believe these shadows are the only reality, as they have never seen the outside world or the objects casting the shadows. One day, a prisoner is freed and sees the real objects and eventually the outside world, realizing his limited perspective in the cave.

Figure 16

Latent variables can be thought of as the unseen objects casting shadows on the wall. In many generative models, such as GANs, we never observe these latent variables directly; we only observe the data they generate, much as the prisoners only see the shadows (the projections or manifestations of the real objects). For a cat-generation task, for instance, we observe the various cats but not the underlying features that make a cat a cat.

When we train a model, we're trying to infer or learn about these latent variables based on the observable data, much like how the freed prisoner learns about the natural world outside the cave by leaving it. Learning these latent variables is crucial because they capture the underlying, hidden structures or features in our data that generate the observable patterns.

The revolution was Xed.

To some Deep Learning practitioners, intelligence seems to be a function, and by refining a neural network's weights and structure, they seek to simulate that function. This quest is pursued without a scientific understanding of intelligence itself. A good description was given years ago by the linguist Noam Chomsky. He explained that if we studied the physics of climate, we could predict the weather through physical simulation models and measurements (the way it is usually done). Another way would be to observe, day after day, the weather at a particular place and then build a purely predictive model from those observations. He said this to express skepticism about machine learning models of intelligence, a skepticism he voiced again in an essay published in the New York Times.

Noam Chomsky seems vindicated in his judgment of artificial neural networks, and specifically of probabilistic models of AI, whenever we chat with a Large Language Model (LLM) and it hallucinates an answer. Long-time users have become more familiar with these models' quirks than with people's. Indeed, a persistent failure of these models is on math and other higher-reasoning tasks: while they are excellent at System 1 thinking, they have real System 2 limitations (this author believes the old-fashioned ways might still have their say on System 2 thinking). However, it is undeniable that with the release of ChatGPT, we reached a significant milestone in creating intelligence.

The journey from a conference to a tweet is also a testament to the challenges faced, from data biases to ensuring responsible outputs. The future, hopeful and scary, promises even more advanced and refined AI models. The journey is undoubtedly impressive: from Dartmouth to OpenAI, through Google and Canadian university research. This AI boom is for real this time around. And this future is not only in the hands of Big Tech. With the release of LLaMA, an open model from Meta, in 2023, and the multiple open models that followed, open-source large language models have democratized the landscape. Even though the current open models are not as powerful as the closed-source models from OpenAI and Anthropic (an offshoot founded by former OpenAI researchers and now funded by Amazon), they promise to level the field and, if nothing else, allow the gathering army of Data Scientists to sharpen their skills and offer the world new products.

Figure 17

In closing, we note the risks, most notably the singularity, or Strong Artificial Intelligence (when AIs become as smart as people, and then smarter). We still do not know much about this technology when we peek under the hood to see the engine: how does well-understood math produce ill-understood intelligence? And with the unknown comes fear. If Strong AI is reached and a malevolent entity comes out the other end, could it spell the end of the human race? Some AI pioneers, such as Geoff Hinton, have started to raise existential alarms about AI. But Pandora's box is open. I am optimistic about the technology, but, within reason, safety and regulation are needed to ensure we are well prepared for what comes next.

Figure 18

ChatGPT and Grammarly were used for review and some editing. The images are referenced below. This is just a small part of the learning needed to understand Deep Learning and GenAI. A book I am currently reading that gives a more in-depth treatment is "Dive into Deep Learning."

Keep learning, never stop growing!

Figure Sources:

1- DALL·E 3

2- https://www.dhirubhai.net/pulse/weight-bias-ann-bhupendra-kumar-tak/

3- https://www.build-electronic-circuits.com/xor-gate/

4- https://medium.com/swlh/fully-connected-vs-convolutional-neural-networks-813ca7bc6ee5

5- https://machinethink.net/blog/the-hello-world-of-neural-networks/

6- https://towardsdatascience.com/mastering-logistic-regression-3e502686f0ae

7- https://ai.plainenglish.io/l1-and-l2-regularization-lasso-and-ridge-in-linear-regression-a83b6fe07bf8

8- https://raihanrnj.medium.com/deep-learning-simple-image-classification-using-convolutional-neural-network-dog-and-cat-8c99aef29e8

9- Twitter, X or Musk Social ;-)

10- https://arxiv.org/abs/1706.03762

11- https://www.researchgate.net/figure/Visualization-of-the-word-embedding-space_fig4_343595281

12- https://jalammar.github.io/illustrated-transformer/

13- https://www.mdpi.com/1422-0067/24/3/2814

14- https://medium.com/@chassweeting/the-state-of-gpt-by-andrew-kaparthy-fad2f007c1b9

15- https://www.youtube.com/watch?app=desktop&v=qv6UVOQ0F44

16- https://avi-loeb.medium.com/interstellar-interpretation-of-platos-cave-allegory-b74f24b1a5c

17- DALL·E 3

18- DALL·E 3
