Foundational Principles of Deep Learning – my notes
Ajay Taneja
Senior Data Engineer | Generative AI Engineer at Jaguar Land Rover | Ex - Rolls-Royce | Data Engineering, Data Science, Finite Element Methods Development, Stress Analysis, Fatigue and Fracture Mechanics
1. Introduction
As I mentioned some weeks ago, I will be publishing a series of blogs going into the finer details of the Transformer architecture. Each part of the series will offer theoretical insight as well as a coding perspective on every unit/concept of the Transformer architecture, as illustrated below. Each part will go into the details of the following:
However, before diving into the above ingredients, I felt that it is essential to document the following:
- Firstly – the foundational principles of Deep Learning
- Secondly – the evolution of Language Models
By revisiting the above two topics in (some) detail, the content related to the Transformer is better understood. Hence, I felt it is worth spending time and effort on these topics before diving into the working of Transformers (because that's how I've been evolving my learning journey until now!).
This article will go into the details of the foundational principles of Deep Learning and will cover the following aspects:
Needless to say, all the content discussed below is a collection of my notes from various courses that I've taken so far, YouTube videos that I've watched and openly available blog posts by other learners. It is documented for my own future reference; however, I'm obviously happy if it benefits anyone reading this content.
2. Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence:
Intelligence means processing information so that it can be used to inform some future decision or action. The field of Artificial Intelligence involves building computer algorithms that do exactly the same thing: process information to inform some future decision.
Machine Learning:
Machine Learning is a subset of Artificial Intelligence that focusses specifically on making a machine learn to do the above from data and experience. Statistical techniques are used to enable machines to improve with experience.
Deep Learning:
Deep Learning is a subset of Machine Learning that uses multi-layered neural networks to extract patterns that occur within the data, so that the network learns to perform tasks which would otherwise require human intelligence.
This article will go into the foundational aspects of Deep Learning which focusses on making the computer learn different tasks directly from the raw data.
3. What does Deep Learning have to offer?
Traditional Machine Learning algorithms typically rely on hand-defined features (patterns) in the data, usually uncovered by a human with expert domain knowledge. The key idea of Deep Learning is that instead of a human defining these features, the machine extracts patterns from the data and uses them to make decisions.
For example, in a face detection algorithm, a deep neural network will learn that in order to detect a face, it first detects lines and edges, which can be combined into mid-level features like corners and curves, which in turn can be combined in deeper layers of the network to form high-level features like eyes, ears, nose, etc., and all of these together make it possible to detect the face.
The learning is hierarchical, starting from the lower layers of the network, as illustrated in the figure below.
4. The Building Block of Deep Learning: The Perceptron
Let us now start with the fundamental building block of every single neural network that one may develop – a single neuron. In deep learning language, a single neuron is called a perceptron. The perceptron takes a set of inputs x1 to xn which are multiplied by the corresponding weights and added together; we also add a bias term, indicated as w0, as shown in the figure below.
Then, we take the single number obtained after the addition and pass it through a non-linear activation function, and that produces the final output of the perceptron, which may be termed ŷ (y-hat), as shown below.
The process is mathematically represented by the equation below:
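The equation appears as an image in the original notes; written out in the notation above (with g denoting the non-linear activation function), the perceptron output is commonly expressed as:

```latex
\hat{y} = g\left( w_0 + \sum_{i=1}^{n} x_i w_i \right)
```

Here w_0 is the bias, x_i are the inputs, w_i the corresponding weights, and g the non-linear activation function.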
Purpose of activation function:
The point of a non-linear activation function is to introduce non-linearities into the model. Almost all real-world data is non-linear in nature; thus, if we want to deal with such datasets, we need models which are also non-linear, so that they can capture these kinds of patterns in the data.
To understand this better, let us say we have a dataset as shown in the figure below:
Suppose, given this dataset, we have to construct a decision boundary, i.e., a boundary separating the red and the green dots in the figure above. Now, if we were to use only a straight line to separate the green and the red points, the best we could do is shown in the figure below:
Thus, the problem cannot be solved effectively using a linear approach and we have to resort to non-linearity, which helps deal with such types of problems. Non-linear activation functions allow us to deal with non-linear data, and this is what makes neural networks so powerful.
Further, it may be underscored that since we’re just multiplying the inputs with the corresponding weights and adding them together, the problem remains a linear problem until we introduce non-linearities using non-linear activation functions.
Types of non-linear activation functions:
Some types of activation functions include,
- Sigmoid activation function
- Tanh (hyperbolic tangent) activation function
- ReLU activation function
These are illustrated in the figure below:
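Since the figure is not reproduced here, a minimal NumPy sketch of these three activations is given below; the function names are mine, chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real number into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```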
5. Perceptron to Neural Network
Continuing with the above discussion, let's now take the (single) perceptron and build something more interesting!
Now, suppose we want 2 outputs from the function. We simply add another perceptron – this second perceptron will have its own set of weights, and each perceptron controls its own output.
Further, such perceptrons can be stacked to form a single-layered neural network as shown below:
A deep neural network can be built by stacking more such sequential layers as shown in the figures below:
With this illustration, we can interpret the network as transforming the given inputs into a new space whose values are closer to the output we want. This transformation has to be learnt, and that learning is described in the next section, which pertains to the loss function (or objective function).
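To make the stacking idea concrete, here is a minimal NumPy sketch of a single dense layer of perceptrons and of two such layers stacked in sequence; the layer sizes are arbitrary and chosen only for illustration:

```python
import numpy as np

def dense_layer(x, W, b, activation):
    # Each output neuron: weighted sum of the inputs plus a bias, then a non-linearity
    return activation(W @ x + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                      # 3 input features (arbitrary)

# Hidden layer: 4 perceptrons, each with its own weights and bias
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
# Output layer: 2 perceptrons stacked on top of the hidden layer
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

h = dense_layer(x, W1, b1, sigmoid)         # hidden representation
y_hat = dense_layer(h, W2, b2, sigmoid)     # 2 outputs
print(y_hat)
```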
6. Training a neural network
Loss Function:
Having constructed the neural network (single or multi-layered), if we just start using it – with random values of weights – to predict the output, the network will not predict correctly because it has not yet been trained. The network has no information about the world or the problem!
To train the network, we will have to construct the loss function which will tell us how far the predicted output is from the actual output. The loss of the network measures the cost incurred from incorrect prediction. The loss function is also termed as the objective function or the cost function or empirical loss and is a measure of the total loss over the entire dataset. Mathematically, the loss function is expressed as follows:
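The equation itself is an image in the original notes; consistent with the description above, the empirical loss over a dataset of n examples (with predictions f(x⁽ⁱ⁾; W) and targets y⁽ⁱ⁾) is usually written as:

```latex
J(\mathbf{W}) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\!\left( f\left(\mathbf{x}^{(i)}; \mathbf{W}\right),\; y^{(i)} \right)
```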
As may be noticed from the equation above, the loss is a function of the predicted output – which in turn depends on the inputs and the weights – and the actual output.
Minimizing the Loss:
Training the neural network will not only involve determining how far the predicted output is from the actual output but also minimizing the loss. Thus, mathematically, we want to find the network weights that result in the smallest possible loss over the entire dataset. The mathematical equation is represented as follows:
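Again, the equation is an image in the original; a standard way of writing it, using the loss defined above, is:

```latex
\mathbf{W}^{*} = \underset{\mathbf{W}}{\arg\min}\; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\!\left( f\left(\mathbf{x}^{(i)}; \mathbf{W}\right),\; y^{(i)} \right)
```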
Cross entropy loss:
For a binary classification problem, the loss function employed is the cross-entropy loss, denoted below:
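The formula is shown as an image in the original; for a binary label y⁽ⁱ⁾ ∈ {0, 1} and a predicted probability f(x⁽ⁱ⁾; W), the binary cross-entropy is usually written as:

```latex
J(\mathbf{W}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f\!\left(\mathbf{x}^{(i)}; \mathbf{W}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - f\!\left(\mathbf{x}^{(i)}; \mathbf{W}\right)\right) \right]
```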
Mean squared error loss:
Mean squared error loss can be used for regression models that output continuous real numbers; it is denoted below:
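Written out (the original shows it as an image), the mean squared error over the dataset is commonly expressed as:

```latex
J(\mathbf{W}) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f\!\left(\mathbf{x}^{(i)}; \mathbf{W}\right) \right)^{2}
```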
Loss optimization: How to minimize the loss?
The loss function is a function of the weights – for a network with just two weights, the loss landscape can be visualized as a surface, as follows:
In the above landscape, we want to find the weights that correspond to the lowest point, i.e., the smallest loss.
This is done mathematically through the following steps:
The above algorithm is formally termed gradient descent. The steps in the Gradient Descent algorithm may be highlighted as follows (a minimal code sketch is given after the steps):
1) Compute gradient
2) Update weights in the direction opposite to that of the gradient
The weights are updated in the direction opposite to that of the gradient. The parameter η is a small step that we take in the direction opposite to that of the gradient and is commonly termed as the “learning rate”.
3) Return the weights.
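Putting the three steps together, a minimal NumPy sketch of the gradient descent loop might look like the following. The loss and its gradient here are hypothetical stand-ins (a simple linear model with mean squared error); in a real neural network the gradient would come from backpropagation, discussed next:

```python
import numpy as np

def loss(w, X, y):
    # Stand-in loss: mean squared error of a linear model
    return np.mean((X @ w - y) ** 2)

def gradient(w, X, y):
    # Gradient of the mean squared error with respect to the weights
    return 2.0 * X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, eta=0.1, n_steps=1000):
    w = np.random.randn(X.shape[1])        # start with random weights
    for _ in range(n_steps):
        grad = gradient(w, X, y)           # 1) compute the gradient
        w = w - eta * grad                 # 2) step opposite to the gradient (eta = learning rate)
    return w                               # 3) return the weights

# Toy usage: recover the weights of a known linear relationship
X = np.random.randn(100, 2)
y = X @ np.array([3.0, -1.0])
print(gradient_descent(X, y))
```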
7. Backpropagation:
The process of computing the gradient is termed as Backpropagation.
Mathematically, for a single-layered neural network with two neurons as shown below, the gradient is computed using the chain rule of differentiation – working backwards from the loss function through the output – as follows:
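The worked derivation is a figure in the original; schematically, for a weight w₁ feeding a hidden activation z₁, which in turn feeds the output ŷ, the chain rule gives:

```latex
\frac{\partial J(\mathbf{W})}{\partial w_1} \;=\; \frac{\partial J(\mathbf{W})}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}
```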
The algorithm of backpropagation is decades old, and the paper (1986) can be found here: https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf
It must be underscored that the landscape of the cost function involved in a deep neural network is far more complicated than the one shown above!
Setting the learning rate: η
Setting the learning rate can have very large consequences when training the neural network. If the learning rate is too small, the journey to the lowest point in the landscape is too slow (convergence is slow), whereas if the learning rate is too high, the updates might overshoot the global minimum, as intuitively illustrated below.
In practice an adaptive process is followed wherein the learning rate "adapts" to the landscape. When we say "adapt", it means the learning rate can be made smaller or larger depending on the local shape of the loss landscape (for example, how large the gradients are and how fast learning is proceeding).
More details can be found here: https://www.ruder.io/optimizing-gradient-descent/
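As a simple illustration of the idea (not any particular algorithm from the link above), the fixed learning rate in the earlier gradient descent sketch could be replaced by a decaying schedule; adaptive optimizers such as Adam or RMSprop go further and scale the step size per weight:

```python
def decayed_learning_rate(eta0, step, decay=0.001):
    # Simple inverse-time decay: large steps early in training, smaller steps later
    return eta0 / (1.0 + decay * step)

# Inside the training loop (sketch, reusing the earlier gradient function):
# for step in range(n_steps):
#     eta = decayed_learning_rate(0.1, step)
#     w = w - eta * gradient(w, X, y)
```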
8. Regularization:
Dropout
In the case of Neural Networks, regularization is typically done using "Dropout". With dropout, during training we randomly select some subset of neurons in the network and drop (prune) them with some probability, effectively turning neurons on and off at different iterations of training.
This essentially forces the neural network to learn an "ensemble" of different models, and prevents it from relying too heavily on any single neuron or pathway. This turns out to be a very powerful technique and helps the network generalize better.
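A minimal sketch of ("inverted") dropout applied to one layer's activations during training might look like this; the 0.5 drop probability is just an example value:

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    # During training, randomly zero out each neuron with probability p_drop
    # and rescale the survivors so the expected activation stays the same.
    if not training:
        return activations                 # at test time all neurons are kept
    mask = np.random.rand(*activations.shape) > p_drop
    return activations * mask / (1.0 - p_drop)

h = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout(h))                          # a different subset is dropped on each call
```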
Early Stopping
The next regularization technique often practised for Neural Networks is "Early Stopping". Here the Data Scientist will normally plot the performance of the network on the training and the test data. As the network is trained, one would notice both the training and the test set loss decrease, but a stage is reached where the training error continues to decrease while the test set error begins to increase. It is at this point that the model begins to overfit, and it is at this point that one would want to stop the training process; otherwise the model will learn the training data very precisely but not perform well on unseen data (overfitting).
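A sketch of early stopping as a training-loop pattern is shown below; train_one_epoch and evaluate are hypothetical placeholders for whatever training and validation routines are actually being used:

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    """Stop training once the validation loss has not improved for `patience` epochs.

    `train_one_epoch` and `evaluate` are caller-supplied functions (placeholders here)
    that run one epoch of training and return the current validation loss.
    """
    best_val_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()                       # one pass over the training data
        val_loss = evaluate()                   # loss on held-out (validation/test) data
        if val_loss < best_val_loss:
            best_val_loss = val_loss            # still improving: keep training
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1     # no improvement this epoch
            if epochs_without_improvement >= patience:
                print(f"Stopping early at epoch {epoch}: validation loss no longer improving")
                break
```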
I have discussed Regularization in (some) detail in an earlier blog here: