How AI learns

Assuming that you have already read the first article, Introduction to Artificial Intelligence, we will now see how AI learns, in simple terms and using only basic math.

We learned in the first article about datasets, supervised and unsupervised learning, and two classical problems of labeled datasets: regression and classification.

We will now use a very simple regression problem and a labeled dataset with only one feature X (the independent variable), 7 observations, and, for each observation, the true value Y (the dependent variable). With only one feature, the data can be easily plotted on a graph (see Table 1 and Figure 1).

Table 1
Figure 1

Before delving into the learning process, we need to say that, as a first step, a data scientist would split the dataset into two parts (even though, with such a limited number of observations, this is somewhat artificial): the first part is used to train the model (training data) and the second part is used to test it (testing data). What we are truly interested in is the performance of the model on new data, so we train the model on the first part of the dataset and test it on the second part.
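To make the idea concrete, here is a minimal sketch of such a split in Python, assuming a made-up 7-observation dataset and the scikit-learn library (neither of which the article prescribes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one feature X and, for each observation, the target Y
X = np.array([[-4.0], [-2.0], [-1.0], [1.0], [2.0], [3.0], [4.0]])
Y = np.array([-190.0, -25.0, -3.5, 3.2, 24.8, 80.1, 195.0])

# Hold out 2 of the 7 observations as testing data; train on the other 5
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=2, random_state=0)
```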

A model that performs perfectly on the training data but poorly on the test data is overfitting; it is a mistake to develop such a model, because what we need is a model that works well on new data.

Let's start by randomly extracting 5 observations from this dataset as training data (Table 2).

Table 2

From Figure 1, it is clear that there is no straight line (Linear Regression) that can approximate the points, nor a quadratic function.

So, we will try a cubic function of the X variable, f(X) = K * X^3, with a single parameter K and without a bias term (a second parameter that would be added to K * X^3).

The learning process, in this simple case, amounts to the search for the best value of K, so that the curve is as close as possible to the points shown in Figure 1. Of course, we could solve the problem analytically and find the best K directly, but this is not possible for a real AI model.
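Rendered as code, the whole model is a one-line function of a single parameter (a sketch of my own; the article itself works with pen and paper):

```python
def f(x: float, k: float) -> float:
    """Cubic model with a single learnable parameter K and no bias term."""
    return k * x ** 3

print(f(2.0, 1.0))  # 8.0: the prediction for X = 2 when K = 1
```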

At the start of the learning process, we pick a random K (e.g., K = 1) and we measure how far the predictions Ŷ are from the real values Y (targets): Errors = Ŷ - Y, see Table 3.

Table 3

The function f(X) = 1 * X^3 is not perfect, because there are significant errors (differences between predictions and real values), particularly at the extreme values (see Figure 2), but the shape of the curve looks promising for approximating the blue dots (real values).

Figure 2

We can summarize the errors shown in Table 3 in a single number, our Cost function (or Loss function), which depends on the parameter K; for a regression problem, this number is built from the errors column of Table 3 in such a way that errors in excess and in defect contribute equally to the result.

To be more precise, we take each error value, square it, sum all these numbers, divide by the number of observations, and finally take the square root to obtain what is called the RMSE (Root Mean Square Error), a measure of the accuracy of the model.

If we do the math for K = 1, we have:

(115,5)^2 + (-15,3)^2 + (-7,5)^2 + (-27,1)^2 + (-148,4)^2 = 36.387,56

36.387,56 / 5 = 7.277,51

Square Root(7.277,51) = 85,3
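We can verify this arithmetic in a few lines of Python; only the five error values below come from Table 3, the rest is my sketch:

```python
import math

# Errors (prediction minus target) for K = 1, as listed in Table 3
errors = [115.5, -15.3, -7.5, -27.1, -148.4]

sum_of_squares = sum(e ** 2 for e in errors)  # 36387.56
mse = sum_of_squares / len(errors)            # 7277.512
rmse = math.sqrt(mse)                         # ~85.3
print(f"RMSE for K = 1: {rmse:.1f}")
```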

If we repeat this process for different values of K, we obtain the graph in Figure 3, where the RMSE for different values of the parameter K is plotted.

Figure 3
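The curve of Figure 3 can be reproduced in spirit by sweeping K over a range of values and recording the RMSE for each; since the article does not list the raw X and Y values, the training points below are invented for illustration, chosen so that the minimum also falls near K = 3:

```python
import numpy as np

# Hypothetical training data, roughly following Y = 3 * X^3 plus noise
x = np.array([-4.0, -2.0, 1.0, 2.0, 4.0])
y = np.array([-190.0, -25.0, 3.2, 24.8, 195.0])

def rmse(k: float) -> float:
    """Root mean square error of the cubic model for a given K."""
    errors = k * x ** 3 - y
    return np.sqrt(np.mean(errors ** 2))

for k in np.arange(0.0, 6.5, 0.5):
    print(f"K = {k:3.1f} -> RMSE = {rmse(k):8.2f}")
```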

From Figure 3, we can see that the minimum value of the RMSE is around K = 3; what we need now is an algorithm to find it, and the answer is the Gradient Descent algorithm (the gradient is the generalization of the derivative to functions of several variables).

If we are on the left part of the graph in Figure 3, the curve has a negative slope (gradient) and its steepness decreases until the curve reaches its lowest point; if we are on the right part of the curve (K > 3), the slope is positive.

Gradient descent tells us to move in little steps in the direction opposite to the sign of the slope: if K < 3 we move to the right, and if K > 3 we move to the left, until we reach a point on the curve where the slope is close to zero.

We can visualize this process as a ball rolling in a bowl that, after a few oscillations around the bottom, finally settles at the lowest point.
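For this one-parameter problem, the whole algorithm fits in a short loop. The derivative of the mean squared error with respect to K works out to 2/n * Σ (K·X^3 − Y)·X^3, and the sketch below (reusing the invented data from the previous snippet; the learning rate and step count are arbitrary choices of mine) applies exactly the rule described above:

```python
import numpy as np

# Same hypothetical training data as in the previous snippet
x = np.array([-4.0, -2.0, 1.0, 2.0, 4.0])
y = np.array([-190.0, -25.0, 3.2, 24.8, 195.0])

k = 1.0     # random starting point, as in the article
lr = 1e-5   # learning rate: the size of each little step

for step in range(2000):
    errors = k * x ** 3 - y
    grad = 2.0 * np.mean(errors * x ** 3)  # slope of the MSE at the current K
    k -= lr * grad                         # move against the sign of the slope

print(f"Learned K = {k:.3f}")  # settles close to 3
```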

In a real AI problem, the number of features can be large and the number of observations extremely large, so this process can have a high computational cost.

We can finally plot the curve for K = 3 (Figure 4) to convince ourselves that it is now very close to the observations of the training data.

Figure 4

As a last step, we can calculate the accuracy of the model on the test data (in this case, just the two observations excluded from the training dataset) and double-check that the accuracy is good enough on unseen data as well.


When working on a classification problem, we know from the first article, Introduction to AI, that for each observation we have a target value that is 1 (meaning True) or 0 (meaning False).

The predicted output is True or False for each observation as well, according to a probability value calculated by the AI model.

So, for each prediction, the error will be either 0 or 1, and thus we can't build a differentiable Loss function using the RMSE as we did for the regression problem; this is why, for a classification problem, a different Loss function (based on logarithms and the concept of cross-entropy) is used. Then, using this new Loss function, the same Gradient Descent algorithm finds the minimum.
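For the curious, here is a minimal sketch of such a log-based loss, the binary cross-entropy, computed from predicted probabilities (the sample values are mine):

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident and correct predictions give a small loss ...
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105
# ... while confident but wrong predictions are heavily penalized
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # ~2.303
```

Unlike the 0/1 error, this loss changes smoothly with the model's parameters, so gradient descent can follow its slope.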

To recap, the comparison between a classification and a regression problem is:

1 - the algorithm to find the minimum is the same (Gradient Descent);

2 - the Loss/Cost functions are different;

3 - the measure of the performance of the model must be different as well, because we are interested in how many predictions are correct, not in the distance between predicted and real values as in a regression model.

We close the article by elaborating a little on the third point: the AI model can predict Positive when the actual value is Positive (True Positive), Positive when the actual value is Negative (False Positive), Negative when the actual value is Negative (True Negative), and, finally, Negative when the actual value is Positive (False Negative); the resulting Confusion Matrix is shown in Table 4.

Table 4

A straightforward way to measure the performance of the AI model is the Accuracy, defined as the number of correct predictions (TP + TN) over the total population (TP + FP + FN + TN).

However, this measure is not sufficient in situations where the number of positives is very small compared to the total population; in this scenario, a model that always predicts a negative outcome would score well in terms of accuracy but would be unable to detect the actual positives, as in the example in Table 5.

Table 5

This model has an accuracy of 97% (1001/1031), but it can detect only 1 out of 21 actual positives, which, depending on the application we are working on, can be a serious problem.

This is why additional performance indicators are used, such as:

Precision = TP / (TP + FP) indicates how much we can trust a positive prediction of the model; in this case, 9%.

Recall = TP / (TP + FN) indicates how many of the real positives the model can detect; in this case, less than 5%.
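Putting the three metrics together, with counts reconstructed to be consistent with the figures quoted above (1 actual positive detected out of 21, precision of about 9%, accuracy of 1001/1031; my reading of Table 5):

```python
TP, FP, FN, TN = 1, 10, 20, 1000  # counts reconstructed from the text

accuracy = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(f"Accuracy:  {accuracy:.1%}")   # ~97.1%
print(f"Precision: {precision:.1%}")  # ~9.1%
print(f"Recall:    {recall:.1%}")     # ~4.8%
```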

Well, this is it for now!

We now know that there is nothing magical about the learning process: it is only the search for the minimum of a multivariate function (a function with more than one argument)!

I am eager to know whether you found the content accessible and clear.




