A linear regression story!
This was originally posted to my blog on 5 Feb 2022.
"If you can’t explain it simply you don’t understand it well enough!" - A quote usually attributed to Albert Einstein without a proper citation.
I like the above quote a lot because there were several times in my life when I took pride in learning some new concept but later fumbled while trying to explain or teach it to others. After contemplating the wisdom of this quote during such times, it became clear to me that I had been too hasty in concluding that I understood the concept. Even one of the most famous theoretical physicists of all time, Dr. Richard Feynman, adopted this principle as part of his renowned ‘Feynman Technique’, which shows how the ability to distill a complex and intricate theory into a simple form is vital for learning.
As humans, we are blessed with the wonderful endowments of cognition and creativity, which aid our learning even when we apply no particular technique explicitly. But when it comes to machines, I feel this technique of simplifying complex things in order to learn them has contributed enormously to the branch of machine learning. Many machine learning algorithms widely used today were developed with generalizations and assumptions that enable them to work effectively, albeit with a few limitations. Linear Regression is one such ML algorithm that beautifully simplifies the data to learn from it.
So, what exactly is this Linear Regression?
Well, it is as simple as the straight-line equation we learned in high-school algebra. Yes, I am talking about the classic ‘y = mx + c’ equation.
In the context of machine learning, this equation is slightly modified and presented as below:
Y = B + W*X
The slope of the line ‘m’ is replaced by something called a ‘weight’ (W), and the y-intercept is replaced with a ‘bias’ term (B). To see why these rephrased terms matter in machine learning, we first have to take a sneak peek into a linear regression ML model.
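To make this concrete, here is a minimal sketch in Python (the function name and parameters are my own illustration, not any standard API):

```python
def predict(x, w, b):
    # The linear model: multiply the feature by the weight
    # and add the bias, i.e. y = b + w * x
    return b + w * x
```

That is really all there is to the model itself; the interesting part is how the machine finds good values for W and B.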
Let us consider the simple and straightforward problem of predicting a person's salary from their age. Suppose our sample dataset has four people, aged 15, 20, 25, and 30, earning salaries of $1200, $1600, $2000, and $2400 respectively, and let us build a model on it.
So far, we have defined a problem and made up a simple dataset to work with. The next thing we need to do is split our dataset into what are called ‘features’ and ‘labels’.
To understand features and labels intuitively, let us go back to our favorite two-dimensional straight-line equation. In ‘y = mx + c’, we calculate the value of the variable ‘y’ by performing some arithmetic on the variable ‘x’ and the constants ‘m’ and ‘c’. The value of y is therefore ‘dependent’ on x, which lets us identify y as the ‘dependent variable’ and x as the ‘independent variable’. In machine learning, the independent variables are called features and the dependent variable is called the label.
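In code, this split is just two parallel arrays. A sketch using the sample numbers assumed above:

```python
import numpy as np

# Features (independent variable): the ages of the four people
X = np.array([15, 20, 25, 30])

# Labels (dependent variable): the salaries we want to predict
y = np.array([1200, 1600, 2000, 2400])
```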
Now, let us plot our dataset in a two-dimensional system, with age on the x-axis and salary on the y-axis. All four points fall neatly along a straight line; the plot reminds me of the alignment of the Sun, Earth, and Moon during a lunar eclipse.
Now we will do something called ‘fitting a model’ to this data, which here is nothing but fitting a straight line through these 4 points. The equation of such a line would be:
y = (80)*x + 0, as slope m = (1600-1200)/(20-15) = 80 and y-intercept c = 1600 - 80*20 = 0.
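That slope-and-intercept arithmetic, spelled out as a small Python sketch (using the sample points assumed above):

```python
# Slope: rise over run between two points on the line
m = (1600 - 1200) / (20 - 15)  # = 80.0

# Intercept: rearrange y = m*x + c into c = y - m*x
c = 1600 - m * 20              # = 0.0

print(f"y = {m}*x + {c}")      # y = 80.0*x + 0.0
```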
Next, if we want to predict the salary of an individual whose age is 45 with this model, all we have to do is substitute the values as below:
y = (80)*45 + 0 = $3600
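If you would rather let a library do the fitting, scikit-learn's LinearRegression recovers the same weight and bias; here is a sketch under the same assumed sample data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[15], [20], [25], [30]])  # features must be 2-D: one row per sample
y = np.array([1200, 1600, 2000, 2400])

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # weight ~ 80.0, bias ~ 0.0
print(model.predict([[45]]))             # ~ [3600.]
```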
Hence, a simple linear regression ML model tries to fit a straight-line equation to the data by computing two parameters: a weight (the slope) and a bias (the y-intercept). For new, unseen data, the model simply substitutes the new features, along with the learned weight and bias, into the equation to calculate the label ‘y’.
This simple linear regression concept is further extended to build a few other regression algorithms. As we might have figured out by now, a regression ML problem deals with the prediction of some continuous variable like population or salary. But how do we go from here to problems like cancer prediction, where the label is not a continuous numeric value? In these problems the labels are usually called ‘categorical’, as they consist of categories (like benign or malignant), and such problems are usually called ‘classification problems’.
In this post, we have seen that a linear regression ML model builds a simple ‘formula’ with which it performs some calculations to output the required value. But can we apply the same technique to predict the class of a label? Can computing w*x + b tell us whether a tumor is benign or malignant? The answers lie in ‘logistic regression’, which I will discuss briefly in my next post.
Also, we need to have a small talk about the simplifications we assumed in the example model we built in this post. Firstly, we dealt with a single feature, ‘age’, but actual regression problems can involve several features (sometimes the count runs into the hundreds!). In such cases we no longer have a nice two-dimensional system with a simple straight line: in these multi-dimensional systems, the linear model becomes a linear plane (or hyperplane), which may be difficult to visualize, but the linear regression algorithm still does not stray from its principle of simplicity. To see what I mean, assume our problem has three features, age, experience, and skill-rating, from which we need to predict the salary. The algorithm will now calculate a weight for each of these features and build the model as below:
Salary = Wa*(age) + We*(experience) + Ws*(skill-rating) + b, where Wa, We, and Ws are the weights calculated by the algorithm for the respective features.
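As a sketch (the weight and feature values below are made up purely for illustration, not learned from any real data), the multi-feature prediction is just a weighted sum:

```python
import numpy as np

def predict(features, weights, bias):
    # Multiple linear regression: a weighted sum of the features plus the bias,
    # i.e. salary = Wa*age + We*experience + Ws*skill_rating + b
    return np.dot(weights, features) + bias

# Hypothetical learned parameters (Wa, We, Ws) and bias b
weights = np.array([50.0, 120.0, 30.0])
bias = 200.0

# One person's features: age, years of experience, skill rating
person = np.array([30, 5, 8])
print(predict(person, weights, bias))  # 50*30 + 120*5 + 30*8 + 200 = 2540.0
```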
Another simplification we assumed is the nice, linear dataset itself. In the real world, data is often filled with noise, which causes it to deviate from a straight line or a linear plane. A 100% accurate ML model is therefore usually impossible, so an ML algorithm tries to find the model that comes closest to representing the actual data. How that is done is a story for another day, but to finish where we started: the story behind ML algorithms is yet another piece of evidence for how widely the quote at the beginning of this article applies!