Predictive Modeling - ML & DS

Let's first understand the motive of this article.

If you are an aspirant of Data Science, Artificial Intelligence, or Machine Learning, this article is going to be an eye-opener for you in the field of Predictive Modeling.

A very easy, from-scratch implementation of Data Science & Machine Learning that will help you understand what prediction actually means!

Note that this is going to be very easy, but easy problems are not what DS & ML were built for!

From the 1950s to 2024 and counting, this field is still under active study; it holds so much unimaginable potential that it continues, and will surely continue, to astonish our minds.


Clarity Matters!

If you are new to my LinkedIn, missed my previous article, or have little to no idea what the term Data Science actually means, you may first read my previous article: Link


Not to bore you, let's start directly with the implementation.


Excel Sheet 1: The Training Data


train_data.xlsx

On giving the data a quick look, you can easily see that it is nothing but a sum, i.e. col1 + col2 = col3. So if I give you new values for col1 and col2, will you be able to give the answer for col3?

Let's say col1 = 2 and col2 = 12; you will say col1 + col2 => 2 + 12 => 14.

You answered it because you knew the relation between col1 and col2: say col1 is x, col2 is y, and col3 is z; the relation established is f(x, y) = x + y = z.
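That known relation can be written down directly as a tiny function (the names f, x, and y are just illustrative):

```python
# The relation we inferred by inspection: col3 = col1 + col2
def f(x, y):
    """Return z = x + y, the relation between col1, col2 and col3."""
    return x + y

print(f(2, 12))  # 14, matching the worked example above
```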

But what if I give you data for which you are unable to build a relation? Like the following:


Now if I say the new value for col1 is 2 and for col2 is 12 (same as before), can you tell me the value of col3? Surely you won't be able to (or only with intense difficulty). But a Predictive Model can! We'll see how.

Now let's talk about what train_data.xlsx actually is and what its purpose is.

The data you can see in train_data.xlsx is the dataset on which the Predictive Model will learn. Now the question is: on what basis?

Our Predictive Model will work with two input columns and one output column:

  1. Input 1: col1
  2. Input 2: col2
  3. Output 1: col3

import pandas as pd

# Load the training sheet, then split features and labels
train_data = pd.read_excel('train_data.xlsx')
X_train = train_data[['col1', 'col2']]  # Inputs 1 and 2: X_train
y_train = train_data['col3']            # Output 1: y_train

Understand in Easy Language:

In this predictive model, the model doesn't know that col3 is just the sum of col1 and col2. Instead of directly adding col1 and col2 together, the model tries to learn how col3 is related to col1 and col2 by looking at lots of examples in the training data. It figures out some rules or patterns between col1 and col2 that help it predict col3.

These rules are not the same as just adding the two numbers; the model looks for the best possible way to combine col1 and col2 to get as close as possible to col3.

So, even though the actual relationship is simple addition, the model doesn't know that—it's just trying to find a pattern that works well based on the data it sees, without understanding the exact operation like we do.

Understand in Slightly Technical Language:

In the predictive model provided, the model does not inherently know that col3 is simply the sum of col1 and col2. Instead, it tries to understand the relationship between the input features col1 and col2 and the target col3 by finding the best-fit line that minimizes the error between predictions and actual values.

The Linear Regression model achieves this by determining weights for col1 and col2 such that their weighted combination best approximates col3. The model, therefore, learns that col3 is related to col1 and col2 in a specific manner by optimizing these weights during training.

Although, in this case, the underlying relationship is a simple sum, the model itself approaches it as a general problem of predicting col3 based on the input values without assuming any specific mathematical operation like addition.

This approach allows machine learning models to find relationships for more complex and non-linear functions, even if the true relationship happens to be straightforward, as in this case.
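To make this concrete, here is a minimal sketch of that training step. Since train_data.xlsx isn't reproduced here, the eight rows below are hypothetical stand-ins that follow the same col3 = col1 + col2 pattern:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for train_data.xlsx: eight rows where col3 = col1 + col2
X_train = np.array([[1, 2], [3, 4], [5, 1], [2, 2],
                    [6, 3], [4, 4], [7, 2], [8, 1]])
y_train = X_train[:, 0] + X_train[:, 1]

model = LinearRegression()
model.fit(X_train, y_train)

# The learned weights land close to 1 and 1, and the intercept close to 0,
# because the best linear combination of col1 and col2 here is their sum.
print(model.coef_, model.intercept_)
```

The model never "sees" the plus sign; it simply finds the weights that minimize the prediction error, and in this case those weights happen to recover the sum.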

This was an overview of what's going to happen; now let's understand the other dataset!

Excel Sheet 2 : The Testing Data


test_data.xlsx

This is the testing dataset, in which only two columns are given; the model trained on train_data.xlsx will try to predict the col3 value for each row.

Line that starts predictions after the learning phase

When the model begins predicting on the data displayed above, it will not just add the columns and produce the result; instead it will bring forward its learning and, on the basis of that, try to predict the values for the data shown above.

It can easily be seen that for the first row, with col1 = 5 and col2 = 5, we know the result is 10. But you may be surprised to learn that when the predictive model predicts the answer for that same row, the result may be exactly 10 or very near to 10; it can even be 10 ± 0.000000001. The accuracy depends on how much data the model is trained on and whether it has established a relation very close to the actual one, which is the sum.
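A minimal sketch of that prediction step (again with hypothetical stand-in training rows, since the Excel files aren't reproduced here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in training rows following col3 = col1 + col2
X_train = np.array([[1, 2], [3, 4], [5, 1], [2, 2],
                    [6, 3], [4, 4], [7, 2], [8, 1]])
y_train = X_train[:, 0] + X_train[:, 1]

model = LinearRegression().fit(X_train, y_train)

# Predict col3 for unseen (col1, col2) pairs, as is done with test_data.xlsx
for row in [[5, 5], [25, 75]]:
    print(f"Input: {row}, Predicted col3: {model.predict([row])[0]}")
```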

Let's see the actual prediction done by the Predictive Model

This Predictive Model was trained on only 8 rows of data; note that real-world training data can grow vast, even up to trillions of tokens.

8 rows were enough here because the relation between col1, col2, and col3 is very simple.
Output


Input: [6, 4], Predicted col3: 9.999999999999998
Input: [4, 8], Predicted col3: 11.999999999999998
Input: [7, 6], Predicted col3: 12.999999999999998
Input: [8, 4], Predicted col3: 12.0
Input: [9, 7], Predicted col3: 15.999999999999998
Input: [4, 8], Predicted col3: 11.999999999999998
Input: [1, 9], Predicted col3: 9.999999999999996
Input: [2, 4], Predicted col3: 5.999999999999998
Input: [3, 2], Predicted col3: 4.999999999999998
Input: [12, 12], Predicted col3: 24.0
Input: [100, 250], Predicted col3: 350.0
Input: [25, 75], Predicted col3: 99.99999999999999
Input: [45, 55], Predicted col3: 100.0
Input: [75, 25], Predicted col3: 100.00000000000001        

Take one of the predictions:

Input: [5, 5], Predicted col3: 9.999999999999998        

For col1 = 5 and col2 = 5, it predicted 9.999999999999998, which we know is really close to the actual result 10.


Taking another prediction, row 12 of the output:

Input: [25, 75], Predicted col3: 99.99999999999999        

For col1 = 25 and col2 = 75, it predicted the output 99.99999999999999, which we know is really close to the actual result 100


The thing to understand is that the model is not actually adding col1 and col2; instead it predicts the answer from its previous learning, as explained in the earlier section.

That is why 5 and 5 does not give exactly 10 but something very near to 10: the learned weights carry tiny floating-point errors, so the prediction is extremely precise yet not exactly equal to the true value.
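A tiny sketch of that floating-point effect, using only standard Python:

```python
# The prediction differs from 10 only by a minuscule floating-point error
pred = 9.999999999999998
print(abs(pred - 10))  # on the order of 1e-15
print(round(pred, 6))  # 10.0 once rounded
```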


Exceptional Predictions

Input: [8, 4], Predicted col3: 12.0
Input: [12, 12], Predicted col3: 24.0
Input: [100, 250], Predicted col3: 350.0
Input: [45, 55], Predicted col3: 100.0        

Explanation: these results came out exactly equal to the actual answers, rather than merely close, because for those particular inputs the tiny floating-point errors happened to cancel out; it is essentially a coincidence of the arithmetic.

Note that the above explanation is simplified, just to give you a pass for now; when you dive deeper you'll learn the precise reason.


Now, some graphs to get a feel for how this can be visualized.


Linear Regression Model Visualization

In the graph, the blue dots represent the data points, which means each dot is showing a pair of values for col1 and col2. These points are like coordinates on the graph that show where our data lies.

The blue line you see is called the Line of Regression. This line tries to find the best way to fit through the data points. At first, the line starts in an arbitrary position, and then it moves and rotates, adjusting the slope (m) and intercept (c), which control the angle of the line and where it crosses the y-axis. This line represents the equation y = mx + c, a simple formula for a straight line.

The line’s goal is to find the position where it is the closest to as many data points as possible. In other words, the line tries to be as close as it can to all the blue dots, which helps the model make good predictions. This process is what we call fitting the line to the data. Once the line has found its best position, it represents the pattern that the model has learned from the training data.
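The fitting described above can be sketched in one dimension with NumPy's least-squares polynomial fit (the sample points below are illustrative, not the article's data):

```python
import numpy as np

# Points that lie exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# A degree-1 least-squares fit recovers the slope m and intercept c
m, c = np.polyfit(x, y, 1)
print(m, c)  # approximately 2.0 and 1.0
```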


Let's see what the Line of Regression looks like for our Predictive Model.


Linear Regression 3D Views

Suggestions on which programming language to use for Data Science & ML

  1. Python: Python is one of the most popular programming languages for Data Science due to its simplicity, readability, and extensive libraries for data manipulation, analysis, visualization, and machine learning. Libraries like Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, TensorFlow, and PyTorch make Python a versatile choice for various Data Science tasks. Moreover, Python has a large and active community, making it easy to find resources, tutorials, and support.
  2. R: R is specifically designed for statistical computing and graphics, making it an excellent choice for statisticians and researchers in academia and industry. R's vast collection of packages (e.g., ggplot2, dplyr, tidyr, caret) provides comprehensive tools for data manipulation, visualization, and statistical analysis. R's built-in support for statistical modeling and its interactive environment make it a powerful tool for exploratory data analysis and statistical research.

I personally recommend starting with Python; after gaining some knowledge, understanding, experience, and confidence, you can switch to another programming language based on your requirements and considerations.

GitHub link to the same project:

  1. Open Predictive Learning Model 1 directory


Conclusion

This is not the only article on Predictive Modeling - ML & DS; there will be more articles on the same topic explaining the further and leftover topics.


Appreciation

A lot of applause to you for reading this article. I hope it inspired you to enter this field; I welcome you wholeheartedly and will help you on every climb.


Consider Connecting

  1. LinkedIn Profile
  2. Email: [email protected]
  3. GitHub Profile


Thank You

