Recalling Linear Regression!
Amal Dhivahar S.
Project Associate @SGBC, IIT Madras | Medical Imaging | Image Processing | Computational Morphometrics | Biotechnology
Recently, I stumbled upon an intriguing course titled "Introduction to Machine Learning" from Great Learning Academy, taught by Dr. Abhinanda Sarkar, which ignited my curiosity about this dynamic field. As someone inherently fascinated by technology, I couldn't resist the urge to dive in and understand the fundamentals.
In one particular module, "Introduction to Machine Learning and Linear Regression," I found myself transported back to my college days, revisiting the statistical concepts I once studied. Surprisingly, the memories of linear regression came flooding back, highlighting its relevance in machine learning.
I wanted to share this experience as a humble reminder of the continuous learning process we all embark on. While my understanding of machine learning is still evolving, revisiting linear regression served as a valuable milestone worth sharing.
Machine Learning from a Novice Learner's POV:
Machine Learning (ML) fundamentally revolves around the capability to execute tasks using underlying models, which are the product of iterative learning processes fueled by vast real-world datasets. ML algorithms meticulously analyze this data to discern patterns encompassing trends, cycles, associations, and more. Subsequently, these patterns are translated into mathematical representations, such as probabilities or polynomial equations. I also learned the difference between the supervised and unsupervised categories of machine learning: supervised methods learn from labelled examples, while unsupervised methods look for structure in unlabelled data.
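To make the supervised vs unsupervised distinction concrete, here is a minimal sketch of my own (not from the course), using scikit-learn and a tiny made-up dataset. The supervised model learns from labelled pairs (X, y); the unsupervised one is handed X alone.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # labels are available

# Supervised: learn the mapping from inputs X to known targets y.
reg = LinearRegression().fit(X, y)
print("prediction for x = 6:", reg.predict([[6.0]]))

# Unsupervised: no labels; look for structure (here, 2 clusters) in X alone.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", labels)
```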
In the sphere of data science, the process of ML unfolds in several critical stages: gathering and preparing the data (steps 1 to 3), building the model (step 4), and then training and evaluating it (steps 5 and 6).
While steps 1 to 3 demand the lion's share, encompassing around 80-85% of the project timeline, step 4 typically consumes approximately 5%. The final phases of model training and evaluation, steps 5 and 6 respectively, account for the remaining 10%. Notably, time-saving resources like a Model-Ready Data (MRD) set expedite the process by providing pre-processed data, facilitating a streamlined entry directly into step 4.
Regression Reminiscence:
“Regression” fundamentally means going back to the mean, and in practice it means predicting a real-valued number. “Linear regression” is a method that models the response as a linear combination of the explanatory variables. This can be expressed as:
response (y) = intercept (β) + slope (α) * explanatory variable (x)
A simple linear regression is always additive in nature. E.g., y = x1 + x2 + x3 represents a linear relationship, whereas y = x1 + x2 * x3 (with a multiplicative term) is non-linear.
∴ when β = 0 and α = 1, y = x; when x = 0, y = β; when α is positive, y increases as x increases; when α is negative, y decreases as x increases. Note that the sign of α gives the direction of the relationship, while its magnitude tells us how steeply y changes with x, not how strong the correlation between them is.
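As a quick numerical illustration (the values of β and α below are made up), the sign of α flips the direction of the relationship:

```python
import numpy as np

x = np.linspace(0, 10, 5)
beta, alpha = 2.0, 1.5

y_pos = beta + alpha * x      # alpha > 0: y rises as x rises
y_neg = beta - alpha * x      # alpha < 0: y falls as x rises

print(y_pos)  # [ 2.    5.75  9.5  13.25 17.  ]
print(y_neg)  # [  2.   -1.75 -5.5  -9.25 -13. ]
```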
Correlation is closely related to covariance, with one small difference (“Same same but different”). Covariance depends on the units of x and y, whereas the correlation coefficient is unit-free, making it easy to express and compare. The coefficient of correlation, a.k.a. Pearson’s coefficient ρ(x, y), can be expressed for a sample set as:
rxy = Cov(x,y) / (σx * σy)
where σ is the standard deviation.
∴ when r is near +1, the correlation between x and y is positive; when r is near -1, the correlation is negative; when r = 0, there is no linear correlation between x and y (other forms might exist). There are models that exhibit the relationship between their variables in a non-linear fashion.
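As a sanity check on the formula above, here is a small sketch (made-up numbers, NumPy as my tool of choice) that computes r from the sample covariance and standard deviations and compares it with NumPy's built-in:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Sample covariance and sample standard deviations (ddof=1).
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

print(r)                        # close to +1: strong positive correlation
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in gives the same value
```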
LPM & LDM Cases:
To understand this better, let's delve into the concepts of the Linear Probabilistic Model (LPM) and the Linear Deterministic Model (LDM). In an LDM, no element of uncertainty exists: for a given value of x, exactly one value of y exists, and if x changes, y has to change correspondingly, so every point lies on the line and the model only has to deal with the regression variation (SSR), with no residual error. This can be expressed as:
y = β + α * x
where β is the y-intercept (the value of y when x = 0), and α is the slope (the change in y when x increases by 1 unit).
Whereas in the case of an LPM, for a given value of x, there could be more than one value of y (not a fixed value), leading to uncertainty. The model has to deal with both SSR and the residual errors (SSE), captured by an error term ε. In simpler terms,
response (y) = intercept (β) + slope (α) * explanatory variable (x) + error (ε)
where ε is the unexplainable error, or the consolidated residual error: for each data point, ε = y − ŷ, the gap between the actual value y and the value ŷ predicted by the line.
Residual errors can be minimized by achieving the best-fit line, which is the regression line, where the error is minimal across all data points.
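Here is a minimal best-fit sketch (made-up data again), using NumPy's least-squares np.polyfit; the residuals are the per-point ε values, and their squared sum is the SSE the fit minimizes:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Degree-1 polyfit returns the least-squares slope (alpha) and intercept (beta).
alpha, beta = np.polyfit(x, y, deg=1)
y_hat = beta + alpha * x       # points on the regression line

residuals = y - y_hat          # per-point unexplained error, epsilon
sse = np.sum(residuals ** 2)   # consolidated residual error (SSE)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, SSE={sse:.4f}")
```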
Significance of R²:
So how do we determine the best fit line?
Answer: using R², the coefficient of determination of a regression model.
When a line is best fit with the least sum of squared errors across all data points:
R² ≈ 1
When R² is closer to 1, we've achieved the best possible fit (ideally). But there is always an ε.
Total variation (SST) = SSE + SSR, so R² = SSR / SST = 1 - SSE / SST.
When SSE = 0, SST = SSR: the best scenario possible, where R² = 1.
SSE is the controllable part, which is the difference between the actual data value and the value predicted by the regression model/equation. SSR, on the other hand, is the difference between the predicted value and ȳ, the mean of all values of the dependent variable. SSR is not minimized during the calculations, as it cannot be directly controlled; fitting the line minimizes SSE alone.
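A short numerical check (same made-up data as above) that the decomposition SST = SSE + SSR holds for a least-squares line, and that R² = SSR / SST:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

alpha, beta = np.polyfit(x, y, deg=1)
y_hat = beta + alpha * x
y_bar = y.mean()

sse = np.sum((y - y_hat) ** 2)      # residual (unexplained) variation
ssr = np.sum((y_hat - y_bar) ** 2)  # regression (explained) variation
sst = np.sum((y - y_bar) ** 2)      # total variation

print(np.isclose(sst, sse + ssr))   # True: SST = SSE + SSR
print("R^2 =", ssr / sst)           # equivalently 1 - SSE / SST
```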
Cons of Linear Regression:
Though linear regression is simple to implement and its output coefficients are easy to interpret, it has its cons: it assumes a linear relationship between the x and y variables and independence among its attributes. Outliers can also have a huge effect on the fitted model.
Once the regression equation is fixed, it can be tested/interpreted, but only within the range of x values observed in the original data set; extrapolating beyond that range is unreliable.
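To see the outlier sensitivity in action, here is a tiny sketch (made-up data) where corrupting a single point drags the least-squares slope far from the true value of 2:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # perfectly linear, slope = 2

slope_clean, _ = np.polyfit(x, y, deg=1)

y_outlier = y.copy()
y_outlier[-1] = 30.0                        # one corrupted measurement
slope_out, _ = np.polyfit(x, y_outlier, deg=1)

print(slope_clean, slope_out)  # 2.0 vs 6.0: one outlier triples the slope
```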
I trust you found this helpful! Here's to embracing the learning journey and discovering unexpected connections along the way! #MachineLearning #LinearRegression #LearningJourney #Reflections