Linear Regression: Concepts, Notation, and Overfitting
Notation Key
New Insights
We want to find relationships between different variables in order to uncover new patterns.
Regression Framework
The data consist of n pairs (X₁, Y₁), …, (Xₙ, Yₙ). Each Xᵢ is a vector of dimension m; each Yᵢ is a scalar. A new X then arrives with its Y unknown:

    X₁ (vector, dimension m) | Y₁ (scalar)
    ⋮                        | ⋮
    Xₙ                       | Yₙ
    -------------------------+------------
    X (new)                  | Ŷ = ?
Regressor/predictor: Ŷ = g(X)
Goal: Build a function g so that when a new X comes in, we can output the predicted value Ŷ.  X → [g] → Ŷ
We need to learn a good g from the data.
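As a minimal sketch of this setup in Python (assuming NumPy is available; the data and the predictor here are made up for illustration), a predictor is just a function from an attribute vector to a scalar:

```python
import numpy as np

# Hypothetical data: n observations; each X_i is a vector of m attributes, Y_i a scalar.
rng = np.random.default_rng(0)
n, m = 100, 3
X = rng.normal(size=(n, m))   # rows are X_1, ..., X_n
Y = rng.normal(size=n)        # Y_1, ..., Y_n

def g(x):
    """A placeholder predictor: maps an attribute vector x to a scalar Y-hat."""
    return float(np.mean(x))  # deliberately naive; a good g must be learned from the data

x_new = rng.normal(size=m)    # a new X comes in ...
y_hat = g(x_new)              # ... and the predictor outputs Y-hat
```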
Objective Function
We judge a predictor g by the sum of squared errors it makes on the data:

    Σᵢ (Yᵢ − g(Xᵢ))²  summed over the n data points

Where:
Yᵢ is the observed value and g(Xᵢ) is the predicted value for the i-th data point.
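Evaluating this objective for a candidate predictor is short in code; the sketch below reuses the hypothetical X, Y, and g from above:

```python
import numpy as np

def sum_of_squared_residuals(g, X, Y):
    """Sum over all data points of (Y_i - g(X_i))^2."""
    predictions = np.array([g(x) for x in X])
    return float(np.sum((Y - predictions) ** 2))

# Reusing X, Y, and g from the sketch above:
# sum_of_squared_residuals(g, X, Y)
```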
Overfitting Warning
If g is an arbitrary curve (one that passes through all the data points, so the error is 0), we cannot believe it. This is called overfitting, and we need to avoid it since it leads to nonsensical conclusions.
[Graph showing overfitting: A wiggly line that perfectly passes through all points, versus a simpler linear fit]
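To make the warning concrete, here is a small illustration on made-up 1D data, comparing a straight-line fit to a degree-7 polynomial that passes through every training point (numpy.polyfit does both fits):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 8)
y = 2 * x + rng.normal(scale=0.2, size=x.size)  # roughly linear data with noise

line = np.polyfit(x, y, deg=1)     # simple linear fit
wiggly = np.polyfit(x, y, deg=7)   # degree-7 curve: interpolates all 8 points

# The wiggly fit has ~zero training error but oscillates between the points:
x_mid = 0.5 * (x[:-1] + x[1:])     # evaluate between the training points
print(np.polyval(line, x_mid))     # sensible interpolations
print(np.polyval(wiggly, x_mid))   # can swing far away from the trend
```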
Linear Regression Model
We prohibit g from being arbitrarily general and restrict it to a limited class of predictors:
Within linear regression, we restrict to the class of predictors that are linear in the attributes of the X vector.
When a person comes in with attributes X₁ up to Xₘ, we form a linear combination of these attributes:

    Ŷ = β₀ + β₁X₁ + β₂X₂ + … + βₘXₘ
A choice of the predictor is a choice of β; in 2D, β determines the location of the line.
The intercept here would be β₀ and the slope would be β₁. By playing with β, we can move the line around.
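A sketch of this predictor class in code (the data and β values below are illustrative only):

```python
import numpy as np

def linear_predictor(beta, X):
    """Y-hat = beta_0 + beta_1 * X_1 + ... + beta_m * X_m for each row of X."""
    return beta[0] + X @ beta[1:]

X = np.array([[1.0], [2.0], [3.0]])                # m = 1: a single attribute
print(linear_predictor(np.array([0.0, 1.0]), X))   # slope 1 through the origin
print(linear_predictor(np.array([2.0, 0.5]), X))   # shifted up, shallower slope
```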
Residuals
A residual is the error between the predicted value and the observed value: eᵢ = Yᵢ − Ŷᵢ.
We need to find β so that the sum of squared residuals is as small as possible:

    minimize over β:  Σᵢ (Yᵢ − (β₀ + β₁Xᵢ₁ + … + βₘXᵢₘ))²

Any choice of the line gives a certain numerical value for the sum of squared residuals; picking the β that minimizes it is called ordinary least squares (OLS).
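The minimizing β can be computed with a standard least-squares solver. A sketch using NumPy, where the design matrix gets a column of ones so that β₀ acts as the intercept:

```python
import numpy as np

def ols_fit(X, Y):
    """Return beta = (beta_0, ..., beta_m) minimizing the sum of squared residuals."""
    design = np.column_stack([np.ones(len(Y)), X])   # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return beta
```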
Assumptions of Linear Regression
Example Application
Linear regression software produces the coefficients that multiply the ad expenditure on the different channels.
In the multiple regression, the Newspaper coefficient of -0.001 is unusual, as it suggests that the more you spend, the lower the sales.
Simple linear regression example: Sales = 12.35 + 0.055(Newspaper)
The 0.055 contradicts the -0.001, so which one is true?
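Both numbers can be valid OLS outputs on the same data: when channels are correlated, a channel's coefficient typically changes once the other channels are included. A sketch with synthetic data (all numbers below are invented, not the ones above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
radio = rng.normal(size=n)
newspaper = 0.8 * radio + rng.normal(scale=0.3, size=n)     # correlated with radio
sales = 12.0 + 3.0 * radio + rng.normal(scale=0.5, size=n)  # no direct newspaper effect

def ols(X, y):
    """OLS via least squares; the first returned entry is the intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

print(ols(newspaper[:, None], sales))                    # simple: Newspaper looks helpful
print(ols(np.column_stack([radio, newspaper]), sales))   # multiple: Newspaper coefficient ~ 0
```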