Why Mean Squared Error (MSE)? Why not any other loss function?
Martin Khristi
Say you wish to train a linear regression model. We know that we train it by minimizing the squared error:
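Throughout, I'll write θ for the parameter vector and (x^{(i)}, y^{(i)}) for the i-th of n observations; the equations below are the standard forms of this derivation:

$$ J(\theta) = \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 $$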
See, many functions can measure the difference between observed and predicted values, and any of them could be minimized. But of all the possible choices, what is so special about the squared error?
In my experience, people often justify it with informal arguments. Sadly, each of these explanations is incorrect.
But approaching it from a probabilistic perspective helps us understand why the squared error is the ideal choice.
Let’s begin.
In linear regression, we predict our target variable y using the inputs X as follows:
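$$ y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} $$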
Here, ε^{(i)} is an error term that captures the random noise for the i-th data point.
We assume the noise is drawn from a Gaussian distribution with zero mean. This is motivated by the central limit theorem: the noise aggregates many small, independent effects, and such sums tend toward a Gaussian:
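$$ \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2) $$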
Thus, the probability of observing the error term can be written as:
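$$ P\left(\epsilon^{(i)}\right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left(\epsilon^{(i)}\right)^2}{2\sigma^2} \right) $$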
Substituting the error term from the linear regression equation, we get:
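$$ P\left(y^{(i)} \mid x^{(i)}; \theta\right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2} \right) $$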
For a specific set of parameters θ, the above tells us the probability of observing a data point (i).
Next, we can define the likelihood function as follows:
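$$ L(\theta) = P\left(\vec{y} \mid X; \theta\right) $$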
The likelihood is a function of θ: by varying θ, we fit a distribution to the observed data and quantify how likely the data is under that fit.
We further write it as a product over individual data points because we assume all observations are independent.
Thus, we get:
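$$ L(\theta) = \prod_{i=1}^{n} P\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left(y^{(i)} - \theta^T x^{(i)}\right)^2}{2\sigma^2} \right) $$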
Since the log function is monotonic, maximizing the log-likelihood is equivalent to maximizing the likelihood itself. This procedure is called maximum likelihood estimation (MLE).
Simplifying, we get:
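$$ \log L(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} \;-\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 $$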
To reiterate, the objective is to find the θ that maximizes the above expression.
But the first term is independent of θ. Thus, maximizing the above expression is equivalent to minimizing the second term.
And if you notice closely, the second term is precisely the squared error, scaled by a constant 1/(2σ²) that does not affect the minimizer.
Thus, you can maximize the log-likelihood by minimizing the squared error.
And this is the origin of least-squares in linear regression.
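To sanity-check this equivalence numerically, here's a minimal sketch (mine, not from the article; it assumes NumPy and SciPy are available, and all names are illustrative). It fits simulated data once by least squares and once by directly maximizing the Gaussian log-likelihood, and the two estimates agree:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate data from the assumed model: y = theta^T x + Gaussian noise
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one feature
true_theta = np.array([2.0, -3.0])
sigma = 0.5
y = X @ true_theta + rng.normal(scale=sigma, size=n)

# 1) Least-squares estimate (closed form)
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Maximum-likelihood estimate: minimize the negative Gaussian log-likelihood
def neg_log_likelihood(theta):
    resid = y - X @ theta
    return 0.5 * np.sum(resid**2) / sigma**2 + n * np.log(sigma * np.sqrt(2 * np.pi))

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x

print(theta_ols)  # the two printed vectors agree up to solver tolerance
print(theta_mle)
```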
See, there’s clear proof and reasoning behind using squared error as a loss function in linear regression.
Nothing comes from thin air in machine learning :)
But did you notice that in this derivation, we made a lot of assumptions?
Firstly, we assumed the noise was drawn from a Gaussian distribution. But why?
We assumed independence of observations. Why, and what happens if that does not hold?
Next, we assumed that each error term is drawn from a distribution with the same variance σ. But what if the variance differs across observations, i.e., each ε^{(i)} is drawn from N(0, σ_i²)?
In that case, repeating the same derivation, the squared-error term comes out to be:
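$$ \frac{1}{2} \sum_{i=1}^{n} \frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{\sigma_i^2} $$

Each residual is now weighted by the inverse of its own noise variance, which is the weighted least-squares objective rather than plain MSE.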
How to handle this?
I discussed the origin of all assumptions of linear regression in detail here:
Thanks for reading!