Bias and Variance in Good Fit Models
Pratap Padhi
Assistant Professor in Actuarial Science | Former Data Science, AI/ML & Actuarial Science Professional | Corporate Trainer | Experienced Across Academia and Industry | Founder
The ultimate goal in model building is to achieve a good fit to the data while simultaneously managing underfitting and overfitting.
This balance is found through the bias-variance tradeoff: locating the sweet spot that optimizes model performance on unseen data.
The bias-variance tradeoff is a crucial concept in achieving this balance. You can read my other article, where I discuss it in detail.
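For squared-error loss, the tradeoff can be written down explicitly through the standard decomposition of expected prediction error:

Expected test error = Bias² + Variance + Irreducible error

Driving one term down typically drives the other up, which is why the goal is a sweet spot rather than the elimination of either term.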
While overfitting, underfitting, bias, and variance are often discussed together in the context of the bias-variance tradeoff, models can exhibit bias or variance issues even without explicitly overfitting or underfitting, depending on their assumptions and complexity.
Even when a model is neither overfitting nor underfitting the data, there is still a trade-off between bias and variance, as the examples below show.
Bias and Variance Issues without Overfitting or Underfitting
Bias and variance are broader concepts in statistics and machine learning, and they can exist independently of underfitting and overfitting in different scenarios.
Example 1 (treating this as a regression problem):
Can we predict the winning lottery number?
Underfit: train a linear regression model.
Overfit: train a neural network with many layers and a large number of neurons for long enough, and it will memorize the input-output pairs in the training set.
Good fit: finalize a model with low MSE, guided directly by bias-variance trade-off methodologies and indirectly by information criteria such as AIC or BIC.
Finally, if we train the best-fit model on past data, with some features as inputs and the winning numbers as outputs, will it be able to predict future draws?
A model that fits seen data to a reasonable extent will not necessarily perform well on unseen data.
In the case of predicting lottery numbers, our good-fit model still suffers from bias and variance. Bias arises from the lack of patterns or meaningful relationships in the data, while variance is introduced by the large number space and the randomness inherent in lottery draws.
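To make this concrete, here is a minimal sketch in Python (scikit-learn, fully synthetic data, all names hypothetical) of what happens when the target is pure noise: even a reasonably tuned model cannot beat the naive baseline of always predicting the mean.

```python
# Minimal sketch: fitting a "good" model to a target that is pure noise.
# Synthetic stand-in for lottery data; there is no signal to learn.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.integers(1, 50, size=(1000, 6)).astype(float)  # features: the previous draw
y = rng.integers(1, 50, size=1000).astype(float)       # next winning number: random

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)     # looks fine by tuning criteria
baseline = np.full_like(y_test, y_train.mean())    # naive guess: the training mean

print("model MSE:   ", mean_squared_error(y_test, model.predict(X_test)))
print("baseline MSE:", mean_squared_error(y_test, baseline))
# Both MSEs land near the variance of the draws themselves: the model has
# nothing systematic to capture, no matter how carefully it was selected.
```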
Example 2:
Suppose we are tasked with predicting students' final exam scores based on the number of hours they spent studying. We decided to use a linear regression model, assuming that there is a straightforward linear relationship: more study hours lead to higher exam scores.
However, the real relationship between study hours and exam scores is not strictly linear. It is better described by a curve with diminishing returns: students who study more hours do benefit from the additional time, but each extra hour adds less than the one before.
In this scenario:
If our linear regression model consistently predicts exam scores that are, on average, 5 points below the actual scores, it exhibits bias.
This bias arises because the model simplifies the relationship as linear when it is, in fact, nonlinear. The model systematically underestimates the exam scores for students who studied more hours.
Despite the bias, the linear regression model may not necessarily be underfitting, as it's still a reasonable approximation for the data. It captures the overall trend but fails to capture the nuances of the nonlinear relationship.
So bias can exist in a linear regression model, in the form of systematic prediction errors, whenever the true relationship between the variables is not strictly linear, even if the model is not underfitting.
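Below is a minimal sketch of this kind of systematic bias (synthetic data, with an assumed saturating curve standing in for the diminishing-returns relationship): the average residual flips sign across ranges of study hours, even though the overall linear fit looks acceptable.

```python
# Minimal sketch: a linear model fit to a diminishing-returns relationship
# shows systematic, region-dependent errors (bias), not just random noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=300)
# Assumed true relationship: scores saturate as study hours grow.
scores = 100 * (1 - np.exp(-0.4 * hours)) + rng.normal(0, 3, size=300)

lin = LinearRegression().fit(hours.reshape(-1, 1), scores)
residuals = scores - lin.predict(hours.reshape(-1, 1))

# Mean residual by region: a consistent sign pattern reveals bias.
for lo, hi in [(0, 3), (3, 7), (7, 10)]:
    mask = (hours >= lo) & (hours < hi)
    print(f"{lo}-{hi} hours: mean residual = {residuals[mask].mean():+.2f}")
```

A fit like this can look reasonable in aggregate while still being systematically wrong in predictable regions, which is exactly the signature of bias.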
Example 3:
Consider a decision tree model for predicting whether a loan application will be approved. If the model consistently makes overly optimistic predictions (e.g., it approves high-risk applications), it has bias. This bias exists because the model systematically favors one outcome.
Now, suppose we use an ensemble of decision trees like Random Forest. Each individual tree is relatively simple because it doesn't fit the data too closely. However, when combined, they provide a more robust and accurate prediction. In this case, the ensemble model can have lower variance compared to a single complex tree, even though it's not overfitting.
Here, bias and variance exist independently of underfitting and overfitting. The single decision tree may have bias but might not necessarily be overfitting, while the Random Forest ensemble manages variance without necessarily overfitting.
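Here is a minimal sketch of the variance side (synthetic data again): refit a single unconstrained tree and a Random Forest on bootstrap resamples of the same data, and compare how much their predictions at one fixed point spread.

```python
# Minimal sketch: averaging many trees (Random Forest) shrinks the spread
# of predictions across resampled training sets, i.e., reduces variance.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=400)
x_query = np.array([[1.5]])   # fixed point where we measure prediction spread

tree_preds, forest_preds = [], []
for seed in range(30):                       # 30 bootstrap resamples
    idx = rng.integers(0, 400, size=400)
    Xb, yb = X[idx], y[idx]
    tree_preds.append(DecisionTreeRegressor().fit(Xb, yb).predict(x_query)[0])
    forest_preds.append(
        RandomForestRegressor(n_estimators=100, random_state=seed)
        .fit(Xb, yb).predict(x_query)[0]
    )

print("single tree std:  ", np.std(tree_preds))
print("random forest std:", np.std(forest_preds))
```

The forest's predictions cluster much more tightly from one resample to the next, which is the variance reduction described above.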
These examples illustrate that bias and variance are broader concepts present in various models and situations, independent of whether a model is underfitting or overfitting.
In all cases, the models are not explicitly overfitting or underfitting the training data, yet they exhibit bias or variance problems due to their underlying assumptions and complexities. #artificialintelligence #machinelearning #dataanalytics #businessintelligence #actuarialscience