Bias and Variance in Good Fit Models
Pratap Padhi
Assistant Professor in Actuarial Science | Former Data Science, AI/ML & Actuarial Science Professional | Corporate Trainer | Experienced Across Academia and Industry | Founder
The ultimate goal in model building is to achieve a good fit to the data while simultaneously managing underfitting and overfitting.
This balance is found through the bias-variance tradeoff: locating the sweet spot that optimizes model performance on unseen data.
The bias-variance tradeoff is a crucial concept in achieving this balance. You can read my other article, where I discuss it in detail.
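For squared-error loss, the tradeoff can be written down explicitly through the standard decomposition of expected prediction error:

Expected test error = Bias² + Variance + Irreducible error

Driving one term down typically drives the other up, which is why the goal is a sweet spot rather than the elimination of either term.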
While overfitting, underfitting, bias, and variance are often discussed together in the context of the bias-variance tradeoff, models can exhibit bias or variance issues even without explicitly overfitting or underfitting, depending on their assumptions and complexity.
Even when a model is neither overfitting nor underfitting the data, there is still a trade-off between bias and variance, as the examples below show.
Bias and Variance Issues without Overfitting or Underfitting
Bias and variance are broader concepts in statistics and machine learning, and they can exist independently of underfitting and overfitting in different scenarios.
Example 1 (treating this as a regression problem):
Can we predict the winning lottery number?
Underfit: train a linear regression model.
Overfit: train a neural network with many layers and a large number of neurons for long enough, and it will memorize the input-output pairs in the training set.
Good fit: finalize a model with low MSE, guided directly by bias-variance trade-off methodologies and indirectly by information criteria such as AIC or BIC.
Finally, if we train the best-fit model on past data, with some features as inputs and the winning numbers as outputs, will it be able to predict future draws?
A model that fits seen data to a reasonable extent will not necessarily perform well on unseen data.
In the case of predicting lottery numbers, our good-fit model still suffers from bias and variance. Bias arises from the lack of patterns or meaningful relationships in the data, while variance is introduced by the large number space and the randomness inherent in lottery draws.
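To make this concrete, here is a minimal sketch in Python (scikit-learn, fully synthetic data, all names hypothetical) of what happens when the target is pure noise: even a reasonably tuned model cannot beat the naive baseline of always predicting the mean.

```python
# Minimal sketch: fitting a "good" model to a target that is pure noise.
# Synthetic stand-in for lottery data; there is no signal to learn.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.integers(1, 50, size=(1000, 6)).astype(float)  # features: the previous draw
y = rng.integers(1, 50, size=1000).astype(float)       # next winning number: random

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Ridge(alpha=1.0).fit(X_train, y_train)     # looks fine by tuning criteria
baseline = np.full_like(y_test, y_train.mean())    # naive guess: the training mean

print("model MSE:   ", mean_squared_error(y_test, model.predict(X_test)))
print("baseline MSE:", mean_squared_error(y_test, baseline))
# Both MSEs land near the variance of the draws themselves: the model has
# nothing systematic to capture, no matter how carefully it was selected.
```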
Example 2:
Suppose we are tasked with predicting students' final exam scores based on the number of hours they spent studying. We decided to use a linear regression model, assuming that there is a straightforward linear relationship: more study hours lead to higher exam scores.
However, the real relationship between study hours and exam scores is not strictly linear. It is better described by a curve with diminishing returns: students who study more hours do benefit from the additional time, but each extra hour adds less than the one before.
In this scenario:
If our linear regression model consistently predicts exam scores that are, on average, 5 points below the actual scores, it exhibits bias.
This bias arises because the model simplifies the relationship as linear when it is, in fact, nonlinear. The model systematically underestimates the exam scores for students who studied more hours.
Despite the bias, the linear regression model may not necessarily be underfitting, as it's still a reasonable approximation for the data. It captures the overall trend but fails to capture the nuances of the nonlinear relationship.
So bias can exist in a linear regression model, in the form of systematic prediction errors, whenever the true relationship between the variables is not strictly linear, even if the model is not underfitting.
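Below is a minimal sketch of this kind of systematic bias (synthetic data, with an assumed saturating curve standing in for the diminishing-returns relationship): the average residual flips sign across ranges of study hours, even though the overall linear fit looks acceptable.

```python
# Minimal sketch: a linear model fit to a diminishing-returns relationship
# shows systematic, region-dependent errors (bias), not just random noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=300)
# Assumed true relationship: scores saturate as study hours grow.
scores = 100 * (1 - np.exp(-0.4 * hours)) + rng.normal(0, 3, size=300)

lin = LinearRegression().fit(hours.reshape(-1, 1), scores)
residuals = scores - lin.predict(hours.reshape(-1, 1))

# Mean residual by region: a consistent sign pattern reveals bias.
for lo, hi in [(0, 3), (3, 7), (7, 10)]:
    mask = (hours >= lo) & (hours < hi)
    print(f"{lo}-{hi} hours: mean residual = {residuals[mask].mean():+.2f}")
```

A fit like this can look reasonable in aggregate while still being systematically wrong in predictable regions, which is exactly the signature of bias.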
Example 3:
Consider a decision tree model for predicting whether a loan application will be approved. If the model consistently makes overly optimistic predictions (e.g., it approves high-risk applications), it has bias. This bias exists because the model systematically favors one outcome.
Now, suppose we use an ensemble of decision trees like Random Forest. Each individual tree is relatively simple because it doesn't fit the data too closely. However, when combined, they provide a more robust and accurate prediction. In this case, the ensemble model can have lower variance compared to a single complex tree, even though it's not overfitting.
Here, bias and variance exist independently of underfitting and overfitting. The single decision tree may have bias but might not necessarily be overfitting, while the Random Forest ensemble manages variance without necessarily overfitting.
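Here is a minimal sketch of the variance side (synthetic data again): refit a single unconstrained tree and a Random Forest on bootstrap resamples of the same data, and compare how much their predictions at one fixed point spread.

```python
# Minimal sketch: averaging many trees (Random Forest) shrinks the spread
# of predictions across resampled training sets, i.e., reduces variance.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=400)
x_query = np.array([[1.5]])   # fixed point where we measure prediction spread

tree_preds, forest_preds = [], []
for seed in range(30):                       # 30 bootstrap resamples
    idx = rng.integers(0, 400, size=400)
    Xb, yb = X[idx], y[idx]
    tree_preds.append(DecisionTreeRegressor().fit(Xb, yb).predict(x_query)[0])
    forest_preds.append(
        RandomForestRegressor(n_estimators=100, random_state=seed)
        .fit(Xb, yb).predict(x_query)[0]
    )

print("single tree std:  ", np.std(tree_preds))
print("random forest std:", np.std(forest_preds))
```

The forest's predictions cluster much more tightly from one resample to the next, which is the variance reduction described above.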
These examples illustrate that bias and variance are broader concepts present in various models and situations, independent of whether a model is underfitting or overfitting.
In all cases, the models are not explicitly overfitting or underfitting the training data, yet they exhibit bias or variance problems due to their underlying assumptions and complexities. #artificialintelligence #machinelearning #dataanalytics #businessintelligence #actuarialscience