Understanding the statistical definition of Bias and Variance
Why do Bias and Variance matter?
In practical Machine Learning, we want to know whether a trained model is overfitting or under-fitting.
Overfitting is associated with high Variance and under-fitting is associated with high Bias.
But the term Variance here is not used in exactly the same sense as the statistical variance.
Why is overfitting associated with high Variance?
Overfitting means that if we pick a different training set, we would get a different model. That is roughly what we mean by Variance.
But Variance has a very specific definition in statistics. How would that relate to the above definition?
In more exact terms, the Variance here is the variance of the model's prediction ŷ_0 for the true value y_0 = f(x_0).
Statistical definition of variance
In statistics, the variance of a set of numbers is the average of the squared differences from their Mean.
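Written as a formula (my notation, the standard population-variance form), for a set of numbers x_1, ..., x_n with mean x̄:

```latex
\mathrm{Var}(x) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```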
This brings us to our next question: Variance over which set? The answer: over the set of all possible training sets. Yeah, a set of sets sounds too nerdy to me as well.
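One way to write this down (my own notation, not something standardized in this post): if f̂_D denotes the model trained on a training set D, then the Variance of the prediction at a fixed test point x_0 is the variance of f̂_D(x_0) as D varies over possible training sets:

```latex
\mathrm{Var}\big(\hat{y}_0\big) \;=\; \mathbb{E}_D\!\left[\Big(\hat{f}_D(x_0) - \mathbb{E}_D\big[\hat{f}_D(x_0)\big]\Big)^{2}\right]
```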
To make this more concrete, I use a 1D input feature X and sample 1000 Y values for it.
On the left side you see the whole data set X in grey and the training set we have chosen in blue.
On the right side you see the line that was fitted to that training set.
The red dot is the test data point x0 = 10 and its predicted value y0.
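To make this idea reproducible, here is a minimal sketch (my own code, not the code behind the figures; the synthetic true function, noise level, and training-set size are assumptions for illustration). It repeatedly samples a training set, fits a line to it, predicts at x0 = 10, and then computes the variance of those predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "population": 1D feature X and noisy responses Y.
# The true function (2x + 5) and noise level are assumptions for illustration.
def sample_population(n=1000):
    X = rng.uniform(0, 20, size=n)
    Y = 2.0 * X + 5.0 + rng.normal(scale=8.0, size=n)
    return X, Y

X_pop, Y_pop = sample_population()

x0 = 10.0          # fixed test point
n_train = 30       # size of each training set
n_repeats = 500    # number of training sets we sample

predictions = []
for _ in range(n_repeats):
    # Pick a different training set each time
    idx = rng.choice(len(X_pop), size=n_train, replace=False)
    X_tr, Y_tr = X_pop[idx], Y_pop[idx]

    # Fit a straight line (degree-1 polynomial) to this training set
    slope, intercept = np.polyfit(X_tr, Y_tr, deg=1)

    # Prediction y0_hat for the test point x0
    predictions.append(slope * x0 + intercept)

predictions = np.array(predictions)

# Variance of the predictions at x0, over the sampled training sets:
# the average squared difference from their mean.
print("mean prediction at x0:", predictions.mean())
print("Variance of prediction at x0:", np.mean((predictions - predictions.mean()) ** 2))
```

Each loop iteration plays the role of one blue training set plus its fitted line from the figures above; the printed number is exactly the "average squared difference from the mean" applied to the predictions at x0.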
Now you might ask why we would sample different training sets at all. Why don't we just use the whole data set?
The answer is that we don't know the whole data set. The whole data set is the population, and we can only make assumptions about it.
I will talk about the population data set and the Bias-Variance tradeoff in the next post.