The birth of correlation and regression
Suppose you are in the 19th century and, for some reason, you want to know how much of a son's height you can predict from his father's height. Francis Galton had exactly that question, and while answering it he developed the concepts of regression and correlation. To pursue this analysis we'll use the heights of Galton's families, available through the HistData package in R.
A look at the distribution of heights of sons and fathers
Not knowing about correlation, we could perhaps use the mean of the sons' heights, about 70.45 inches; that would most certainly minimize our chances of being wrong. But can we do better? Sure! What if we think about conditioning? If we know for a fact that a father is 72 inches tall, could this piece of data allow us to make a better prediction? We have the data, so we can experiment. A problem then appears: we have just 8 fathers with a height of exactly 72 inches. What about rounding the values to the nearest inch? We go from 8 to 14 fathers. This still doesn't look like much, but let's test it. If we let X = 72, X being the father's height, what could we expect from Y (the son's height)?
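The original analysis uses R's HistData data; as a rough sketch of the conditioning step, here is the same idea in Python with *simulated* father/son heights (the means, spread, and correlation below are illustrative assumptions, not Galton's actual dataset):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate father/son heights (inches) with a positive association.
# These parameters are illustrative, not Galton's estimates.
n = 1000
father = rng.normal(69, 2.5, n)
son = 0.5 * father + rng.normal(35, 2.0, n)

# Condition on fathers whose rounded height is 72 inches and
# compare the mean son height in that group with the overall mean.
mask = np.round(father) == 72
conditional_mean = son[mask].mean()
overall_mean = son.mean()
print(f"overall mean son height:    {overall_mean:.2f}")
print(f"E[son | father ~= 72 in.]:  {conditional_mean:.2f}")
```

Because tall fathers tend to have tall sons, the conditional mean lands above the overall mean, which is exactly the improvement in the guess described above.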
We have now updated our initial guess from 70.45 to the conditional expected value of 71.83. Let's go further: what if we look at all the groups? What would you expect? We can analyse this data with boxplots.
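The boxplot analysis amounts to grouping sons by each rounded father height and looking at each group's distribution. A minimal sketch of that grouping, again with simulated (not real) Galton-like data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
father = rng.normal(69, 2.5, n)          # simulated, illustrative only
son = 0.5 * father + rng.normal(35, 2.0, n)

# Group son heights by each rounded father height, as the boxplots do.
rounded = np.round(father).astype(int)
group_means = {}
for h in range(66, 73):
    group = son[rounded == h]
    if len(group):
        group_means[h] = group.mean()
        print(f'father {h} in.: n={len(group):4d}  mean son = {group_means[h]:.2f}')
```

Printing the group means makes the pattern in the boxplots visible: the conditional means climb steadily as the fathers get taller, which is the "following a line" behaviour discussed next.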
If we look closely, the sons' expected values seem to follow a line. Francis Galton showed that

(ŷ − ȳ) / s_y = ρ · (x − x̄) / s_x

where ρ (rho) is the correlation between the two variables. If we put this into the standard form of a line we get the Regression Line, which is used to predict the value of y given x. Using the Regression Line is much better than making a prediction based only on the conditional group because it uses all the data, not just the conditioned values. We can calculate these values from our data and plot the Regression Line now.
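Rearranging Galton's relation into slope-intercept form gives slope = ρ·s_y/s_x and intercept = ȳ − slope·x̄. A self-contained sketch of that computation (on simulated stand-in data, since the real dataset lives in R's HistData):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
father = rng.normal(69, 2.5, n)          # simulated, illustrative only
son = 0.5 * father + rng.normal(35, 2.0, n)

# Regression line from the correlation: slope = rho * s_y / s_x,
# intercept chosen so the line passes through the point of means.
rho = np.corrcoef(father, son)[0, 1]
slope = rho * son.std() / father.std()
intercept = son.mean() - slope * father.mean()

def predict(x):
    return intercept + slope * x

print(f"predicted son height for a 72 in. father: {predict(72):.2f}")
```

Note that the line always passes through (x̄, ȳ), so predictions for fathers above the mean are pulled back toward the mean: the "regression" Galton observed.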
Using our Regression Line we can predict that a father of 72 inches would have a son of about 71.9 inches. But wait: when would this Regression Line actually make a good prediction on our data? We have to be careful using it. To make the case, we can use Anscombe's Quartet, which shows four different datasets with the same correlation coefficient, yet only in some of them would it be a good idea to use the Regression Line to make a prediction.
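Anscombe's Quartet is small enough to check directly. The four datasets below are the classic published values; computing the correlation for each shows they all come out around 0.816 despite looking completely different when plotted:

```python
import numpy as np

# Anscombe's Quartet: four datasets with (nearly) identical correlation.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {}
for name, (x, y) in anscombe.items():
    correlations[name] = np.corrcoef(x, y)[0, 1]
    print(f"dataset {name}: r = {correlations[name]:.3f}")
```

The identical r values are exactly why the correlation coefficient alone cannot tell you whether the Regression Line is a sensible predictor; you have to look at the data.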
Next I'll be making more posts exploring when it is a good idea to use the Regression Line, based on bivariate normal distributions.