The birth of correlation and regression

Suppose you are in the 19th century and, for some reason, you want to know how much of a son's height you can predict from his father's height. Francis Galton asked exactly that question, and while answering it he developed the concepts of regression and correlation. To pursue this analysis we'll use the family height data collected by Galton, available through the HistData package in R.

A look at the distribution of heights of sons and fathers

Knowing nothing about correlation, we could simply use the mean of the sons' heights, about 70.45 inches; that guess minimizes our expected squared error. But can we do better? Sure! What if we condition on the father's height? If we know for a fact that the father is 72 inches tall, could that piece of data give us a better prediction? We have the data, so we can experiment. A problem appears: we have only 8 fathers who are exactly 72 inches tall. What about rounding the heights to the nearest inch? That takes us from 8 to 14 fathers. It still doesn't look like much, but let's test it. If we let X = 72, X being the father's height, what should we expect from Y (the son's height)?
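This conditioning step can be sketched in Python. The original analysis uses R's HistData package; here we generate synthetic father/son heights as a stand-in, so the exact numbers differ from Galton's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for Galton's data: father/son heights in inches.
# Sons' heights are generated to be positively correlated with fathers'.
n = 1000
father = rng.normal(69, 2.5, n)
son = 0.5 * father + rng.normal(35.0, 2.2, n)

overall_mean = son.mean()  # our first, unconditional guess

# Condition on fathers whose height rounds to 72 inches
mask = np.rint(father) == 72
conditional_mean = son[mask].mean()

print(f"unconditional guess: {overall_mean:.2f}")
print(f"conditional mean given X = 72: {conditional_mean:.2f}")
```

Because 72 inches is above the average father's height and the two variables are positively correlated, the conditional mean comes out higher than the unconditional one.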


We have updated our initial guess from 70.45 to the conditional expected value of 71.83. Let's go further: what happens if we look at all the groups? What would you expect? We can examine this data with boxplots.
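To see the pattern across all groups, we can compute the mean son's height within each rounded father's height. This is a sketch using synthetic father/son heights (not Galton's actual numbers), but it shows the same linear trend in the group means:

```python
import numpy as np

rng = np.random.default_rng(0)
father = rng.normal(69, 2.5, 1000)
son = 0.5 * father + rng.normal(35.0, 2.2, 1000)

# Mean son's height within each rounded father's height
rounded = np.rint(father)
heights = np.arange(65, 74)  # groups with plenty of observations
group_means = np.array([son[rounded == h].mean() for h in heights])

# The group means line up almost perfectly on a straight line
r = np.corrcoef(heights, group_means)[0, 1]
print(r)
```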

If we look closely, the sons' expected values seem to follow a line. Francis Galton showed that

    (E[Y | X = x] − μ_Y) / σ_Y = ρ · (x − μ_X) / σ_X

where ρ (rho) is the correlation between the two variables. Rearranging this into the standard form of a line gives the Regression Line, which is used to predict the value of y given x. The Regression Line is much better than predicting from a single conditional group, because it uses all the data, not just the observations at the conditioned value. We can compute these values from our data and plot the Regression Line.
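A minimal Python sketch of computing that line, with slope ρ·σy/σx and intercept μy − slope·μx, again on synthetic father/son heights rather than the actual HistData dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
father = rng.normal(69, 2.5, 1000)
son = 0.5 * father + rng.normal(35.0, 2.2, 1000)

# Regression Line from the correlation coefficient
rho = np.corrcoef(father, son)[0, 1]
slope = rho * son.std() / father.std()
intercept = son.mean() - slope * father.mean()

# Predict a son's height for a 72-inch father
predicted = intercept + slope * 72
print(f"slope={slope:.3f}, intercept={intercept:.2f}, prediction={predicted:.2f}")
```

Note that ρ·σy/σx is exactly the ordinary least-squares slope, so this is the same line a standard linear fit would give.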

Using our Regression Line, we predict that a father of 72 inches would have a son of about 71.9 inches. But wait: when does the Regression Line actually make a good prediction on our data? We have to be careful using it. To make the case, we can look at Anscombe's Quartet, which shows four different datasets with the same correlation coefficient; only for some of them would the Regression Line be a good way to make a prediction.
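Anscombe's four datasets are small and well known, so we can verify this claim directly (values as published by Anscombe, 1973):

```python
import numpy as np

# Anscombe's Quartet: four datasets with nearly identical summary statistics
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

# All four correlations come out around 0.816, yet only the first dataset
# is the kind of linear cloud where the Regression Line predicts well.
correlations = [np.corrcoef(x, y)[0, 1] for x, y in quartet]
print(correlations)
```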

Next I'll be making more posts exploring when it is a good idea to use the Regression Line, based on bivariate normal distributions.

Code on Github
