When is it a good idea to use Linear Regression?
Let's suppose you're analysing a dataset and fit a Linear Regression to it. How do you know whether the correlation is a good summary of your data? In Anscombe's Quartet, the same correlation value describes datasets that look very different from one another. Correlation, and the regression line that goes with it, is a good summary when our data is approximated by a bivariate normal distribution.
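To make the Anscombe's Quartet point concrete, here is a minimal sketch using the `anscombe` data frame that ships with base R. The name `correlations` is just an illustrative variable. All four correlations come out at roughly 0.82, even though plotting the four datasets shows very different shapes.

```r
# `anscombe` is part of base R (datasets package): columns x1..x4 and y1..y4.
# Compute the correlation of each (x_i, y_i) pair.
correlations <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(correlations) <- paste("dataset", 1:4)
print(round(correlations, 2))
```

A single number like this cannot tell the four datasets apart, which is why we need some other way to decide whether a line is a sensible summary.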
But what is a Bivariate Normal Distribution?
Let's go back to Galton's family heights dataset. Take a pair of variables, say the father's height and the son's height. To have a bit of fun, follow along and choose a height in inches (maybe your father's height?). Did you choose? OK then, on to the next step. Now that you've chosen a height, we can pick out the heights of all the sons whose fathers have that height.
What happens to this distribution?
What are your thoughts on the distribution of the sons' heights for the father's height you've chosen? Is it a beta distribution? Normal? Uniform? It's approximately normal, and we can verify this with a Quantile-Quantile plot:
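library(dplyr)
library(ggplot2)
your_fathers_height <- 72  # the height you chose, in inches
galton_heights %>%
  filter(round(father) == your_fathers_height) %>%  # keep sons whose father is about this tall
  ggplot(aes(sample = son)) +
  stat_qq() +      # sample quantiles of son vs. theoretical normal quantiles
  stat_qq_line()   # reference line the points should follow if son is normal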
We see that the data seems to follow a line. But this only covers fathers whose height rounds to 72. We say that:
If X and Y are normally distributed random variables, and for any group of X, say X=x, Y is approximately normal in that group, then the pair is approximately bivariate normal.
We did just that: we picked a value for X (the father's height), say X = 72, and then checked whether Y was normally distributed. If we look at all the groups and see the same trend, we can say that our variables follow a bivariate normal distribution:
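galton_heights %>%
  # one group per rounded father height (skip this if father_strata is already a column)
  mutate(father_strata = factor(round(father))) %>%
  ggplot(aes(sample = son)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~father_strata)  # one Q-Q plot per group of fathers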
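The plots above rely on the eye. As a rough numeric complement, the sketch below simulates a bivariate normal pair and runs a Shapiro-Wilk test (a swapped-in technique, not something used elsewhere in this post) on the sons within each stratum. The means, standard deviations, and correlation are illustrative assumptions rather than estimates from Galton's data, and `sim`, `big`, and `p_values` are hypothetical names.

```r
set.seed(1)
n   <- 2000
rho <- 0.5  # assumed correlation, for illustration only

# Simulate fathers, then build sons so that (father, son) is bivariate normal
# with correlation rho and the same mean and standard deviation.
father <- rnorm(n, mean = 69, sd = 2.5)
son    <- 69 + rho * (father - 69) + sqrt(1 - rho^2) * rnorm(n, sd = 2.5)
sim    <- data.frame(father = father, son = son)
sim$father_strata <- round(sim$father)

# Keep only strata with enough observations for the test to be meaningful.
counts <- table(sim$father_strata)
big    <- as.numeric(names(counts)[counts >= 30])

# Shapiro-Wilk p-value for the sons' heights within each stratum.
p_values <- sapply(big, function(h) {
  shapiro.test(sim$son[sim$father_strata == h])$p.value
})
names(p_values) <- big
print(round(p_values, 3))
```

A large p-value only means the test found no evidence against normality in that stratum; it does not prove normality, and with many strata a few small p-values can show up by chance. The Q-Q plots remain the main diagnostic.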
If we reach this point, we can say it's a good idea to use Linear Regression on our dataset.
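Why does bivariate normality justify a line? A standard result is that when the pair (X, Y) is bivariate normal, the expected value of Y given X = x is a linear function of x. The symbols below are notation introduced for this note, not taken from the text above: \mu_X and \mu_Y are the means, \sigma_X and \sigma_Y the standard deviations, and \rho the correlation of the pair.

```latex
\[
  \mathrm{E}[\,Y \mid X = x\,] \;=\; \mu_Y + \rho \,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)
\]
```

In the heights example, this says the average son's height changes by \rho \sigma_Y / \sigma_X inches for each extra inch of father's height, which is exactly the kind of relationship a regression line estimates.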