When is it a good idea to use Linear Regression?

Suppose you're analysing a dataset and you fit a Linear Regression to it. How do you know whether the correlation is a good summary of your data? As Anscombe's Quartet shows, very different datasets can share the same correlation value. Linear Regression is a good idea when our data is well approximated by a bivariate normal distribution.
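To make the Anscombe's Quartet point concrete, here is a minimal sketch using the `anscombe` data frame that ships with base R (columns `x1`..`x4` and `y1`..`y4`). The helper name `correlations` is just for illustration. All four datasets give a correlation of about 0.82, even though plotting them shows four very different shapes.

```r
# Anscombe's Quartet is available in base R as the data frame `anscombe`,
# with one x/y pair per dataset: (x1, y1), (x2, y2), (x3, y3), (x4, y4).
correlations <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(correlations) <- paste0("dataset_", 1:4)

# The four values agree to two decimal places
print(round(correlations, 2))
```

The correlation alone can't tell these datasets apart, which is why we need another way to decide whether a straight line is a sensible summary.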

But what is a Bivariate Normal Distribution?

Let's go back to Galton's family heights dataset. Here we have a pair of variables: the father's height and the son's height. To have a bit of fun, follow along and choose a height in inches (maybe your father's height?). Did you choose? OK then, on to the next step. Now that you've chosen a height, we can pick the heights of all the sons whose father has that height.

What happens to this distribution?

What are your thoughts on the distribution of the sons' heights for the father's height that you've chosen? Is it a beta distribution? Normal? Uniform? It's approximately normally distributed, and we can verify this with a Quantile-Quantile plot:

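library(tidyverse)  # provides %>%, filter() and ggplot()
your_fathers_height <- 72  # the height you chose, in inches
# Q-Q plot of the sons' heights for fathers of the chosen (rounded) height
galton_heights %>%
  filter(round(father) == your_fathers_height) %>%
  ggplot() +
  stat_qq(aes(sample = son))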

We see that the points seem to follow a line, which is what we expect from normally distributed data. But this is only for fathers whose height rounds to 72 inches. In general, we say that:

If X and Y are normally distributed random variables, and for any group of X, say X=x, Y is approximately normal in that group, then the pair is approximately bivariate normal.

We did just that: we picked a value for X (the father's height), say X = 72, and then checked whether Y was approximately normally distributed. If we look at all the groups and see the same trend, we can say that our variables follow a bivariate normal distribution:

galton_heights %>%
  # one group per rounded father's height
  mutate(father_strata = factor(round(father))) %>%
  ggplot() +
  stat_qq(aes(sample = son)) +
  facet_wrap(~father_strata)
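If you'd like a number to go with the plots, here is a rough sketch on simulated data, so it runs without `galton_heights`. Everything in it is an assumption made for illustration: the sample size, the correlation `rho = 0.5`, the seed, and the helper names. It builds a bivariate normal pair, stratifies on rounded `x` the same way we stratified on the father's height, and measures how straight each group's Q-Q plot is through the correlation between sample and theoretical quantiles.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n   <- 10000
rho <- 0.5   # assumed correlation, chosen for illustration

# Standardized bivariate normal pair built from two independent normals
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Stratify on rounded x, mirroring round(father) in the Galton example
strata <- round(x)
keep   <- strata %in% -2:2  # drop the sparse extreme groups

# For each stratum, correlate the sample quantiles of y with the
# theoretical normal quantiles; values near 1 mean a straight Q-Q plot
qq_cor <- tapply(y[keep], strata[keep], function(v) {
  qq <- qqnorm(v, plot.it = FALSE)
  cor(qq$x, qq$y)
})

print(round(qq_cor, 3))
```

This is only a complement to looking at the plots, not a formal normality test. On real data with small groups the values will be noisier.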

If we get to this point, we can say it's a good idea to use Linear Regression on our dataset.
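Why does bivariate normality justify a straight line? The standard textbook result, written here with notation that is not in the text above (means \mu_X and \mu_Y, standard deviations \sigma_X and \sigma_Y, correlation \rho), is that for a bivariate normal pair the distribution of Y within the group X = x is itself normal, with a mean that is linear in x:

```latex
Y \mid X = x \;\sim\; \mathcal{N}\!\left( \mu_Y + \rho \, \frac{\sigma_Y}{\sigma_X} \, (x - \mu_X), \;\; \sigma_Y^2 \, (1 - \rho^2) \right)
```

So the average of Y in each group moves along a line with slope \rho \sigma_Y / \sigma_X as x changes, and that line is the regression line.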

