How to sound intelligent while being clueless in statistics?

Have you ever felt confused during discussions about statistics, modeling, or data analysis? Perhaps you struggled to keep pace with intricate scientific jargon and ended up feeling a bit lost.

If that sounds familiar, we’ve got you covered: below you will find a list of statistical terms that may help you overcome these challenges.

Knowing these terms should boost your confidence in data-related conversations and give you a feel for the logic of statistics.

source: tenor.com/pl/view/clever-funny-gif-23637370

So, let's embark on this journey, beginning with the foundational concepts.

Distribution

Just as a doctor first collects a patient's medical history, every (good) data analyst first looks at the distribution of the data before modelling.

In plain language, this means checking how the values of the studied variable are spread across the sample.

It is as simple as drawing a histogram and checking the minimum and maximum values, the mean, median and mode, and how tightly the values cluster around the mean.


In fact, wages tend to follow a lognormal distribution, but for the sake of the example we keep it simple and assume a normal distribution (centered around the mean).

Consider this scenario: we're examining wages in Poland, and there are two possible salary distributions, as shown in the plot above. A statistician can quickly see that in the blue distribution wages are more widely spread out. This indicates greater variability in income across the sample than in the yellow distribution, where the values cluster tightly around the mean. In the blue case, the evidence would imply that some people earn a lot more than others and that considerable income gaps persist in this society. In contrast, in the yellow case salary differences are not as dramatic, so income is more equally distributed.

This is why distribution plots are the starting point of any data analysis.
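To make the comparison of the two wage distributions more tangible, here is a minimal sketch in Python. The wage figures (a mean of 7000 PLN, spreads of 2500 and 800) are invented for illustration and are not real Polish data; `describe` is a hypothetical helper, and the mode is left out because it is rarely meaningful for continuous values such as wages.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical monthly wages (PLN): same mean, different spread.
# The numbers are made up for illustration, not real wage data.
wide = [random.gauss(7000, 2500) for _ in range(10_000)]   # the "blue" case
narrow = [random.gauss(7000, 800) for _ in range(10_000)]  # the "yellow" case


def describe(values):
    """Return the basic summary a histogram would show at a glance."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


wide_summary = describe(wide)
narrow_summary = describe(narrow)

for name, summary in (("wide", wide_summary), ("narrow", narrow_summary)):
    print(name, {key: round(value) for key, value in summary.items()})
```

Both samples have roughly the same mean, but the standard deviation of the wide sample is about three times larger, which is the numeric counterpart of a flatter, wider histogram.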


source: sid-sharma1990.medium

Population vs Sample

If you furrowed your brow at the word sample above, we can help. To stay with our example of wages in Poland: in theory, the calculations would require data on every citizen of Poland, i.e. on the whole population.

In practice, this is not necessary. The laws of statistics tell us that if we study a sufficiently large sample of the population, the estimates from this sample will closely approximate the values for the whole population.

In short, the population is the full set of objects under study (people, firms, animals, etc.), and a sample is just a part of that population.

However, for the estimates to generalize, the sample should be as representative of the whole population as possible.

This means that the gathered data should resemble the population as closely as possible with respect to the characteristics that may affect the studied parameter.

For example, when calculating the average wage in Poland, it would be important to maintain the demographic balance in the sample: the proportions of age groups, genders, professions, cities, etc.
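The claim that a large enough sample approximates the population can be checked with a small simulation. This is only a sketch: the "population" below is 200,000 invented wages drawn from a lognormal distribution (the parameters 8.8 and 0.5 are assumptions), and the samples are simple random samples, which sidesteps the representativeness problem described above.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical "population" of wages; lognormal is a rough stand-in.
population = [random.lognormvariate(8.8, 0.5) for _ in range(200_000)]
population_mean = statistics.mean(population)

errors = {}
for size in (100, 10_000):
    sample = random.sample(population, size)  # simple random sample
    sample_mean = statistics.mean(sample)
    errors[size] = abs(sample_mean - population_mean)
    print(f"n={size}: sample mean {sample_mean:.0f}, "
          f"population mean {population_mean:.0f}")
```

With 10,000 respondents out of 200,000, the sample mean typically lands within about one percent of the population mean, while a sample of 100 can easily miss by several percent.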

Confidence Interval

When it comes to making claims about whether something is true or false, statisticians prefer to be very cautious and skeptical (masters of ambivalence).

When a hypothesis is tested, they never state that it is accepted.

A hypothesis can only be rejected or not rejected. Be cautious!

The same logic applies to calculations.

Having estimated the sample mean of wages in Poland, it is not correct to state that the value obtained is the mean of the whole population.

As we discussed, sample estimates only approximate the true population values. Thus, statisticians report a range that is likely to contain the true population value at a chosen confidence level (e.g. 95%), which they call a confidence interval.
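For the curious, a confidence interval for a mean can be sketched in a few lines. The example below assumes a normal approximation (reasonable for a few hundred observations) and uses 400 invented wages; `mean_confidence_interval` is a hypothetical helper, not a library function.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical sample of 400 monthly wages (PLN), invented for illustration.
sample = [random.gauss(7000, 2000) for _ in range(400)]


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    # z is about 1.96 for a 95% confidence level
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


low, high = mean_confidence_interval(sample)
low_99, high_99 = mean_confidence_interval(sample, confidence=0.99)
print(f"95% CI for the mean wage: {low:.0f} to {high:.0f}")
print(f"99% CI for the mean wage: {low_99:.0f} to {high_99:.0f}")
```

Note the trade-off: asking for more confidence (99% instead of 95%) makes the interval wider, not narrower.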

P-value

Expanding on the topic of hypotheses, you might have already found yourself wondering how to reach a meaningful conclusion.

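In stats, the decision is usually based on the p-value. Its definition goes as follows:

The p-value is the probability of obtaining results at least as extreme as those actually observed, assuming that the null hypothesis is true.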

Yes, we know how shady it sounds.

source: giphy.com//gifs/netflix

But an example may help you better understand the metaphysics of the p-value.

Imagine you are a detective investigating a tricky murder case in a house.

You catch a suspect red-handed at the crime scene with blood on their clothes and their fingerprints on the murder weapon.

These are the clues that make you think there's a good chance this person is the culprit.

source: giphy.com

However, as a good detective you follow the rule “innocent until proven guilty”, and you keep the null hypothesis that the suspect is innocent.

After having collected all the clues, you ask yourself a question.

What is the probability of observing all of these facts speaking against the suspect (or even more incriminating evidence), provided that the suspect is, in fact, innocent?

This probability is exactly the p-value.

In other words, if the suspect really did not kill anyone, how likely is it that they would still be caught at the scene with the weapon and blood on their hands, or with even more compelling evidence? Note that this is not the same as the probability that the suspect is innocent.

When the p-value is small, it indicates that the chances of seeing the observed results, if the null hypothesis were correct, are quite low.

And if the p-value falls below the significance level chosen in advance by the statistician (commonly 10%, 5% or 1%), the null hypothesis is rejected.

So, to reject the hypothesis that the suspect is not the murderer, we want a p-value close to 0.

A p-value of zero would indicate the highest possible statistical significance: there would be no chance of observing such data if the null hypothesis were true.
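A p-value can be computed by hand in simple cases. The sketch below swaps the detective story for a coin, which is our own illustrative example: the null hypothesis is that the coin is fair, and we observe 16 heads in 20 tosses. `binomial_p_value` is a hypothetical helper that adds up the probabilities of every outcome at least as extreme as the one observed.

```python
from math import comb


def binomial_p_value(heads, tosses, p_heads=0.5):
    """One-sided p-value: P(at least `heads` heads | null hypothesis)."""
    return sum(
        comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


p_value = binomial_p_value(heads=16, tosses=20)
alpha = 0.05  # significance level chosen before looking at the data

print(f"p-value = {p_value:.4f}")  # about 0.0059
print("reject the null" if p_value < alpha else "do not reject the null")
```

A fair coin would produce 16 or more heads in 20 tosses in fewer than 1% of experiments, so at the 5% significance level we reject the hypothesis that the coin is fair.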

source: memecrunch

Regression

So far we have covered terms from descriptive statistics and hypothesis testing, keeping it simple.

However, the real game starts when we wrangle the available data to make predictions.

Here, one seeks to predict an unobserved value from a set of known features, for example using education level, age, gender, and family status to predict the expected salary.

To better understand how regression works in detail, please see our previous article. The following terms relate specifically to regression.
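As a quick taste, the simplest regression (one predictor, fitted by ordinary least squares) fits in a few lines. The data below are invented, and `fit_line` is a hypothetical helper; a real analysis would use several features and a dedicated library.

```python
import statistics

# Hypothetical data: years of education and monthly salary (PLN).
education_years = [10, 12, 12, 14, 16, 16, 18, 20]
salary = [4200, 5000, 5300, 6100, 7000, 7400, 8300, 9100]


def fit_line(x, y):
    """Ordinary least squares for one predictor: y = intercept + slope * x."""
    x_mean, y_mean = statistics.mean(x), statistics.mean(y)
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


intercept, slope = fit_line(education_years, salary)
predicted = intercept + slope * 15  # expected salary for 15 years of education
print(f"salary = {intercept:.0f} + {slope:.0f} * years")
print(f"prediction for 15 years of education: {predicted:.0f}")
```

The slope reads as "each additional year of education is associated with this many more PLN per month" in this made-up dataset.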

Coefficient of Determination (R-squared)

Once you've constructed a regression model, it's crucial to assess how effectively it captures and represents the data.

One of the most commonly used metrics for this purpose is the R-squared.

source: makeameme

R-squared quantifies how well the combination of predictor variables (e.g. age, gender) explains the variation in the dependent variable (the variable to be predicted, e.g. salary).

In other words, it measures the proportion of the outcome's variability that can be accounted for by the predictor variables.

For instance, consider two scenarios illustrated in the plots below. In the plot on the right, R-squared reaches a higher value as the data points closely align with the regression line, indicating a better fit of the model. In contrast, the plot on the left shows a similar linear trend, but the data points are widely scattered around the line, so R-squared is lower.

When R-squared equals 1, the line fits the dots perfectly and the model explains all of the variation in the data. R-squared is primarily influenced by two key factors:

  1. The Model Itself: The complexity of the model matters. A simpler model with fewer parameters might not fit the data as closely as a more complex one. For instance, if you initially predict wages using only age and gender, and then add another important factor like education level, the model explains more of the variation, and R-squared increases accordingly.
  2. The Data: Real-world data often doesn't strictly follow a linear relationship; there's always some inherent randomness (as in the left plot). Consequently, adding more parameters doesn't always enhance the model significantly.

Therefore, in evaluating the goodness of fit, it's crucial to strike a balance between model complexity and the risk of overfitting.

It's essential to understand the uniqueness of the data and ensure that the model reflects real-world phenomena rather than aiming solely for the highest possible R-squared value.
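To see how noise alone drives R-squared, here is a minimal sketch. It computes R-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares, for two invented datasets that share the same linear trend but differ in how much randomness surrounds it. The numbers and the `r_squared` helper are assumptions made for illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible


def r_squared(x, y):
    """R^2 = 1 - SS_residual / SS_total for a one-predictor OLS line."""
    x_mean, y_mean = statistics.mean(x), statistics.mean(y)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
        (xi - x_mean) ** 2 for xi in x
    )
    intercept = y_mean - slope * x_mean
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_residual / ss_total


age = [random.uniform(20, 60) for _ in range(500)]

# Same underlying trend, different amounts of noise (hypothetical numbers).
tight = [3000 + 100 * a + random.gauss(0, 300) for a in age]
noisy = [3000 + 100 * a + random.gauss(0, 3000) for a in age]

r2_tight = r_squared(age, tight)
r2_noisy = r_squared(age, noisy)
print(f"R^2 with little noise: {r2_tight:.2f}")
print(f"R^2 with a lot of noise: {r2_noisy:.2f}")
```

The model is identical in both cases; only the data changed. That is why a low R-squared does not automatically mean a badly built model, and a high one does not prove a good one.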

source: lancaster.ac.uk

Autocorrelation

For a regression model to be reliable, several important assumptions must be satisfied. One of them is the independence of the observations used for modeling (no autocorrelation).

In simpler terms, think about a situation where we're studying whether people prefer to live in Warsaw or Krakow. It might be the case that our sample includes members of the same family.

In that case, if one family member prefers to live in a particular city, it's quite likely that another family member has the same preference, since they would like to live in the same place.

Consequently, when one respondent's choice influences the choice of another respondent in the dataset, it creates a connection (correlation) between these observations. This means that if our prediction has a big error for one family member, it will probably have a similar error for another family member. This leads to a consistent pattern of errors in the model's results, which we call autocorrelation.

Because of this, there is a risk that our model won't be as good at predicting new data, and our estimates of how things are related might not be accurate. So, it's essential to consider and address autocorrelation when working with this kind of data.
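One common way to spot such a pattern is the Durbin-Watson statistic, which compares each residual with the previous one. The sketch below assumes the residuals come in a meaningful order (for example over time, or with related respondents listed next to each other), which goes beyond the family example above. The residuals are simulated rather than taken from a real model.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible


def durbin_watson(residuals):
    """Durbin-Watson statistic: near 2 means no lag-1 autocorrelation,
    values well below 2 suggest positive autocorrelation."""
    numerator = sum(
        (current - previous) ** 2
        for previous, current in zip(residuals, residuals[1:])
    )
    denominator = sum(r ** 2 for r in residuals)
    return numerator / denominator


# Independent residuals: each error is drawn on its own.
independent = [random.gauss(0, 1) for _ in range(2_000)]

# Correlated residuals: each error carries over 80% of the previous one.
correlated = [random.gauss(0, 1)]
for _ in range(1_999):
    correlated.append(0.8 * correlated[-1] + random.gauss(0, 1))

dw_independent = durbin_watson(independent)
dw_correlated = durbin_watson(correlated)
print(f"independent residuals: {dw_independent:.2f}")
print(f"correlated residuals: {dw_correlated:.2f}")
```

A value close to 2 is reassuring; a value far below 2, as in the second case, says that neighbouring errors move together.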

source: meme-arsenal

Homoscedasticity

Homoscedasticity is one of the assumptions we make when using a linear regression model.

Since our predicted values are approximations and not perfect, there are always errors in our predictions for each data point. These errors, called residuals, should be evenly scattered across the range of predicted values.

For instance, in the context of predicting salaries, it means that the prediction errors (residuals) should not grow as salaries increase, as shown in the left plot. When they do grow, as in the plot on the right, it's called heteroscedasticity, and it violates the assumption that the variability of the errors stays the same across all observations.

In simpler terms, it means our model's errors should be consistent, not getting bigger or smaller as we look at different data points.
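A rough way to check this without a plot is to compare how much the residuals vary for low and for high predicted values. The sketch below is an informal check under our own assumptions, not a formal statistical test: `spread_ratio` is a hypothetical helper, and both sets of residuals are simulated.

```python
import random
import statistics

random.seed(5)  # fixed seed so the sketch is reproducible


def spread_ratio(predicted, residuals):
    """Residual variance in the upper half of predictions divided by the
    variance in the lower half. Values near 1 suggest homoscedasticity."""
    pairs = sorted(zip(predicted, residuals))  # order by predicted value
    half = len(pairs) // 2
    lower = [residual for _, residual in pairs[:half]]
    upper = [residual for _, residual in pairs[half:]]
    return statistics.variance(upper) / statistics.variance(lower)


# Hypothetical predicted salaries (PLN), invented for illustration.
predicted_salary = [random.uniform(3000, 15000) for _ in range(2_000)]

# Homoscedastic: the error spread is the same at every salary level.
constant_errors = [random.gauss(0, 500) for _ in predicted_salary]
# Heteroscedastic: the error spread grows with the predicted salary.
growing_errors = [random.gauss(0, 0.1 * p) for p in predicted_salary]

ratio_constant = spread_ratio(predicted_salary, constant_errors)
ratio_growing = spread_ratio(predicted_salary, growing_errors)
print(f"constant spread: ratio {ratio_constant:.2f}")
print(f"growing spread: ratio {ratio_growing:.2f}")
```

A ratio near 1 matches the left plot, where errors look the same everywhere; a ratio well above 1 matches the right plot, where errors fan out as salaries rise.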

Conclusion

Summing up, statisticians are the true detectives of data.

They have a unique way of looking at the world, governed by healthy skepticism and brutal caution. Being a statistician means truly understanding that real-world data is complex and messy, and that uncovering meaningful patterns and relationships requires double-checking assumptions and verifying hypotheses.

Statistics is honorable. Statistics stands by its word.


