What's What?
Statistics in high school never made it onto my list of favorite subjects.
So when I see how interested I am in it now, I find it really amusing. A big thanks to my professor at Deakin, who actually made me fall in love with the subject. He took a practical-first approach, with examples before theory.
For example, he would start with something like:
“A company says, ‘We are 95% confident that our product is loved by our customers.’ So, where does the confidence come from?”
And that would make me really curious: “Yes, where did the confidence come from? And would 100% confidence be possible as well?” That opening line kept me curious throughout the lecture as I looked for my answers.
Realising this, I will try to keep my statistics article series as free from theory as possible, and easy and practical enough to hold your attention. (I will, however, mention the technical terms in brackets so the information stays properly structured.)
And why statistics? Why not Python, SQL, Excel, or other tools? Because they are just TOOLS! To apply a tool, you need to know where and how to apply it. Statistics and maths are the knowledge you apply to real-life problems using tools like Excel, Python, and SQL, and you communicate the findings to people using Power BI or Tableau. So, our first step will be to get the required knowledge: THE FOUNDATION.
The Basics:
Problem: Out of 600 students in my class, I want to find out how many students regularly attend the stats class.
Approaches I can take (Sources of Data):
I had to note down the attendance secretly in the first method because, in direct observation, the subjects of interest need to be in their natural environment, unaware that they are being watched. A focus group, however, lets the subjects know they are being observed. An experiment is conducted to see whether an intervention (a reward or a penalty) affects the results. In the example above, the students who were exposed to the $100 offer are the test group, and those who weren’t are the control group. By comparing the attendance of the two groups, we can see whether the incentive worked. The last one is the survey, which can be administered using various online and offline techniques.
However, monitoring 600 students, learning all their names, and asking them individually would take me forever. This is the limitation of primary data sources: although they offer high accuracy and are specific to our problem, primary data collection can often be time-consuming, tiring, and costly.
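The test-versus-control comparison from the experiment can be sketched in a few lines of Python. This is a minimal illustration only: the attendance numbers and the names `test_group` and `control_group` are invented, and the classes are assumed to be counted out of 12.

```python
from statistics import mean

# Hypothetical classes attended (out of an assumed 12); all numbers are invented.
test_group = [10, 11, 9, 12, 8]     # students who were offered the $100 incentive
control_group = [8, 7, 9, 10, 6]    # students who were not offered anything

# Compare the average attendance of the two groups.
difference = mean(test_group) - mean(control_group)

print("Test group average:   ", mean(test_group))
print("Control group average:", mean(control_group))
print("Difference:           ", difference)
```

A gap in the averages is only a first look. With groups this small, it could easily be down to chance.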
The other option I have is to ask my professor to let me have a look at his attendance sheet. Simple as that!
The attendance sheet might use the professor’s own marking code, such as ticks for those who attended the class and crosses for those who didn’t. I would then have to go all the way back and count the ticks and crosses for each student to find my answer.
This is the problem with secondary data: it might not be tailored to our needs, and further effort might be required. However, that is still better than having to note down each student’s attendance every day myself.
Problem: Which gender (male or female) attends more stats classes?
The professor decided to give me access to the stats student database after I promised to use it ethically. The amount of data in the database was overwhelming. I noticed the following:
The main way to distinguish qualitative data from quantitative data is to try mathematical operations on it. Adding students’ names or subtracting genders will never make sense. Meanwhile, the average age of students in the class or their average attendance score can be meaningfully interpreted.
The average number of males or females in the class might make gender look like quantitative data, but there we are using the “count of male” or “count of female” rather than the values “male” or “female” themselves. Converting qualitative data into numerical data so it can be used properly is a common practice in data analysis called encoding (also coding or labeling).
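To make the difference concrete, here is a small Python sketch. The student records, the field names, and the 0/1 codes are all made up for illustration; the numbers in an encoding are just labels, and their order carries no meaning.

```python
from collections import Counter

# Hypothetical student records, invented for illustration.
students = [
    {"name": "Asha", "gender": "female", "age": 21},
    {"name": "Ben", "gender": "male", "age": 23},
    {"name": "Chen", "gender": "male", "age": 22},
    {"name": "Dina", "gender": "female", "age": 24},
    {"name": "Eli", "gender": "male", "age": 20},
]

# Quantitative data: arithmetic on age gives a meaningful result.
average_age = sum(s["age"] for s in students) / len(students)

# Qualitative data: "male" + "female" means nothing, but counting each value does.
gender_counts = Counter(s["gender"] for s in students)

# Encoding: replace each category with a number (assumed labels, no ordering implied).
gender_codes = {"female": 0, "male": 1}
encoded = [gender_codes[s["gender"]] for s in students]

print(average_age)      # 22.0
print(gender_counts)
print(encoded)          # [0, 1, 1, 0, 1]
```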
So, you must be wondering: what kinds of mathematical operations can I perform on each kind of data? How can I measure them? Let’s look at them individually (levels of measurement):
Problem: To have a proper understanding of the attendance, I cannot make a judgement by simply looking at a single day’s data. Hence, I can do the following:
Time series data, as in the example above, measures the same variable (attendance in the stats subject) over a long period of time (5 years). This kind of data helps us identify trends and seasonal patterns and make forecasts. Cross-sectional data, by contrast, is collected at a single point in time (the end of a semester) across multiple entities (male and female). This kind of data helps us make comparisons or set benchmarks.
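The two shapes of data can be sketched side by side in Python. The years, percentages, and variable names below are assumptions invented for illustration, not real attendance figures.

```python
# Time series: one variable (attendance %), one entity, many points in time.
# Hypothetical values for five years.
attendance_by_year = {2019: 62, 2020: 58, 2021: 65, 2022: 70, 2023: 74}

# Cross-sectional: one point in time (end of a semester), several entities.
# Hypothetical values for two groups.
attendance_end_of_semester = {"male": 68, "female": 79}

# A time series question: how did attendance change from one year to the next?
years = sorted(attendance_by_year)
yearly_change = {
    later: attendance_by_year[later] - attendance_by_year[earlier]
    for earlier, later in zip(years, years[1:])
}

# A cross-sectional question: which group attended more at this point in time?
highest_group = max(attendance_end_of_semester, key=attendance_end_of_semester.get)

print(yearly_change)    # {2020: -4, 2021: 7, 2022: 5, 2023: 4}
print(highest_group)    # female
```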
Problem: Finally, I have got hold of the data I wanted to understand. What next? Where do I use it in real life?
Yes, collecting and understanding data is the fundamental step, but our main purpose is to extract meaningful information from it and put it to use. Here is how I can use the data I obtained:
It could be something like finding the average number of male and female attendees in the stats class, finding the average age of students attending the class, or showing the male and female counts using bar charts. The main purpose is to summarize a large amount of data into a simple picture. This might lose some information, but it gives a basic overview of how the data is structured for further analysis.
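A rough Python sketch of this kind of summary, using only the standard library. The records are hypothetical (gender plus classes attended out of an assumed 12), and the "bar chart" is just text so that no plotting library is needed.

```python
from statistics import mean, median

# Hypothetical records: (gender, classes attended out of 12). Invented data.
records = [
    ("female", 11), ("male", 7), ("female", 9), ("male", 12),
    ("male", 5), ("female", 10), ("male", 8), ("female", 12),
]

# Overall summary of attendance.
attended = [n for _, n in records]
overall_mean = mean(attended)
overall_median = median(attended)

# Group the attendance values by gender, then average each group.
by_gender = {}
for gender, n in records:
    by_gender.setdefault(gender, []).append(n)
group_means = {gender: mean(values) for gender, values in by_gender.items()}

print("Average classes attended:", overall_mean)
print("Median classes attended: ", overall_median)
print("Average by gender:       ", group_means)

# A text-only bar chart of how many students are in each group.
for gender, values in by_gender.items():
    print(f"{gender:<7} {'#' * len(values)} ({len(values)})")
```

Eight rows collapse into a handful of numbers here, which is exactly the loss of detail mentioned above.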
It is understandable that when describing data (descriptive statistics) or making conclusions or predictions from it (inferential statistics), we want to include all the data being analysed. I mean, to find out the attendance status of all 600 students in my stats class, I would want to collect data on all 600 students. However, this might not always be possible. What if I want to calculate the attendance status of the entire university, with more than 10,000 students? What if I want to predict the attendance of students who will be joining the university in 2025? It is not possible to have every single piece of data at my fingertips! This is where statistics plays a major role!
The beauty of statistics is that it lets us answer our questions about the entire data we are interested in (the population) using a handful of representative data (a sample). We can simply select a small number of students and work with them to draw conclusions about the entire population. This approach is never 100% accurate, because we are not including the entire population of interest. That is the tradeoff: the more accuracy we want, the larger the sample size should be, and the more confident we can be in our conclusion. However, given limited time and resources, we can trade a little accuracy for ease.
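Here is a small simulation of that tradeoff in Python. Everything in it is an assumption for illustration: the population is 600 randomly generated attendance counts (0 to 12 classes), and the samples are drawn purely at random.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: classes attended (0 to 12) by each of 600 students.
population = [rng.randint(0, 12) for _ in range(600)]
population_mean = mean(population)

# Draw random samples of increasing size and compare each with the full class.
for size in (10, 50, 200, 600):
    sample = rng.sample(population, size)
    error = abs(mean(sample) - population_mean)
    print(f"n={size:>3}  sample mean={mean(sample):.2f}  error={error:.2f}")
```

On any single run a small sample can get lucky, but larger samples tend to land closer to the population average. A "sample" of all 600 students matches it exactly.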
But how do I select the students to generalize about the attendance of my entire class? If I select only the ones who always attend, my conclusion will be that all students always attend the stats class, which is not true, and vice versa!
So, there are different methods for doing this: