What's What?

Statistics in high school never made it onto the list of my favorite subjects.

So when I see how interested I am in it now, I find it really amusing. A big thanks to my professor at Deakin, who actually made me fall in love with the subject. He took a practice-first approach, before the theory.

E.g. He would start with something like,

“A company says we are 95% confident that our product is loved by our customers. So, where does the confidence come from?”

And then it would really make me curious, “yes, where did the confidence come from? Then would 100% confidence be possible as well?” The opening line kept me curious through the lecture to find my answers.

Realising this, I will try to keep my statistics article series as free from theory as possible, and easy and practical enough to hold your attention. (I will, however, mention the technical terms in brackets so that the information stays properly structured.)

And why statistics? Why not Python, SQL, Excel, or other tools? Because they are just TOOLS! And to apply the tools, you need to know where and how to apply them! Statistics and maths are the knowledge you apply to real-life problems using tools like Excel, Python and SQL, and you communicate your findings to people using Power BI or Tableau. So, our first step will be to get the required knowledge! THE FOUNDATION.

The Basics:

Problem: Out of 600 students in my class, I want to find out how many students regularly attend the stats class.

Approaches I can take (Sources of Data):

  1. Work hard myself to find the answer (Primary data)
     - I can secretly note down the attendance of students in my class. (Direct observation)
     - I can ask a bunch of students to stay after class and ask them how often they attend the class. (Focus group)
     - I can announce a $100 reward to a few students (test group) who attend stats class 5 days a week and note the attendance of these and the remaining (control group) students. (Experiment)
     - I can prepare a Google Form asking for Student ID and attendance habits and ask every individual to fill it in. (Survey)

I had to note down the attendance secretly in the first method because, in direct observation, the subjects of interest need to be in their natural environment, unaware that they are being watched. In a focus group, however, the subjects know they are being observed. An experiment is conducted to see whether any kind of intervention (rewards or penalties) affects our results. In the example above, the students who were offered the $100 reward are the test group and those who weren't are the control group. By comparing the attendance of the two groups, we can see whether the incentive worked. The last one is the survey, which can be administered through various online and offline techniques.
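The test-versus-control comparison can be sketched in a few lines of Python. The attendance figures below are made up purely for illustration, and `average_attendance` is a hypothetical helper, not something from a real study.

```python
from statistics import mean

# Hypothetical data: days attended (out of 5) in one week, per student.
test_group = [5, 5, 4, 5, 3, 5]      # students offered the $100 reward
control_group = [3, 4, 2, 5, 3, 1]   # students not offered the reward

def average_attendance(days_attended):
    """Return the mean number of days attended per student."""
    return mean(days_attended)

test_avg = average_attendance(test_group)
control_avg = average_attendance(control_group)
difference = test_avg - control_avg

print(f"Test group average:    {test_avg:.2f} days")
print(f"Control group average: {control_avg:.2f} days")
print(f"Difference:            {difference:.2f} days")
```

A positive difference only hints that the incentive may have worked. With groups this small, the gap could also be down to chance.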

However, monitoring 600 students, learning all their names and asking them individually would take me forever. This is the limitation of primary data sources. Although they offer high accuracy and are more specific to our problem, primary data collection can often be time-consuming, tiring and costly.

  2. Ask my professor for the answer (Secondary data)

The other option I have is to request my professor to let me have a look at his attendance sheet. Simple as that!

The attendance sheet might use the professor's own marking code, such as ticks for those who attended the class and crosses for those who didn't. I would then have to go all the way back and count the ticks and crosses for each student to find my answer.

This is the problem with secondary data. It might not be tailored to our specific need, and further effort might be required. However, it is still better than having to note down the daily attendance of each student myself.
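As a rough sketch of that extra effort, here is how the tick-and-cross counting might look once the sheet has been typed up. Everything here is an assumption for illustration: the student IDs, the use of "P" for a tick and "A" for a cross, and the 80% cut-off for "regular" attendance.

```python
# Hypothetical attendance sheet: one string of marks per student,
# "P" (tick, present) or "A" (cross, absent) for each of five classes.
attendance_sheet = {
    "S001": "PPAPP",
    "S002": "AAPAA",
    "S003": "PPPPP",
}

def attendance_rate(marks, present="P"):
    """Return the share of classes attended (0.0 to 1.0)."""
    return marks.count(present) / len(marks)

# Assumption: "regular" means attending at least 80% of classes.
REGULAR_THRESHOLD = 0.8

regular_students = [
    student_id
    for student_id, marks in attendance_sheet.items()
    if attendance_rate(marks) >= REGULAR_THRESHOLD
]
print(regular_students)
```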

Problem: Students of which gender (male or female) attend more stats classes?

The professor decided to give me access to the stats student database after I promised to use it ethically. The data in the database was overwhelming. I could notice the following:

  1. Name of student, Gender, Stats subject rating, Stats teacher rating (Qualitative data)
  2. Student’s age, height, First assessment score, Attendance score (Quantitative data)

The main way to distinguish qualitative data from quantitative data is by using mathematical operations. Adding students' names or subtracting genders will never make sense. Meanwhile, the average age of students in the class or their average attendance score can be meaningfully interpreted.

The number of male or female students in the class might make gender look like quantitative data, but there we are using the “count of male” or “count of female” rather than the values “male” or “female” themselves. Converting qualitative data into numerical data for proper use is a common practice in data analysis called encoding/coding/labeling.
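Both ideas, counting a qualitative column and encoding it as numbers, can be sketched with the Python standard library. The `genders` list is hypothetical sample data, and the integer codes are one arbitrary choice among many possible encodings.

```python
from collections import Counter

# Hypothetical gender column from the student database.
genders = ["female", "male", "female", "female", "male", "female"]

# Counting turns the labels into numbers we can work with.
gender_counts = Counter(genders)
print(gender_counts)

# A simple label encoding: map each category to an integer code.
# The codes are arbitrary identifiers, not quantities.
categories = sorted(set(genders))
label_codes = {label: code for code, label in enumerate(categories)}
encoded = [label_codes[g] for g in genders]
print(label_codes)
print(encoded)
```

Note that the encoded values are still labels in disguise: averaging the codes 0 and 1 only makes sense as "the share of students coded 1", not as an average gender.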

So, you must be wondering what kinds of mathematical operations you can perform on each type of data. How can you measure them? Let’s have a look at them individually: (Level of measurement)

  1. Name, Gender: They simply represent labels, with no values, order or ranks (Nominal level). We can measure them by counting their frequencies, e.g. there are 400 females and 200 males in the stats class.
  2. Stats subject rating, Stats teacher rating: Students rate the stats course and teacher on a 5-star scale. Students who gave 4 or 5 stars like the subject and teacher more than the students who gave 1 or 2 stars. Hence, the ranking from 1 star to 5 stars is meaningful (Ordinal level). But we cannot infer that a student giving 5 stars likes the subject and teacher exactly 5 times more than a student giving 1 star, since this is a personal interpretation. Hence, we can measure these kinds of data using median, mode and ranking.
  3. Student’s age, height: We can measure age and height using frequencies as well as median and mode, but a more meaningful interpretation could come if we classified them into different groups. (Interval level) E.g. we can identify how many students are below 25 years, how many are between 25 and 30, and how many are above 30 in our class. We can then perform any of the calculations mentioned above, and also check how much each age deviates from the mean age, etc. One interesting thing to note here is that age 0 doesn’t mean that a person doesn’t exist at all (it could be a 4-month-old infant); there is no true 0 at the interval level. (Note that many textbooks treat age and height as ratio-level data, since 0 years or 0 cm is still a meaningful zero point; the classic interval-level example is temperature in Celsius, where 0°C does not mean “no temperature”.)
  4. First assessment score, Attendance score: We can perform any mathematical operation on this data. This is the highest level of measurement (Ratio level). Unlike the ordinal level, ratios between values are meaningful here, e.g. a student who scored 50 actually performed 5 times better than a student who scored 10. Also, unlike the interval level, a true 0 can exist at this level, e.g. a student who scored 0 actually scored nothing in the exam.
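To make the link between level of measurement and allowed operations concrete, here is a small sketch using Python's `statistics` module. The five student records are hypothetical, and the example covers only the nominal, ordinal and ratio levels.

```python
from statistics import mean, median, mode

# Hypothetical records for five students.
genders = ["female", "male", "female", "female", "male"]  # nominal
teacher_ratings = [5, 3, 4, 4, 1]                         # ordinal (1 to 5 stars)
assessment_scores = [50, 10, 35, 40, 0]                   # ratio

# Nominal: only counting and the mode are meaningful.
most_common_gender = mode(genders)
print("Most common gender:", most_common_gender)

# Ordinal: order matters, so the median is meaningful too.
median_rating = median(teacher_ratings)
print("Median rating:", median_rating)

# Ratio: every arithmetic operation is meaningful, including ratios.
mean_score = mean(assessment_scores)
score_ratio = assessment_scores[0] / assessment_scores[1]
print("Mean score:", mean_score)
print("First score is", score_ratio, "times the second")
```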

Problem: To have a proper understanding of the attendance, I cannot make a judgement by simply looking at a single day’s data. Hence, I can do the following:

  1. Look at the trend of attendance in the stats subject over the past 5 years (Time series data)
  2. Look at the total attendance of male and female students at the end of the semester (Cross-sectional data)

Time series data, as in the example above, measure the same variable (attendance in the stats subject) over a long period of time (5 years). Such data help us identify trends and seasonal patterns, and make forecasts over time. Similarly, cross-sectional data are collected at the same point in time (end of the semester) across multiple entities (male and female). This kind of data helps us make comparisons or set benchmarks.
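The difference between the two shapes of data can be sketched as follows. All numbers, including the years, are invented for illustration.

```python
# Hypothetical time series: one variable (average attendance rate)
# measured repeatedly over five years.
attendance_by_year = {2020: 0.62, 2021: 0.65, 2022: 0.70, 2023: 0.68, 2024: 0.74}

# Year-over-year change highlights the trend.
years = sorted(attendance_by_year)
yearly_change = {
    year: round(attendance_by_year[year] - attendance_by_year[prev], 2)
    for prev, year in zip(years, years[1:])
}
print(yearly_change)

# Hypothetical cross-sectional data: several groups measured at one point in time.
attendance_end_of_semester = {"male": 1450, "female": 3100}
highest = max(attendance_end_of_semester, key=attendance_end_of_semester.get)
print("Group with the most attendances:", highest)
```

The time series answers "how is attendance changing?", while the cross-section answers "how do the groups compare right now?".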

Problem: Finally, I get hold of the data I wanted to understand, what next? Where do I use it in real life?

Yes, collecting and understanding data is the fundamental step, but our main purpose is to extract meaningful information from it and use it for greater purposes. Here is how I can put the data I obtained to better use:

  1. I can use the data to explain and summarize it just as it is, without giving my opinion so that people get a basic understanding of what is being talked about in a meaningful way. (Descriptive statistics)

It could be something like finding the average number of male and female attendees in the stats class, the average age of students attending the class, showing the male and female counts in the class using bar charts, etc. The main purpose is to summarize a large amount of data into a simple picture. This might lead to some loss of information but gives a basic overview of how the data is structured for further analysis.
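A minimal sketch of such a descriptive summary, assuming a handful of made-up student records and using a text-only bar chart in place of a real plot:

```python
from collections import Counter
from statistics import mean

# Hypothetical student records: (gender, age, classes attended).
students = [
    ("female", 22, 18),
    ("male", 27, 12),
    ("female", 31, 20),
    ("male", 24, 15),
    ("female", 26, 17),
]

gender_counts = Counter(gender for gender, _, _ in students)
average_age = mean(age for _, age, _ in students)
average_attendance = mean(attended for _, _, attended in students)

# A minimal text bar chart of the gender counts.
for gender, count in sorted(gender_counts.items()):
    print(f"{gender:<7} {'#' * count} ({count})")
print("Average age:", average_age)
print("Average classes attended:", average_attendance)
```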

  2. I can use the data to predict student attendance in the future and suggest that the professor plan the capacity accordingly. Or I can claim that students of a certain gender usually attend class more in all subjects. Or I can conclude that age or gender affects students' attendance in stats class. (Inferential statistics)


[Figure: Stats basic concepts summarized (Design created using ChatGPT 4o)]

It is understandable that when describing data (descriptive statistics) or making conclusions or predictions from it (inferential statistics), we want to include all the data being analysed. I mean, to find out the attendance status of all 600 students in my stats class, I would actually want to collect data on all 600 students. However, this might not always be possible. What if I want to calculate the attendance status of the entire university with more than 10,000 students? What if I want to predict the attendance of students who will be joining the university in 2025? It is not possible for me to have every single piece of data at my fingertips! This is where statistics plays the major role!

It’s the beauty of statistics that it allows us to answer our questions with a handful of representative data (sample) and generalize to the entire data (population) we are interested in. We can simply select a small number of students and work on them to draw conclusions about the entire population. This approach definitely isn’t 100% correct, as we are not including the entire population of interest. Hence the tradeoff. The more accuracy we want, the larger the sample size should be, and the more confident we will be about our conclusion. However, given the limitations of time and resources, we can trade off a little accuracy for ease.
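The tradeoff between sample size and accuracy can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is randomly generated rather than real attendance data, and `average_sampling_error` is a hypothetical helper written for this example.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 students, each with an attendance rate
# between 0 and 1. In real life we would not have this full list.
population = [random.random() for _ in range(10_000)]
true_mean = mean(population)

def average_sampling_error(sample_size, repeats=200):
    """Average absolute gap between a sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        sample = random.sample(population, sample_size)
        errors.append(abs(mean(sample) - true_mean))
    return mean(errors)

for size in (10, 100, 1000):
    print(size, round(average_sampling_error(size), 4))
```

Running it should show the average error shrinking as the sample grows, which is the accuracy-versus-effort tradeoff described above.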

But how do I select the students to generalize the attendance of my entire class? If I select only the ones who always attend the class, my conclusion will be that all the students always attend the stats class, which will not be true, and vice versa!

So, we have different methods for doing it:

  1. Select students based on availability or willingness to participate (non-probability sampling)
  2. Select students randomly where anyone could be selected (probability sampling)
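The two approaches can be sketched side by side. The student IDs and the sample size of 30 are assumptions for illustration, and "the first 30 on the list" stands in for whoever happened to be available.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical class list of 600 student IDs.
student_ids = [f"S{n:03d}" for n in range(1, 601)]

# Probability sampling (simple random sampling): every student has
# the same chance of being selected.
random_sample = random.sample(student_ids, 30)

# Non-probability sampling (convenience sampling): here we simply take
# the first 30 students on the list, e.g. those who happened to be available.
convenience_sample = student_ids[:30]

print(random_sample[:5])
print(convenience_sample[:5])
```

The convenience sample is easier to collect, but it can be biased if the students who are available differ from the rest of the class.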

