Path to Data science - Zero to Hero Series 1 - Week1
RAHUL Muralidharan
Data science enthusiast | IIM Calcutta | IIT Patna | Data governance | Communication and Trade Surveillance
Let’s start from zero: Week 1
5 Important Statistics Topics for Data Scientists:
Statistics is a form of mathematical analysis that uses quantified models and representations for a given set of experimental data or real-life studies. The main advantage of statistics is that it presents information in an easily understandable form. Recently, I reviewed all my statistics materials and organized them into five basic concepts for becoming a data scientist:
1. PROBABILITY DISTRIBUTIONS
2. HYPOTHESIS TESTING
3. ANALYSIS OF VARIANCE (ANOVA)
4. REGRESSION ANALYSIS
5. DIMENSIONALITY REDUCTION TECHNIQUES
This week, we're diving headfirst into the fascinating world of data distribution—the cornerstone of understanding the spread of data! Together, we'll navigate through histograms, probability density functions, and so much more, all in pursuit of gaining deeper insights into our data.
But before stepping into the concept, what are your thoughts on inference?
WHAT IS INFERENCE?
Inference is the process of drawing reasoned but risky conclusions from empirical evidence. I enter a room and see a person with a smoking gun in their hand beside a body with gunshot wounds. I could infer that the person holding the gun had shot the other one. Other evidence might be relevant to my conclusion. I might know the two people were mortal enemies. I might just have heard a gunshot. However, my inference and the conclusion I draw from it would be a risky one because there is only some probability that it is correct. Perhaps the body is a suicide, and the person now holding the gun had been desperately trying to wrestle it from the victim. I did not witness the shot directly. Even if I had, I would still need to be sure that it was not some visual trick or illusion and that everything was indeed as it seemed on the surface.
This is the situation we face with most evidence. Many of the processes we try to understand are invisible. We cannot ‘see’ class, ethnic discrimination, economic growth or the rise of populism directly – if we could, there would be little need for social science – rather, we can collect evidence about the results of these processes and build models of what we think may be happening to produce that evidence. That is what scientific inference comprises.
It is helpful to think of three categories of inference:
1. Informal inference, which is something intuitive we do all the time;
2. Scientific inference, which is a set of rules laid down to minimize the risk of drawing unsound conclusions;
3. Statistical inference, which is the part of scientific inference that deals with generalizing evidence taken from samples to the whole populations from which these samples have been drawn.
Almost all the evidence we ever work with is a sample of some kind, so statistical inference is a fundamental part of the scientific method.
Cognitive illusions
Cognitive illusions, as described by Kahneman (2011), are systematic biases that affect our everyday reasoning processes, akin to visual illusions that trick our vision. These biases lead us to rely too heavily on easily available evidence, substitute difficult questions with easier ones, avoid numerical calculations, see patterns where none exist, and exhibit confirmation bias by favoring evidence that supports our existing beliefs. Such cognitive tendencies have historically given rise to various fanciful beliefs and misconceptions in societies.
To counteract these biases, scientific inference is employed, which emphasizes basing conclusions on empirical evidence and testing theories against this evidence. Unlike informal inference, scientific inference minimizes individual biases by using evidence in a controlled manner. However, even scientific inference acknowledges the provisional nature of knowledge, as new evidence may lead to revisions or improvements in existing theories.
Statistical inference plays a crucial role in the social sciences, where data are often obtained from samples rather than entire populations. This approach allows researchers to generalize findings from samples to larger populations or even to future scenarios. Statistical inference involves estimating population characteristics based on sample data or testing hypotheses about population parameters.
Overall, cognitive illusions highlight the need for rigorous methods such as scientific and statistical inference to mitigate biases and ensure more accurate understanding and interpretation of empirical evidence.
Pre-statistical era
Before the advent of statistics, our understanding of the world was severely limited. It wasn't until around a century ago that statistical inference emerged, allowing us to glean insights about vast populations from remarkably small samples of data. Without this crucial development, our ability to comprehend and manage various aspects of the world, from medicine to economics to societal structures, would have been greatly hindered.
The transition to a statistical worldview began in the 19th century, driven by the recognition of the need for quantitative information to effectively govern societies. Population censuses became essential tools for states, albeit cumbersome and costly. Throughout the 20th century, the refinement of sample surveys furthered our ability to collect data, although comprehensive surveys of social and political conditions were sparse before the 1970s.
The pre-statistics era was marked by static conditions, where knowledge was largely localized and change occurred incrementally. However, the rise of scientific thinking and logic fundamentally transformed this landscape, paving the way for advancements such as industrialization and economic progress. Understanding scientific logic, including statistical inference, has become indispensable for comprehending the modern world, as it enables us to draw meaningful conclusions from limited data and grasp the intricacies of complex systems.
Let’s start with PROBABILITY DISTRIBUTIONS. But before getting into it, let’s understand what probability is.
WHAT IS PROBABILITY?
Probability deals with estimating the likelihood of events or outcomes that may or may not occur, or processes that involve uncertainty or randomness. Whether it's predicting the weather, estimating the number of people in a crowd, contemplating the existence of intelligent life elsewhere in the universe, or deciding whether to wait for the bus or walk, probability is at play. It encompasses both our subjective beliefs and objective uncertainties about the world.
Randomness can stem from our incomplete knowledge, such as when dealing cards from a shuffled deck, or it may be inherent in the process itself, like when spinning a roulette wheel or rolling dice. Understanding probability involves acknowledging and quantifying this randomness.
It's essential to distinguish probabilistic insights, where we estimate the likelihood of events before they occur, from hindsight, where events have already unfolded and their outcomes are fixed. Probability provides a framework for making informed decisions in uncertain situations and is fundamental to various fields, from science to everyday decision-making.
Thinking about the world in terms of probabilities rather than causal narratives requires effort and practice, as our brains aren't well-equipped to handle randomness. A simple thought experiment, like choosing a number at random between 1 and 10, highlights our inherent difficulty with randomness. Despite our efforts, certain numbers like 7 are often chosen more frequently due to their cultural associations, indicating our inability to truly select random options.
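A truly random chooser, unlike a human, has no favourite number. The sketch below (a minimal simulation, with an arbitrary seed for reproducibility) shows that a uniform random pick between 1 and 10 lands on each number roughly 10% of the time:

```python
import random
from collections import Counter

# A fair random chooser has no cultural bias towards 7: over many
# trials, each value from 1 to 10 appears close to 10% of the time.
random.seed(42)  # fixed seed so the run is reproducible
picks = [random.randint(1, 10) for _ in range(10_000)]
counts = Counter(picks)

for number in range(1, 11):
    share = counts[number] / len(picks)
    print(f"{number:2d}: {share:.1%}")
```

If you surveyed real people instead, you would see the lopsided pile-up on numbers like 7 that the paragraph above describes.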
Historically, humans have engaged in games of chance for centuries, but the concept of analyzing unpredictable events numerically emerged only three centuries ago. Before then, the world was often divided into the known and predictable, governed by natural laws, and the unknown or fateful, beyond human understanding. However, the recognition of the unpredictability of events led to the scientific exploration of probability.
While individual random events are unpredictable, the law of large numbers dictates that the distribution of outcomes from a large number of such events tends towards predictable proportions corresponding to underlying probabilities. This paradox forms the basis of probability theory. Despite the inherent unpredictability of individual events, probability allows us to make informed estimations about the likelihood of outcomes based on statistical analysis.
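The law of large numbers is easy to see in a quick simulation. In this sketch (fair coin assumed, seed chosen arbitrarily), any single flip is unpredictable, yet the proportion of heads drifts towards the underlying probability of 0.5 as the number of flips grows:

```python
import random

# Law of large numbers: individual coin flips are unpredictable, but the
# share of heads converges towards the true probability (0.5) as the
# number of flips increases.
random.seed(0)

def heads_proportion(n_flips: int) -> float:
    """Flip a fair coin n_flips times and return the share of heads."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (10, 100, 10_000):
    print(f"{n:>6} flips -> proportion of heads = {heads_proportion(n):.3f}")
```

With 10 flips the proportion can be far from 0.5; with 10,000 it rarely strays more than a percentage point or two.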
I’m assuming you are already aware of the data types in statistics, so I’m not covering them here. However, to learn more about data types, please refer below.
Let’s look into what a distribution is.
Imagine you're in a candy shop. You walk in and notice jars filled with different types of candies. Some candies are popular and sit in large piles, while others are more rare and scattered around the shelves.
As you look closer, you notice a pattern. The most popular candies, like chocolate bars and gummy bears, are clustered together in the centre of the store, forming a big mountain of sweetness. These are like the peaks of our distribution.
But as you move away from the center, you see candies that are less popular, like liquorice or sour candies. They're still there, but they're spread out, like a trail leading away from the mountain.
That's just like a distribution! The popular candies form the peak or centre of our distribution, while the less popular ones spread out towards the edges.
Just like in our candy shop example, in real-life data you might see a lot of values clustered around a central point, with fewer values as you move away from that point. Understanding these patterns helps us make sense of our data and find the sweet spots where the most important information lies!
What is Probability Distribution?
Probability distributions are a common way to describe, and possibly predict, the probability of an event. The main point is to define the character of the variables whose behaviour we are trying to describe through probability (discrete or continuous). Identifying the right category allows the proper application of a model (for instance, the standardized normal distribution) that can easily predict the probability of a given event.
Probability distributions play a significant role in understanding the relationships between random variables and predicting the outcomes of different processes. One of the primary purposes of a probability distribution is to provide insight into the relationships between random variables. By examining the shape of a particular distribution, we can determine whether a specific set of values is more likely than another set. It can help us predict the likelihood of a specific outcome occurring based on given data.
Another use of probability distributions in data science is determining the range of possible values within a given data set. By analyzing the shape of the probability distribution, we can determine the minimum and maximum values that are likely to occur. We can also use this to estimate the probability of different values appearing in a set of data.
Different Types of Distributions:
The behavior of probability is linked to the features of the phenomenon we want to predict; this link is what we call a probability distribution. Depending on the characteristics of the phenomenon (which we can also call a variable), different probability distributions apply. For categorical (or discrete) variables, the probability can in most cases be described by a binomial or Poisson distribution. For continuous variables, the probability can be described by the most important distribution in statistics, the normal distribution.
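The three families mentioned above can be sampled with a few lines of standard-library Python. This is only a sketch with illustrative parameters (the binomial and Poisson samplers are hand-rolled here; in practice you would reach for a library such as NumPy or SciPy):

```python
import math
import random

random.seed(7)  # arbitrary seed for reproducibility

# Discrete: binomial — number of successes in n independent yes/no trials.
def binomial_sample(n: int, p: float) -> int:
    return sum(random.random() < p for _ in range(n))

# Discrete: Poisson — count of rare events per interval (Knuth's method).
def poisson_sample(lam: float) -> int:
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

# Continuous: normal — random.gauss draws from a Gaussian directly.
binom_mean = sum(binomial_sample(10, 0.5) for _ in range(5_000)) / 5_000
poisson_mean = sum(poisson_sample(3.0) for _ in range(5_000)) / 5_000
normal_mean = sum(random.gauss(0, 1) for _ in range(5_000)) / 5_000

# The sample means should sit near the theoretical means: 5.0, 3.0, 0.0.
print(binom_mean, poisson_mean, normal_mean)
```

Notice how the discrete samplers return whole counts while the normal sampler returns any real number: that is exactly the discrete/continuous split the paragraph above describes.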
Let’s catch up later with an in-detail discussion of the different types of distributions and their examples.