Probability and Statistics in R

In this article, we will review some of the main concepts of probability and statistics.

We will see the sampling process: sample, event, expected value, and mean, and how we can use dispersion measures such as variance and standard deviation to better understand the data. We will also discuss estimators, estimates, and Pearson’s correlation coefficient. Finally, we’ll talk about probability distributions and hypothesis tests.



Basic Concepts of Probability

Mutually exclusive: only one of the possible outcomes can be observed at a time.

Probability: the proportion of times an outcome occurs in the long run, over many repetitions.

Sample space: the set of all possible results of a random experiment is called the sample space.

Event: an event is a subset of the sample space and can consist of one or more outcomes.

Roll a 6-Sided Die

In terms of a random experiment, this is like randomly selecting a sample of size 1 from a set of mutually exclusive outcomes.

In this case, the sample space is the set of all possible results of the die roll: {1, 2, 3, 4, 5, 6}.

An event would be any subset of this set. For example, we could have the following event: the result is between 3 and 6.

Simulate a Die in R

For this, we can use the sample() function. First, let’s look up information about the function:

?sample

The question mark displays a brief description of the function: sample() obtains a sample of a specified size from a set of elements, with or without replacement.

To use this function, we must provide the set of elements and the sample size.

The first parameter is the set of elements, our sample space from 1 to 6. The second tells the function how many draws we want. Initially, we will make only 1 roll:

sample(1:6, 1)
4

If we run this code more than once, we will get different results. This is similar to actually rolling a die. If we wanted to simulate more than one roll, it would be enough to exchange the value 1 for the number of throws we want:

sample(1:6, 4)
1 2 5 6

By default, the sample() function treats our experiment as sampling without replacement. This means that if the face 4 came up on the first roll, any number except 4 could come up on the second.

We know that this does not make much sense for a die roll; nothing prevents the same result from appearing repeatedly in a sequence of N rolls. A draw without replacement is appropriate for a lottery, where the numbers drawn cannot be repeated.
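For instance, a lottery-style draw without replacement might look like this (the range and quantity of numbers here are just illustrative):

# Draw 6 distinct numbers from 1 to 60; replace defaults to FALSE
sample(1:60, 6)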

Therefore, to make our dice-rolling simulation closer to reality, we set the replace parameter to TRUE, an experiment with replacement:

sample(1:6, 10, replace = TRUE)
1 2 4 5 5 5 6 4 3 3

Now we can roll the die 10 times and observe a realistic simulation of repeated rolls.



Expected Value, Average, Variance, and Standard Deviation

The expected value of a random variable can be described, in a simplified way, as the long-run average of its results when the number of repeated trials is large.

We can calculate the expected value of the roll of a 6-sided die:

E(X) = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5

This means that if we rolled the die repeatedly, we would get an average result of 3.5. Obviously, no single roll can produce a fractional value, but on average, the value obtained would be 3.5. Let’s use the sample() function again, together with the mean() function.

Roll the Die N Times

The following code calculates the average of the results of 88 rolls of a die:

mean(sample(1:6, 88, TRUE))
3.3

Each time we run the code, the result will be different; R simulates random draws. As we can see, the output does not exactly match the expected value of 3.5.

We’ll increase the number of rolls to 10,000:

mean(sample(1:6, 10000, TRUE))
3.501

Now, the result has come much closer to 3.5. Let’s increase the number of rolls even further, to one billion:

mean(sample(1:6, 1000000000, TRUE))
3.499943

Depending on the capacity of the computer, the code may take a while to run. As we can see from the result above, the value is even closer to 3.5.

Dispersion Measures

In addition to the Mean, two other concepts are essential when we work with probability and statistics — Variance and standard deviation. Both are considered dispersion measures of a random variable.

Imagine we have 2 values: {1, 1000}. We calculate the average between them:

(1 + 1000)/2
500.5

Now imagine we have 2 other values: {500, 501}:

(500 + 501)/2
500.5

The Mean in both cases is the same, but it would not be prudent to use only the mean to describe the circumstances. That’s why, in addition to the average, we usually use dispersion measures.

Variance

To compute the variance, we subtract the mean from each value and square the result. Then we add all the results and divide by the number of samples. In both cases, the mean is 500.5:

((1 - 500.5)^2 + (1000 - 500.5)^2)/2
249500.25

The variance of a random variable is considered a measure of its statistical dispersion and indicates how far its values are spread from the expected value.

((500 - 500.5)^2 + (501 - 500.5)^2)/2
0.25

We can see that the variance in the first case is much higher than in the second.

The lower the variance, the better the mean summarizes the data.

To calculate the variance, we squared the deviations. This caused the first variance to assume an enormous value and the second a minimal one.

This makes the result a little difficult to interpret when our underlying values range between 1 and 1,000.

Standard Deviation

Another handy measure of the dispersion of the data is the famous standard deviation. We can find it by taking the square root of the variance. In our case, the standard deviation of the first example is the square root of the variance:

sqrt(249500.25)
499.5

In the second example, we see that the average distance from the values to the mean is much smaller:

sqrt(0.25)
0.5

This type of result is much simpler to interpret. The mean distance between the values and the Mean in the first case is 499.5, and in the second case, it is 0.5.
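R also provides the built-in var() and sd() functions. Note that they use the sample formulas, dividing by n - 1 instead of n, so for our two-value examples they will not match the population calculations above exactly:

values <- c(1, 1000)
var(values)  # sample variance, divides by n - 1: 499000.5
sd(values)   # sample standard deviation: about 706.4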

Probability Distributions

We’ll see some essential terms to avoid making any mistakes from now on.

Population

In statistics, the Population is a complete collection of all elements to be studied.

Sample

A sample refers to a part of the population that shares at least one characteristic relevant to what we want to research. From the collected samples, it is possible to make inferences that will serve as a basis for decision-making.

Census

It’s the complete examination of an entire population, collecting data relating to all of its elements.

Parameters

These are numbers that describe the characteristics of the Population. Population Mean and Population Standard Deviation are examples of parameters.

Data Classification

We can classify the data into continuous and discrete.

Continuous Data

Continuous data can assume an infinite number of possible values, without breaks or gaps between them, covering any value within a numeric range.

Discrete Data

Discrete data can assume only a finite, or countable, number of possible values. For example, the number of employees in a workshop is discrete data.

Quantitative Data

This is data that represents counts or measurements.

Qualitative Data

These are data with non-numeric characteristics that can be separated into distinct categories.



Probability

Some important terms are used when we work with probability.

Sample Space — S = {heads, tails}

It is the set of all possible results of a random experiment. For example, when flipping a coin, the sample space will always be the set {heads, tails} — these are the two possible outcomes when we flip a coin.

Event

It is a subset of the sample space. When a coin is flipped, an event is a possible outcome of that flip; an event will always be a subset of the Sample Space S.

For the coin flip example, the subsets are A = heads and B = tails. Both A and B are contained in S.

Certain Event

When we roll a die, the face that ends up on top will undoubtedly be a number between 1 and 6.

Impossible Event

When rolling two dice, the sum of the results can’t be 13, since the maximum possible sum is 12.

Mutually Exclusive Events

Two or more events that cannot occur at the same time.

Probability — Quantification of Uncertainty

For random events, there is always uncertainty as to whether or not an event will occur. This measure of chance, also called probability, can be represented by a number between 0 and 1.

If we are sure that an event will occur, we say its probability is 1, or 100%. If, on the other hand, we are confident that the event will not happen, we can state that its probability is 0, or 0%.
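As a quick check in R, we can estimate the probability of the earlier event (a die roll between 3 and 6) by simulation; the number of rolls here is arbitrary:

# Estimate P(3 <= X <= 6) for a fair die by simulating many rolls
rolls <- sample(1:6, 10000, replace = TRUE)
mean(rolls >= 3 & rolls <= 6)  # proportion of rolls in the event, near 4/6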

Sampling

Sampling consists of procedures for extracting samples that represent a given population well. We use Sampling because we would typically not obtain data related to an entire Population since this process is costly and time-consuming.

The population about which we are going to make sample-based inferences is called the target population.

Representativeness

To make valid inferences about the target population, the sample must be representative. One way to achieve representativeness is to choose the sample through a random process.

Element Extraction

As for element extraction, samples may be drawn with replacement, where a drawn element can be drawn again (a die roll), or without replacement, where a drawn element cannot be drawn more than once (a lottery draw).



Sample Composition

There are two methods for composing a sample.

Non-probabilistic sampling methods, also called intentional sampling, involve a deliberate choice of the elements that make up the sample. The results of such research cannot be generalized to the population, since non-probabilistic samples do not guarantee the representativeness of the population.

Probabilistic sampling methods require that each element of the population have a known probability of being selected.


Pearson Correlation Coefficient

A correlation coefficient is a numeric value that represents the degree of relationship between two or more variables. When we work with samples, we usually denote the correlation coefficient by the letter r.

Pearson’s correlation coefficient, also called the product-moment correlation coefficient, is the degree of correlation between two variables. To interpret the coefficient, you must know that:

  • 1 means that the correlation between the variables is perfectly positive;
  • -1 means it is perfectly negative;
  • 0 means there is no linear relationship between the variables.

Correlation is not the same as cause and effect. Two variables may be highly correlated even though there is no cause-and-effect relationship between them.

If two variables are tied by a causal relationship, however, they are necessarily correlated. The study of Pearson correlation assumes that the variables X and Y have a normal distribution.
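As a minimal sketch, we can compute the coefficient in R with the built-in cor() function; the data below are simulated, with y constructed to depend linearly on x:

set.seed(42)                       # make the simulation reproducible
x <- rnorm(100)                    # 100 draws from a standard normal
y <- 2 * x + rnorm(100, sd = 0.5)  # y depends linearly on x, plus noise
cor(x, y, method = "pearson")      # close to 1: strong positive correlation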


Probability Distribution 

When we apply statistics to problem-solving, we find that many problems share the same characteristics, which allows us to establish a theoretical model for solving them.

The main components of a theoretical statistical model are:

  • The possible values that the random variable X can assume;
  • The probability function associated with the random variable X;
  • The expected value of the random variable X;
  • The Variance and Standard Deviation of the random variable X.

There are two types of theoretical distributions that correspond to different types of data or random variables:

  • Discrete Distribution
  • Continuous Distribution.

In addition to identifying the values of a random variable, we can assign a probability to each of these values.

We have a probability distribution when we know all the values of a random variable along with their respective probabilities.

The probability distribution associates a probability with each numerical result of an experiment; that is, it provides the chance of each value of a random variable. For example, when rolling a die, each face has the same probability of occurrence: 1/6.

Any probability distribution has two crucial characteristics:

  • The probability that a given outcome will occur must always be between 0 and 1.
  • The sum of all values of a probability distribution must always be exactly equal to 1.

We know that an event is a subset of a larger set called the sample space. The probability of a specific event must therefore lie between 0 and 1. In addition, when we add up the probabilities of all the possibilities, we get a value exactly equal to 100%.
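We can verify the second property for a fair die in R:

sum(rep(1/6, 6))  # the six probabilities of a fair die sum to 1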

Cumulative Probability Distribution

The probability distribution of a discrete random variable is the list of all possible values of the variable together with their probabilities, which add up to 1.

A cumulative probability distribution function provides the probability that the random variable is less than or equal to a specific value. For example, consider each of the possible results of rolling a die:

Face    Probability    Cumulative Probability
1       1/6            1/6
2       1/6            2/6
3       1/6            3/6
4       1/6            4/6
5       1/6            5/6
6       1/6            6/6

The probability is the same for every face, while the cumulative probability increases until it reaches 100%, or 6/6.

Now, let’s run some R code that plots the probability of each face of a die roll:

probability <- rep(1/6, 6)
plot(probability,
     main = 'Probability Distribution',
     xlab = 'Values')

We can see a set of points, all with a value equivalent to 1/6, or approximately 0.1666667.

accumulated_probability <- cumsum(probability)
plot(accumulated_probability,
     xlab = "Values",
     main = "Accumulated Probability")

When plotting the data referring to the cumulative probability distribution, we can see that the values accumulate until they reach 100%.


Normal Distribution

The normal distribution is the most important statistical distribution, in both practical and theoretical terms. This type of distribution is bell-shaped, unimodal, and symmetric about its mean.

Considering the probability of occurrence, the total area under the curve sums to 100%. This means that the probability of an observation assuming a value between any two points is equal to the area under the curve between those two points.
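In R, areas under the normal curve are given by the pnorm() function. For example, for the standard normal distribution:

# Probability that a standard normal observation falls between -1 and 1
pnorm(1) - pnorm(-1)  # about 0.6827, the area between the two points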

Central Limit Theorem

The central limit theorem expands the application of the normal distribution. It states that, as the sample size increases, the distribution of the sample mean tends toward a normal distribution, regardless of the shape of the original population.
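As a quick illustration (the sample sizes here are arbitrary), the sketch below simulates many sample means of die rolls; their histogram approaches the bell shape:

# Means of 1,000 samples, each consisting of 30 die rolls
sample_means <- replicate(1000, mean(sample(1:6, 30, replace = TRUE)))
hist(sample_means,
     main = "Distribution of Sample Means",
     xlab = "Sample mean of 30 die rolls")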

The normal distribution has a smooth, symmetric shape that resembles a bell. This distribution has only one value for the mode (it is unimodal), with its maximum frequency point located in the middle of the distribution, where the mean, median, and mode coincide.


Normal Distribution Properties:

  • The random variable X can assume any real value;
  • The graphical representation is bell-shaped, centered around the mean (μ);
  • The total area between the curve and the abscissa (x-axis) is equal to 1; this is the probability of the random variable X assuming any real value;
  • The probability of the random variable X taking values above the mean is equal to the probability of taking values below the mean. Both probabilities are 50%.

Estimators and Estimates — Statistical Inference

Estimators are functions of sample data extracted from an unknown population.

Estimates are numerical values calculated by estimators based on sample data. That is, we use estimators to find estimates. Estimators and estimates are part of what we call Statistical Inference.

Inference is a technique that allows us to extrapolate results from samples. This means that we can make statements and draw conclusions based on partial or reduced data, and validate those conclusions beyond the observed sample.
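For example, the sample mean is an estimator; applying it to a concrete sample produces an estimate. Here is a minimal sketch with simulated data (the population values are hypothetical):

set.seed(7)
population <- rnorm(100000, mean = 170, sd = 10)  # hypothetical heights, in cm
my_sample <- sample(population, 50)               # a random sample of size 50
mean(my_sample)  # the mean() estimator yields an estimate, close to 170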


Hypothesis Tests — Significance Test

The Hypothesis Test is a statistical procedure that allows you to decide between two or more hypotheses.

The Hypothesis Test basically involves deciding between accepting or rejecting the null hypothesis using the data observed in a given experiment. The test examines two opposite assumptions about a population:

  • The Null Hypothesis is the statement being tested. Typically, the null hypothesis is a declaration of “no effect” or “no difference.”
  • The Alternative Hypothesis is the statement that we want to conclude based on evidence provided by the sample data.

Based on the sample data, the test determines whether we should reject the null hypothesis. Some examples of questions that hypothesis tests can answer (the first is illustrated with a code sketch after this list):

  • Does the average height of university women differ from 1.55m?
  • Is the standard deviation of this height equal to or less than 12cm?
  • Do male and female students differ in height, on average?
  • Is the proportion of male undergraduate students significantly higher than the proportion of female students?
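As an illustration of the first question, we could run a one-sample t-test in R (a minimal sketch; the heights below are simulated, not real data):

set.seed(123)
heights <- rnorm(50, mean = 1.58, sd = 0.07)  # hypothetical sample of 50 heights, in meters

# H0: mean height = 1.55 m; Ha: mean height differs from 1.55 m
t.test(heights, mu = 1.55)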

Some methods used to perform hypothesis tests are Fisher’s method, Pearson’s method, and the Bayes method. Through probability theory, it is possible to infer a population’s quantity of interest from a sample observed in a scientific experiment.

A hypothesis test specifies whether to accept or reject a claim about a population based on the evidence provided by a sample of data.



Concluding

In this article, we reviewed some of the main concepts of probability and statistics. We talked about sample space and events, expected value, mean, variance, and standard deviation. We also touched on sampling and the correlation coefficient.

And to top it off, we saw probability distributions and hypothesis tests. This was a brief review of some concepts of probability and statistics, aimed at helping us remember crucial ideas during the data science journey.

And there we have it. I hope you have found this useful. Thank you for reading.
