BASICS OF PROBABILITY AND STATISTICS :
Priyanka Sethi
Microsoft Certified Azure Data Engineer Associate(DP-203) |Big data Engineer @IBM|SQL| Azure| PySpark |Python| ADB|ADF |CSE| Machine Learning
ANALYSING CATEGORICAL DATA:
Categorical data analysis is the analysis of data where the response variable has been grouped into a set of mutually exclusive ordered (such as age group) or unordered (such as eye color) categories.
TYPES OF VARIABLE (INDIVIDUALS, VARIABLES, QUANTITATIVE AND CATEGORICAL DATA):
Individuals are the objects described by a set of data.
- Population is all individuals of interest. Inferential Statistics: Assume, or infer, something about the population based on data.
- Sample is a subset of the population. Descriptive Statistics: Describing your sample when interpreting your data.
Variables are characteristics of individuals. When your data is provided as a list of information, you can identify the individuals of the story problem situation by organizing the data in a spreadsheet format.
In a spreadsheet, consider the entries of the variable to determine which type the variable is.
1. Quantitative: Data is described using numbers such that the values of the numbers are used in calculations. For example, calculating the average test score. They can also have units of measure attached to them, such as miles/hour, gallons, inches, seconds, degrees, etc.
   a. GRAPHS for organizing include Dot plot, Histogram, Stem plot, Boxplot, etc.
2. Categorical: Data described usually with words but can be numbers if the values are not taken into consideration, such as with SS#, Telephone Number, Zip Code, ISBN#, Driver’s License, etc.
   a. GRAPHS for organizing include Bar Graph, Pie Chart, etc.
Pictographs
A pictograph is the representation of data using images. Pictographs represent the frequency of data while using symbols or images that are relevant to the data. This is one of the simplest ways to represent statistical data. And reading a pictograph is made extremely easy as well.
How to make a Pictograph
Let us take an example. We must represent how many TV sets have been sold in the last few years via a pictograph. So we get started
- Collect Data: The first step is collecting the data for the category you want to represent. Collect your data by appropriate means, then make a list or table of the data. Finally, review the data one more time.
- Pick your symbol: Pick a symbol or picture that accurately represents your data. If you are drawing a pictograph to represent TV sets sold then a symbol of a basketball would be highly confusing! So pick your symbol carefully.
- Assign a Key: Sometimes the frequency of the data is too high. Then one symbol cannot represent one frequency. You must set a numerical value that one symbol will represent. This numerical value must be written along with the pictograph. Example one symbol of a TV represents 500 TV sets. This is the key of the pictograph.
- Draw the pictograph: The final step is drawing your pictograph. Draw the two columns that represent the category and the data. Then draw the actual symbols that represent the frequencies. Remember that a symbol can be drawn as a fraction of itself if the frequency is not a whole multiple of the key.
- Review your Data: And finally, review your pictograph and make sure it correctly represents the information that you wanted to relay. Don’t forget to check the labelling of your graph.
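The key step above can be sketched in a few lines of Python. This is only an illustration: the sales figures and the text symbols `[TV]` and `[T` (standing for a half-drawn symbol) are made up, and only the key of 500 TV sets per symbol comes from the example in the text.

```python
KEY = 500  # one symbol represents 500 TV sets (the key from the text)

# Hypothetical sales figures, invented for this sketch.
sales = {"2019": 1500, "2020": 2250, "2021": 1000}

def symbols_needed(frequency, key=KEY):
    """Number of symbols for a frequency; may be fractional."""
    return frequency / key

def draw_row(frequency, key=KEY):
    """Draw whole symbols, plus a partial symbol for any remainder."""
    count = symbols_needed(frequency, key)
    whole = int(count)
    partial = "[T" if count > whole else ""
    return "[TV] " * whole + partial

for year, sold in sales.items():
    print(f"{year} | {draw_row(sold)}")
```

With these numbers, 2250 sets need 4.5 symbols, so the row shows four whole symbols and one partial symbol.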
Bar Graph: In math, we deal with a lot of numbers, for example equations, integers, fractions and so on. But looking at numbers all the time can get very confusing and tiresome. For this reason, we sometimes take the help of bar graphs, tables, charts, etc. to make sense of all these numbers. To begin with, let us learn about the bar graph.
How to Make and Use a Bar Graph
- Collect data: The first step in drawing any graph is to collect data. Since a bar graph is a comparative study, make sure you collect data for all the categories.
- Draw the axis: In any graph, there are two axes. Draw the x-axis and the y-axis.
- Label the axis: First, label the x-axis. On the x-axis, we represent the categories. For example, we label the names of chocolates i.e. A, B, C on the x-axis. Next, label the y-axis. To do this, take the highest frequency and plot the points on the y-axis accordingly.
- Draw the Columns: Finally, we draw the bars. In general, the bars are not connected or continuous. Now, extend the bars from the base value to their corresponding frequency. And if the value falls between the plotted frequencies, take an approximate point between the two.
- Interpret the Data: Once the bar graph is complete, we can interpret it. We can find out the most and least preferred choices. We can also identify the outliers.
Pie Charts: A pie chart or a pie graph is a circular representation of data. A pie chart not only represents frequency but also numerical proportion. Each section of a pie chart is the proportionate quantity of the whole data, and the sections of a pie chart always add up to 100%.
How to draw a Pie Chart
Let us now take a step-by-step look at how to represent data via pie charts
- Data: Gather your values and arrange them in a descending order.
- Find denominator: Add all the values to arrive at the whole. This total will be your denominator.
- Percentage: Now we calculate the percentage of each value in relation to the total. We simply divide each value by the denominator we got from step 2. It will be easier to leave the result in decimal form.
- Calculate the angles: So we multiply each percentage (in decimal form) by 360 (degrees in a circle)
- Circle: And now we draw a circle to begin drawing a pie chart
- Draw individual sections: Now with the help of a protractor we draw each section of the pie chart. We use the angles we obtained from step 4
- Colour: Each section must be a different colour so it can be easily identified.
- Review: And the final step is to review the information
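The arithmetic in steps 1 to 4 can be sketched as follows. The category names and values are hypothetical, chosen only so the numbers are easy to check by hand.

```python
# Hypothetical values for three categories (invented for this sketch).
values = {"A": 50, "B": 30, "C": 20}

# Step 1: arrange the values in descending order.
ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)

# Step 2: the denominator is the sum of all values.
total = sum(values.values())

# Steps 3 and 4: fraction of the whole (decimal form), then the angle.
angles = {}
for name, value in ordered:
    fraction = value / total
    angles[name] = fraction * 360  # degrees in a circle
    print(f"{name}: {fraction:.2f} of the whole, {angles[name]:.1f} degrees")
```

The angles should add up to 360 degrees, which is a quick way to review the calculation before drawing.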
Displaying and comparing quantitative data:
FREQUENCY TABLE AND DOT PLOTS:
The frequency of a particular score is the number of times the score occurs in the data. A frequency table is a tabular representation of the data values with their corresponding frequencies. Frequency tables can be used for numerical and categorical data.
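A frequency table can be built with `collections.Counter` from the Python standard library. The scores below are hypothetical; the last column prints one dot per occurrence, which gives a rough text version of a dot plot.

```python
from collections import Counter

# Hypothetical test scores, invented for this sketch.
scores = [7, 8, 7, 9, 10, 8, 7, 6, 8, 7]

# Count how many times each score occurs.
frequency = Counter(scores)

print("Score | Frequency | Dots")
for score in sorted(frequency):
    count = frequency[score]
    print(f"{score:>5} | {count:>9} | {'.' * count}")
```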
Shape of Data: When data is represented as a histogram we say that the distribution is:
- Symmetric if there is a single peak and the data trail off on either side in roughly the same manner.
- Negatively skewed if the data peak to the right and trail off to the left (negative direction).
- Positively skewed if the data peak to the left and trail off to the right (positive direction).
If a histogram has 2 peaks the data is called bimodal. This is often the case if the data has come from two different populations, for example the heights of a group of Junior School students and a group of Senior School students.
An outlier is a value that stands out from the main body of the data. A data set may have more than one outlier.
The range of a set of data is the highest value minus the lowest value.
Using a log scale to display data: Some data will have a huge range, which can make displaying this data quite difficult, for example the populations of countries, ranging from thousands through to billions. Consider the populations 17 000, 49 000, 210 000, 1 200 000, 13 000 000 and 1 000 000 000. Plotting a histogram of these populations in thousands on a linear scale places almost every value in the first column, so the differences between the smaller populations cannot be seen.
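To see what a log scale does to these six populations, the base-10 logarithm of each value can be computed with `math.log10`. This is a small sketch, not a plotting recipe.

```python
import math

# The six populations listed in the text.
populations = [17_000, 49_000, 210_000, 1_200_000, 13_000_000, 1_000_000_000]

# On a log10 scale, each step of 1 means "ten times larger".
log_values = [math.log10(p) for p in populations]

for population, log_value in zip(populations, log_values):
    print(f"{population:>13,} -> log10 = {log_value:.2f}")
```

The raw values span five orders of magnitude, while the logged values all lie between 4 and 9 and can be shown on one axis.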
Dot Plots :
A dot chart or dot plot is a statistical chart consisting of data points plotted on a fairly simple scale, typically using filled in circles. There are two common, yet very different, versions of the dot chart.
Histograms: A histogram is similar to a bar chart, but is only for numerical data and can also be used for class intervals. It has no gaps between the columns. For grouped data the labels appear under the edge of each column
Stem and leaf plots
A stem and leaf plot displays numerical data by splitting each data point into a "leaf" (usually the last digit) and a "stem" (the leading digit or digits).
For example, the buyer for a chain of department stores counted the number of pairs of boots at each of the stores and made a stem and leaf plot for the data.
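A stem and leaf plot can be sketched with `divmod`, which splits each value into its tens (the stem) and its last digit (the leaf). The boot counts here are hypothetical, since the original data is not shown in the text.

```python
from collections import defaultdict

# Hypothetical pairs of boots per store (invented for this sketch).
boots = [12, 15, 21, 24, 24, 30, 33, 37, 41, 48]

stems = defaultdict(list)
for value in sorted(boots):
    stem, leaf = divmod(value, 10)  # e.g. 24 -> stem 2, leaf 4
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```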
Mean, median and mode are measures of the average value, or central tendency, of a numerical data set.
Mean
The first measure we will study is the mean also known as average. Mean can be calculated by adding all data points and dividing by the number of data points.
Median
Median is the middle value of a sorted data set; found by ordering all data points and picking out the one in the middle (or if there are two middle numbers, taking the mean of those two numbers).
HOW TO CALCULATE MEDIAN:
- Arrange your numbers in numerical order.
- Count how many numbers you have.
- If you have an odd number, divide by 2 and round up to get the position of the median number.
- If you have an even number, divide by 2. Go to the number in that position and average it with the number in the next higher position to get the median.
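The four steps above translate almost directly into Python. The function name `median` and the sample lists are illustrative only; the standard library also provides `statistics.median`.

```python
def median(values):
    """Median following the steps above (positions are 1-based)."""
    ordered = sorted(values)          # step 1: numerical order
    n = len(ordered)                  # step 2: count the numbers
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2       # step 3: n / 2 rounded up
        return ordered[position - 1]
    position = n // 2                 # step 4: average positions n/2 and n/2 + 1
    return (ordered[position - 1] + ordered[position]) / 2

print(median([5, 1, 3]))        # odd count
print(median([4, 1, 3, 2]))     # even count
```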
Mode
The most frequent number—that is, the number that occurs the highest number of times. Example: The mode of {4 , 2, 4, 3, 2, 2} is 2 because it occurs three times, which is more than any other number.
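The mode example can be checked with the standard `statistics` module, which also computes the mean and median of the same data set.

```python
import statistics
from collections import Counter

data = [4, 2, 4, 3, 2, 2]  # the data set from the mode example

mode_value = statistics.mode(data)      # most frequent value
times_seen = Counter(data)[mode_value]  # how often it occurs
mean_value = statistics.mean(data)      # sum divided by count
median_value = statistics.median(data)  # middle of the sorted data

print(f"mode = {mode_value} (occurs {times_seen} times)")
print(f"mean = {mean_value:.2f}, median = {median_value}")
```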
Cumulative Frequency
Cumulative frequency is the running total of the frequencies. On a graph, it can be represented by a cumulative frequency polygon, where straight lines join up the points, or a cumulative frequency curve.
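A running total is exactly what `itertools.accumulate` computes. The class intervals and frequencies below are hypothetical.

```python
from itertools import accumulate

# Hypothetical class intervals and their frequencies.
intervals = ["0-9", "10-19", "20-29", "30-39", "40-49"]
frequencies = [2, 5, 8, 4, 1]

# Each entry is the sum of all frequencies up to and including that class.
cumulative = list(accumulate(frequencies))

for interval, freq, running in zip(intervals, frequencies, cumulative):
    print(f"{interval:>6} | {freq:>2} | {running:>2}")
```

The last cumulative frequency always equals the total number of data points.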
What are z-scores?
A z-score measures exactly how many standard deviations above or below the mean a data point is.
Here's the formula for calculating a z-score:
z = (data point − mean) / standard deviation
Here are some important facts about z-scores:
- A positive z-score says the data point is above average.
- A negative z-score says the data point is below average.
- A z-score close to 0 says the data point is close to average.
- A data point can be considered unusual if its z-score is above 3 or below -3
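The formula and the rule of thumb for unusual values can be sketched as two small functions. The function names, and the example mean of 70 with a standard deviation of 5, are assumptions made for illustration.

```python
def z_score(data_point, mean, standard_deviation):
    """How many standard deviations the data point is from the mean."""
    return (data_point - mean) / standard_deviation

def is_unusual(z, threshold=3):
    """Rule of thumb from the text: above 3 or below -3."""
    return z > threshold or z < -threshold

# Hypothetical test scores with mean 70 and standard deviation 5.
for score in (60, 70, 88):
    z = z_score(score, mean=70, standard_deviation=5)
    print(f"score {score}: z = {z:+.1f}, unusual = {is_unusual(z)}")
```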
TRANSFORMING DATA PROBLEM:
It is very common to take data and apply the same transformation to every data point in the set. For example, we may take a set of temperatures in degrees Fahrenheit and convert them all to degrees Celsius. How would this conversion affect the measures of center and spread in the data set? The two basic cases are adding a constant to every data point and multiplying every data point by a constant.
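A small experiment shows the effect of the Fahrenheit to Celsius conversion, C = (F − 32) × 5/9, which both shifts and rescales the data. The temperatures are hypothetical. The mean follows the whole conversion, while the standard deviation is only affected by the multiplication.

```python
import statistics

# Hypothetical temperatures in degrees Fahrenheit.
fahrenheit = [50.0, 59.0, 68.0, 77.0, 86.0]
celsius = [(f - 32) * 5 / 9 for f in fahrenheit]

mean_f = statistics.mean(fahrenheit)
mean_c = statistics.mean(celsius)
sd_f = statistics.pstdev(fahrenheit)   # population standard deviation
sd_c = statistics.pstdev(celsius)

# The mean is shifted and scaled; the spread is only scaled.
print(f"mean: {mean_f:.1f} F -> {mean_c:.1f} C")
print(f"sd:   {sd_f:.2f} F -> {sd_c:.2f} C (ratio {sd_c / sd_f:.4f})")
```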
What is a normal distribution?
Early statisticians noticed the same shape coming up over and over again in different distributions—so they named it the normal distribution.
Normal distributions have the following features:
- symmetric bell shape
- mean and median are equal; both located at the center of the distribution
- ≈95% of the data falls within 2 standard deviations of the mean
- ≈99.7% of the data falls within 3 standard deviations of the mean
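These percentages can be checked with `statistics.NormalDist` (available in Python 3.8 and later). The helper name `share_within` is just for this sketch.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)  # standard normal distribution

def share_within(k):
    """Share of the data within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (2, 3):
    print(f"within {k} standard deviations: {share_within(k):.1%}")
```

Because the distribution is symmetric, the median of `standard` equals its mean, as stated in the list above.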
What is a scatterplot?
A scatterplot is a type of data display that shows the relationship between two numerical variables. Each member of the dataset gets plotted as a point whose (x, y) coordinates relate to its values for the two variables. When we look at a scatterplot, we should be able to describe the association we see between the variables.
A quick description of the association in a scatterplot should always include a description of the form, direction, and strength of the association, along with the presence of any outliers.
Form: Is the association linear or nonlinear?
Direction: Is the association positive or negative?
Strength: Does the association appear to be strong, moderately strong, or weak?
Outliers: Do there appear to be any data points that are unusually far away from the general pattern?
It's also important to include the context of the two variables in the description of these features.
What is correlation?
We often see patterns or relationships in scatterplots. When the y variable tends to increase as the x variable increases, we say there is a positive correlation between the variables. When the y variable tends to decrease as the x variable increases, we say there is a negative correlation between the variables.
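The text describes correlation by direction only. One common way to put a number on it, assumed here rather than taken from the text, is Pearson's correlation coefficient r, which is positive when y tends to increase with x and negative when y tends to decrease. The data sets are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours_studied = [1, 2, 3, 4, 5]          # hypothetical x values
test_score = [52, 60, 61, 70, 78]        # tends to rise with x
hours_of_tv = [9, 7, 6, 4, 2]            # tends to fall with x

print(f"positive example: r = {pearson_r(hours_studied, test_score):.2f}")
print(f"negative example: r = {pearson_r(hours_studied, hours_of_tv):.2f}")
```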
We can simulate events involving randomness like picking names out of a hat using tables of random digits. Tables of random digits can be used to simulate a lot of different real-world situations.
Things to know about random digit tables:
- Each digit is equally likely to be any of the 10 digits 0 through 9
- The digits are independent of each other. Knowing about one part of the table doesn't give away information about another part.
- The digits are put in groups of 5 just to make them easier to read. The groups and rows have no special meaning. They are just a long list of random digits.
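A short row of random digits can be generated with Python's `random` module and used to pick a name out of a hat. This is a sketch under assumptions: a pseudo-random generator stands in for a printed table, the five names are invented, and each name is assigned two of the ten digits so that every name is equally likely.

```python
import random

rng = random.Random(2024)  # fixed seed so the "table" is reproducible

# 50 digits, each equally likely to be 0 through 9.
digit_table = [rng.randrange(10) for _ in range(50)]

# Print in groups of 5, purely for readability.
groups = ["".join(map(str, digit_table[i:i + 5])) for i in range(0, 50, 5)]
print(" ".join(groups))

# Hypothetical names: digits 0-1 -> first name, 2-3 -> second, and so on.
names = ["Asha", "Ben", "Chen", "Dita", "Emil"]

def pick_name(digit):
    return names[digit // 2]

winner = pick_name(digit_table[0])  # read the first digit of the table
print("picked:", winner)
```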
Probability means possibility. It is a branch of mathematics that deals with the occurrence of a random event. The value is expressed from zero to one. Probability was introduced in mathematics to predict how likely events are to happen.
- The meaning of probability is basically the extent to which something is likely to happen. This is the basic probability theory, which is also used in the probability distribution , where you will learn the possibility of outcomes for a random experiment. To find the probability of a single event to occur, first, we should know the total number of possible outcomes.
Formula: probability of an event happening, P(E) = Number of favorable outcomes / Total number of outcomes
- The probability of an event can only be between 0 and 1 and can also be written as a percentage.
- The probability of event A is often written as P(A)
- If P(A) > P(B) then event A has a higher chance of occurring than event B
- If P(A) = P(B), then A and B are equally likely to occur.
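The formula can be tried out with exact fractions. The fair six-sided die is an assumed example, and the function name `probability` is illustrative.

```python
from fractions import Fraction

def probability(favorable_outcomes, total_outcomes):
    """P(E) = favorable outcomes / total outcomes, as an exact fraction."""
    return Fraction(favorable_outcomes, total_outcomes)

# Rolling a fair six-sided die (equally likely outcomes assumed).
p_even = probability(3, 6)   # outcomes 2, 4, 6
p_six = probability(1, 6)    # outcome 6

print(f"P(even) = {p_even} = {float(p_even):.0%}")
print(f"P(six)  = {p_six}")
print("even is more likely than six:", p_even > p_six)
```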
Combinations and Permutations
Another type of counting question is when you have a given number of objects, you want to choose some (or all) of them, and you want to know how many ways there are to do this. For example, a teacher with a class of 30 students wants 5 of them to do a presentation, and she wants to know how many ways this could happen. These types of questions have to do with combinations and permutations. The difference between combinations and permutations is whether or not the order you are choosing the objects matters.
- A teacher choosing a group to make a presentation is a combination problem, because order does not matter.
- A teacher choosing 1st-, 2nd-, and 3rd-place winners in a science fair is a permutation problem, because the order does matter. (1st place and 2nd place are different outcomes.)
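Both kinds of count are available in the standard library as `math.comb` and `math.perm` (Python 3.8 and later). The class of 30 and the group of 5 come from the example above; the number of science fair entrants is not given in the text, so 30 is assumed here.

```python
import math

# Combination: choose 5 presenters from 30 students, order does not matter.
presentation_groups = math.comb(30, 5)

# Permutation: 1st, 2nd and 3rd place from an assumed 30 entrants,
# where the order does matter.
podium_orders = math.perm(30, 3)

print("ways to choose the presentation group:", presentation_groups)
print("ways to award 1st, 2nd and 3rd place:", podium_orders)
```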
RANDOM VARIABLES AND PROBABILITY DISTRIBUTION:
A random variable is a numerical description of the outcome of a statistical experiment. A random variable that may assume only a finite number or an infinite sequence of values is said to be discrete; one that may assume any value in some interval on the real number line is said to be continuous.
The probability distribution for a random variable describes how the probabilities are distributed over the values of the random variable. For a discrete random variable, x, the probability distribution is defined by a probability mass function, denoted by f(x). This function provides the probability for each value of the random variable. In the development of the probability function for a discrete random variable, two conditions must be satisfied: (1) f(x) must be nonnegative for each value of the random variable, and (2) the sum of the probabilities for each value of the random variable must equal one.
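The two conditions can be checked mechanically. In this sketch the probability mass function is stored as a dictionary from value to probability, and the example (number of heads in two fair coin tosses) is an assumption for illustration.

```python
def is_valid_pmf(pmf, tolerance=1e-9):
    """Check the two conditions: f(x) >= 0 for every x, and sum of f(x) = 1."""
    non_negative = all(p >= 0 for p in pmf.values())
    sums_to_one = abs(sum(pmf.values()) - 1) <= tolerance
    return non_negative and sums_to_one

# x = number of heads in two tosses of a fair coin.
heads_pmf = {0: 0.25, 1: 0.50, 2: 0.25}

print("valid pmf:", is_valid_pmf(heads_pmf))
print("not valid:", is_valid_pmf({0: 0.5, 1: 0.6}))  # sums to 1.1
```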
A confidence interval (CI) is a range of values we are fairly sure our true value lies in. It is a type of estimate computed from the statistics of the observed data, and it proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level, which can be chosen by the investigator. A valid procedure at, say, the 95% level produces intervals that contain the true underlying parameter in about 95% of repeated samples. In general terms, a confidence interval for an unknown parameter is based on the sampling distribution of a corresponding estimator.
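As a rough sketch, a confidence interval for a mean can be computed as the sample mean plus or minus a critical value times the standard error. This version assumes a normal approximation (a z critical value); for small samples a t critical value would usually be used instead. The sample values and the function name are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate CI for the mean using a normal (z) critical value."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
print(f"95% CI: ({low95:.3f}, {high95:.3f})")
print(f"99% CI: ({low99:.3f}, {high99:.3f})")
```

A higher confidence level gives a wider interval for the same data.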
We use p-values to make conclusions in significance testing. More specifically, we compare the p-value to a significance level α to make conclusions about our hypotheses.
If the p-value is lower than the significance level we chose, then we reject the null hypothesis H0 in favor of the alternative hypothesis Ha. If the p-value is greater than or equal to the significance level, then we fail to reject the null hypothesis H0, but this doesn't mean we accept H0.
p-value < α ⇒ reject H0 ⇒ accept Ha
p-value ≥ α ⇒ fail to reject H0
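The decision rule is simple enough to write as a function. The name `decide` and the default significance level of 0.05 are assumptions for this sketch.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level alpha."""
    if p_value < alpha:
        return "reject H0 (accept Ha)"
    return "fail to reject H0"

for p in (0.01, 0.05, 0.20):
    print(f"p-value = {p:.2f}: {decide(p)}")
```

Note that a p-value exactly equal to alpha falls on the "fail to reject" side of the rule.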
Review: Error probabilities and alpha
A Type I error is when we reject a true null hypothesis. Lower values of alpha make it harder to reject the null hypothesis, so choosing lower values for alpha can reduce the probability of a Type I error. The consequence here is that if the null hypothesis is false, it may be more difficult to reject using a low value for alpha. So using lower values of alpha can increase the probability of a Type II error.
A Type II error is when we fail to reject a false null hypothesis. Higher values of alpha make it easier to reject the null hypothesis, so choosing higher values for alpha can reduce the probability of a Type II error. The consequence here is that if the null hypothesis is true, increasing alpha makes it more likely that we commit a Type I error (rejecting a true null hypothesis).
Inferences about the Differences of Two Populations
Up to this point, we have discussed inferences regarding a single population parameter (e.g., μ, p, σ²). We have used sample data to construct confidence intervals to estimate the population mean or proportion and to test hypotheses about the population mean and proportion. In all of those examples, one sample was used to form an inference about one population. Frequently, we need to compare two sets of data and make inferences about two populations. This section deals with inferences about two means, proportions, or variances. For example:
- You are studying turkey habitat and want to see if the mean number of brood hens is different in New York compared to Pennsylvania.
- You want to determine if the treatment used in Skaneateles Lake has reduced the number of milfoil plants over the last three years.
- Is the proportion of people who support alternative energy in California greater compared to New York?
- Is the variability in application different between two mist blowers?
These questions can be answered by comparing the differences of:
- Mean number of hens in NY to the mean number of hens in PA.
- Number of plants in 2007 to the number of plants in 2010.
- Proportion of people in CA to the proportion of people in NY.
- Variances between the mist blowers.
This topic also has 5 sections:
1. Inferences about Two Means with Independent Samples (Assuming Unequal Variances)
2. Pooled Two-sample t-test (Assuming Equal Variances)
3. Inferences about Two Means with Dependent Samples: Matched Pairs
4. Inferences about Two Population Proportions
5. F-Test for Comparing Two Population Variances
What is a Chi Square Test?
There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:
- A chi-square goodness of fit test determines if sample data matches a population.
- A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether distributions of categorical variables differ from one another.
How to read the test statistic:
- A very small chi-square test statistic means that your observed data fits your expected data extremely well. In a test for independence the expected counts assume no relationship, so a small statistic gives no evidence of a relationship between the variables.
- A very large chi-square test statistic means that the observed data does not fit the expected data very well. In a test for independence, this is evidence that the variables are related.
A chi-square statistic is one way to show a relationship between two categorical variables. In statistics, there are two types of variables: numerical and non-numerical (categorical) variables. The chi-squared statistic is a single number that tells you how much difference exists between your observed counts and the counts you would expect if there were no relationship at all in the population.
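The statistic for a test of independence can be computed by hand from a contingency table: each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The two tables below are hypothetical, and the sketch stops at the statistic without looking up a p-value.

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the two variables were unrelated.
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

related = [[30, 10], [20, 40]]     # hypothetical counts, rows differ
unrelated = [[10, 20], [20, 40]]   # hypothetical counts, rows proportional

print(f"related table:   {chi_square_statistic(related):.3f}")
print(f"unrelated table: {chi_square_statistic(unrelated):.3f}")
```

The second table has rows in exactly the same proportions, so observed and expected counts agree and the statistic is zero.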
INFERENCE FOR REGRESSION
We usually rely on statistical software to identify point estimates and standard errors for parameters of a regression line. After verifying conditions hold for fitting a line, we can use the methods learned earlier for the t-distribution to create confidence intervals for regression parameters or to evaluate hypothesis tests.
LINEAR VS NON LINEAR REGRESSION:
Many people think that the difference between linear and nonlinear regression is that linear regression involves lines and nonlinear regression involves curves. This is partly true, and if you want a loose definition for the difference, you can probably stop right there. However, linear equations can sometimes produce curves. In order to understand why, you need to take a look at the form of the linear regression equation.
Linear regression can, surprisingly, produce curves.
Nonlinear regression uses nonlinear regression equations, which take the form:
Y = f(X,β) + ε
where:
- X = a vector of p predictors,
- β = a vector of k parameters,
- f(·) = a known regression function,
- ε = an error term.
Analysis of Variance (ANOVA) is a statistical test used to determine if more than two population means are equal.
The test uses the F-distribution (a probability distribution) and information about the variance within each population and between the populations to help decide whether the variability between the populations is significantly larger than the variability within them.
1. Know the purpose of the analysis of variance test.
The analysis of variance (ANOVA) test statistic is used to test if more than 2 population means are equal.
2. Know the difference between the within-sample estimate of the variance and the between-sample estimate of the variance and how to calculate them.
3. Know the properties of an F-distribution:
There is an infinite number of F-distributions, one for each combination of
the degrees of freedom (df1) of the between-sample variance and the degrees of freedom (df2) of the within-sample variance. The alpha significance level then sets the critical value within the chosen distribution.
4. Know how sum of squares relate to Analysis of Variance.
Total Variation = Explained Variation + Unexplained Variation.
Sum of Squares
The sum of squares for the between variation is either given by the symbol SSB (sum of squares between)
or SSTR (sum of squares for treatments) and is the explained variation.
To calculate SSB or SSTR, we sum the squared deviations of the sample treatment means from the grand mean
and multiply by the number of observations for each sample.
The sum of squares for the within-sample variation is given by either the symbol SSW (sum of squares within)
or SSE (sum of squares for error), and is the unexplained variation.
To calculate SSW, we first obtain the sum of squared deviations from the sample mean for each sample and then add them together.
The Total Sum of Squares, SSTO = SSB + SSW
5. Know how to construct an ANOVA Table.
The various statistics computed from the analysis of variance above can be summarized in an ANOVA Table
6. Know how to interpret the data in the ANOVA table against the null hypothesis.
Criteria for not rejecting the Null Hypothesis:
If the F-statistic computed in the ANOVA table is less than the F-table statistic, or
the P-value is greater than the alpha level of significance, then there is no reason to reject the null hypothesis
that all the means are the same.
7. Know the procedure for testing the null hypothesis that the means of more than two populations are equal.
Step 1. Formulate Hypotheses
Step 2. Select the F-Statistic Test for equality of more than two means
Step 3. Obtain or decide on a significance level for alpha, say
Step 4. Compute the test statistic from the ANOVA TABLE
Step 5. Identify the critical Region: The region of rejection of H0 is obtained from the f-table with alpha and degrees of freedom (k-1, n-k).
Step 6. Make a decision:
That is, fail to reject H0 if: F-Statistic < F-table value or P-value > alpha; otherwise reject H0.
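The sums of squares and the F-statistic described above can be computed directly for a small example. The three samples are hypothetical, and the sketch stops at the F-statistic: the critical value for degrees of freedom (k-1, n-k) still has to come from an F-table or statistical software.

```python
def one_way_anova(groups):
    """Return SSB, SSW, SSTO and the F-statistic for a list of samples."""
    k = len(groups)                              # number of samples
    n = sum(len(group) for group in groups)      # total observations
    grand_mean = sum(sum(group) for group in groups) / n
    means = [sum(group) / len(group) for group in groups]

    # Between: squared deviation of each sample mean from the grand mean,
    # multiplied by the number of observations in that sample.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within: squared deviations from each sample's own mean, summed.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Total: squared deviations of every observation from the grand mean.
    ssto = sum((x - grand_mean) ** 2 for g in groups for x in g)

    f_statistic = (ssb / (k - 1)) / (ssw / (n - k))
    return ssb, ssw, ssto, f_statistic

# Three hypothetical samples of three observations each.
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
ssb, ssw, ssto, f_statistic = one_way_anova(groups)

print(f"SSB = {ssb}, SSW = {ssw}, SSTO = {ssto}")
print(f"F = {f_statistic} with degrees of freedom (2, 6)")
```

For this data SSTO equals SSB + SSW, matching the identity Total Variation = Explained Variation + Unexplained Variation.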