A Beginner's Blueprint to Understanding Data
Basic Definitions:
Population: all the members of a group about which we want to draw a conclusion; numerical measure: parameter; technique to collect data: census
Sample: the portion of the population selected for analysis; numerical measure: statistic; technique to collect data: sampling
Statistics: the collection, processing, presentation, analysis, and interpretation of data, giving the ability to use the appropriate tool correctly
Descriptive Statistics: summary of sample data
Inferential Statistics: conclusion is made about the population based on the sample
Variable: any characteristic, number, or quality that can be measured or counted
Categorical: describes groups or categories
Numerical: describes quantities
Measurement Scale: used to measure data, depending on data types:
For categorical:
For numeric:
Sources of data:
External/Secondary Source: prepared by others, time and cost-efficient, might not contain exact information
Own/Primary Source: collected oneself, time and cost issues
Techniques of data collection:
Census: investigate the whole population, time-consuming and expensive
Sample: investigate a subset of the population and draw conclusions about the whole population
Sampling methods: depend on whether they take probability into account or not
Non-probability samples: items included are chosen without considering their probability of occurrence, might be inaccurate and prone to biases, used when a representative sample is not essential
Probability samples: items are chosen based on known probabilities, a more reliable, generalized representation of the population, preferred when making inferences about a larger population
Survey Errors:
Data Visualization
Single Categorical data:
Single Numerical data:
Width ≈ range / number of desired classes
Usually, 5 to 15 classes are used
Class boundaries must be mutually exclusive (intervals should not overlap)
Classes must be collectively exhaustive (every observation should fall into one and only one interval)
Round up the interval width to get desirable endpoints
plot within each class, (relative frequency = frequency in each class / total frequency), used to compare samples with different sample sizes
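The class-width and relative-frequency rules above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive method: the function name `frequency_distribution` and the sample values are hypothetical, and the width is simply rounded up to the next whole number.

```python
import math

def frequency_distribution(values, num_classes):
    """Group values into equal-width classes; return (lower, upper, freq, rel_freq) rows."""
    low, high = min(values), max(values)
    # Width ~= range / number of classes, rounded up to a whole number (at least 1).
    width = math.ceil((high - low) / num_classes) or 1
    counts = [0] * num_classes
    for v in values:
        # Classes are [lower, upper), so each value lands in exactly one class;
        # the clamp keeps the maximum value inside the last class.
        idx = min(int((v - low) // width), num_classes - 1)
        counts[idx] += 1
    total = len(values)
    return [(low + i * width, low + (i + 1) * width, c, c / total)
            for i, c in enumerate(counts)]

# Hypothetical sample data, for illustration only.
data = [12, 15, 17, 21, 22, 24, 27, 31, 35, 38, 41, 44, 46, 53, 58]
table = frequency_distribution(data, num_classes=5)
for lower, upper, freq, rel in table:
    print(f"{lower} to under {upper}: frequency={freq}, relative frequency={rel:.3f}")
```

Because relative frequencies always sum to 1, two samples of different sizes can be compared class by class.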
Graphing frequency distribution:
Two categorical variables:
common values or percentages have row total and column total
Two numerical data:
Numerical descriptive measures:
Measure of central tendency:
Population mean (μ)
Sample mean (x̄)
First quartile position (Q1) = (n+1)/4; Q1 is the 25th percentile, i.e. 25% of the data sits below this value
Second quartile position (Q2) = (n+1)/2, aka the median; Q2 is the 50th percentile, i.e. 50% of the data sits below this value
Third quartile position (Q3) = 3(n+1)/4; Q3 is the 75th percentile, i.e. 75% of the data sits below this value
If the position is a whole number, use the value at that ranked position; if it ends in .5, take the midpoint of the two neighbouring values; otherwise round off to the nearest position
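As a rough sketch of the quartile position rules above, the helper below (a hypothetical name, `quartile`) works on ranked data and assumes at least three observations. The data values are made up for illustration.

```python
def quartile(sorted_values, k):
    """Return Qk (k = 1, 2, 3) using the ranked position k*(n+1)/4."""
    n = len(sorted_values)
    pos = k * (n + 1) / 4      # 1-based ranked position
    whole = int(pos)           # whole part of the position
    frac = pos - whole
    if frac == 0:
        # Whole-number position: use the value at that rank.
        return sorted_values[whole - 1]
    if frac == 0.5:
        # Halfway between two ranks: take the midpoint of the neighbours.
        return (sorted_values[whole - 1] + sorted_values[whole]) / 2
    # Otherwise round to the nearest ranked position.
    return sorted_values[round(pos) - 1]

# Hypothetical data, ranked from smallest to largest.
data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
print(q1, q2, q3)  # 12.5 16 19.5
```

Note that statistical packages often interpolate between ranks instead of rounding, so their quartiles can differ slightly from this rule.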
Measure of Variation or Dispersion:
extreme values
Sample variance = sum of the squared distances between every data point and the mean / (n-1)
Standard deviation: square root of the sample variance, affected by extreme outliers, is an absolute measure
Coefficient of variation: a relative measure of variation, CV = (SD/mean)*100%
Z scores: how far a value sits from the mean, measured in standard deviations, Z=(X-mean)/SD
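The sample variance, standard deviation, coefficient of variation, and Z-score formulas above can be checked with a short sketch. The function name `describe` and the data are hypothetical.

```python
import math

def describe(values):
    """Return mean, sample variance, SD, CV (%) and the Z score of every value."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance divides the sum of squared deviations by n - 1.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    sd = math.sqrt(variance)
    cv = sd / mean * 100                      # relative measure, in percent
    z_scores = [(x - mean) / sd for x in values]
    return mean, variance, sd, cv, z_scores

# Hypothetical data, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean, variance, sd, cv, z_scores = describe(data)
print(f"mean={mean:.2f} variance={variance:.4f} sd={sd:.4f} cv={cv:.2f}%")
print(f"Z score of {data[-1]}: {z_scores[-1]:.2f}")
```

Z scores always sum to zero, which is a quick sanity check on the calculation.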
Measure of shape:
Skewness: measures the degree of asymmetry in a distribution
values, mean < median
values, mean > median
Kurtosis: measures the degree of peakedness or flatness of a distribution
How do we identify the shape of a distribution?
Where does the data fit within the distribution relative to mean and SD?
> μ ± 1σ contains about 68.26% of the values of the population.
> μ ± 2σ contains about 95.44% of the values of the population.
> μ ± 3σ contains about 99.73% of the values of the population.
Z score: measures how many SD the data is from the mean
Detecting outliers:
below (Lower fence =Q1-1.5xIQR).
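A minimal sketch of fence-based outlier detection follows. The lower fence is the one given above; the upper fence, Q3 + 1.5 x IQR, is the standard counterpart and is an assumption here because it does not appear in the text. The quartiles come from `statistics.quantiles`, which interpolates between ranks, and the data are hypothetical.

```python
import statistics

def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values, q1, q3):
    """Values below the lower fence or above the upper fence."""
    lower, upper = outlier_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one unusually large value.
data = [11, 12, 13, 16, 16, 17, 18, 21, 22, 40]
q1, _, q3 = statistics.quantiles(data, n=4)   # quartile cut points
lower_fence, upper_fence = outlier_fences(q1, q3)
outliers = find_outliers(data, q1, q3)
print(lower_fence, upper_fence, outliers)
```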
Numerical descriptive measures from frequency distribution:
Descriptive measures for the relationship between two variables:
The strength of the relationship: how strongly one influences other
The direction of the relationship: is the influence positive or negative
Probability and discrete probability distribution:
Probability: the likelihood that an event will occur, ranging between 0 and 1; 3 approaches to assigning probability to an event:
A priori classical probability: based on prior knowledge
Empirical classical probability: based on observed data or study
Subjective probability: individual judgment or opinion, less precise than the above two
Events: a particular outcome of a random experiment, e.g. getting a 3 when rolling a die
Simple event (A): outcome with a single characteristic A, e.g. Rolling a 6 on a single die.
Complement of an event (A’): all outcomes that do not include A, e.g. if event A is "rolling a 6 on a die," then A' is "rolling any number other than 6."
Joint event (A ∩ B): two or more characteristics occurring simultaneously, e.g. rolling a 6 and flipping a coin that lands on heads
Mutually exclusive events: events that cannot occur together, e.g. getting both a head and a tail in a single coin toss; P(A and B) = 0, so P(A or B) = P(A) + P(B)
Collectively exhaustive events: events that together cover all possible outcomes, so at least one of them must occur, e.g. when rolling a die, the events of rolling a 1, 2, 3, 4, 5, or 6 are collectively exhaustive; if the events are also mutually exclusive, P(A) + P(B) + P(C) + ... = 1
Sample space: the collection of all possible outcomes, e.g. the 6 faces of a die
Visualizing events:
Marginal, Joint, and Conditional probability:
Marginal probability p(A): the unconditional probability of an event occurring, e.g. probability of drawing a red card, p(red) = 0.5
p(A) = p(A|B1)*p(B1) + p(A|B2)*p(B2) + ... + p(A|Bn)*p(Bn), where B1, ..., Bn are mutually exclusive and collectively exhaustive
Joint probability p(A ∩ B): probability of events A and B occurring together, e.g. the drawn card is both red and a four
P(A ∩ B) = P(A|B)*P(B) (if statistically dependent)
Or
P(A ∩ B) = P(A)*P(B) (if statistically independent)
Conditional probability p(A|B): Probability of event A occurring given that B has occurred, e.g. given that you drew a red card, what is the probability that it is four?
p(A|B) = p(A ∩ B)/p(B) (probability of A given B has occurred)
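The card example above can be worked through by enumerating a standard 52-card deck. This is only an illustrative sketch; the helper names (`prob`, `is_red`, `is_four`) are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUIT_COLOURS = {"hearts": "red", "diamonds": "red", "clubs": "black", "spades": "black"}
DECK = [(rank, suit) for rank in RANKS for suit in SUIT_COLOURS]

def prob(event):
    """Probability of an event = favourable cards / all cards."""
    return Fraction(sum(1 for card in DECK if event(card)), len(DECK))

def is_red(card):
    return SUIT_COLOURS[card[1]] == "red"

def is_four(card):
    return card[0] == "4"

p_red = prob(is_red)                                       # marginal
p_red_and_four = prob(lambda c: is_red(c) and is_four(c))  # joint
p_four_given_red = p_red_and_four / p_red                  # conditional
print(p_red, p_red_and_four, p_four_given_red)  # 1/2 1/26 1/13
```

Here p(four | red) equals the marginal p(four) = 1/13, so colour and rank are independent in this example.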
Probability rules:
General addition rule: p(A or B)=p(A)+p(B)-p(A and B) (either A or B occurs)
If events are mutually exclusive, p(A or B)=p(A)+p(B)
Conditional probability: p(A|B)=p(A and B)/p(B) (probability of A given B has occurred)
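Decision tree technique: the first set of branches gives the marginal probabilities, the sub-branches give the conditional probabilities, and multiplying along a path to the end of a branch gives the joint probability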
Statistical independence: two events are statistically independent if:
p(A|B) = p(A) or p(B|A)= p(B)
I.e. probability of one event is not affected by the other
E.g. result of first coin toss doesn’t affect the result of second coin toss
Multiplication rule:
P(A ∩ B) = P(A|B)*P(B) (if statistically dependent)
Or
P(A ∩ B) = P(A)*P(B) (if statistically independent)
Bayes' theorem:
An extension of conditional probability
A technique used to revise previously calculated probability if new information is added
p(Bi|A) = (p(A|Bi)*p(Bi))/p(A)
Used when given p(A|B) and we need to calculate p(B|A)
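A small sketch of the revision step described above: each prior p(Bi) is multiplied by p(A|Bi), and the results are divided by p(A), obtained from the total probability formula. The function name `bayes_posterior` and all numbers are hypothetical.

```python
def bayes_posterior(priors, likelihoods):
    """Return p(Bi|A) for every Bi, given p(Bi) and p(A|Bi)."""
    # Joint probabilities p(A and Bi) = p(A|Bi) * p(Bi).
    joint = [p_b * p_a_given_b for p_b, p_a_given_b in zip(priors, likelihoods)]
    # Marginal p(A) is the sum of the joint probabilities.
    p_a = sum(joint)
    return [j / p_a for j in joint]

# Hypothetical inputs: two mutually exclusive, collectively exhaustive events B1, B2.
priors = [0.4, 0.6]          # p(B1), p(B2)
likelihoods = [0.8, 0.3]     # p(A|B1), p(A|B2)
posteriors = bayes_posterior(priors, likelihoods)
print([round(p, 2) for p in posteriors])  # [0.64, 0.36]
```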
Counting rules:
Rule 1:
If any one of k mutually exclusive and collectively exhaustive events can occur on each of n trials, the number of possible outcomes is k^n
E.g. tossing a coin 5 times, the number of possible outcomes is 2^5=32
Rule 2:
If there are K1 events on the 1st trial, K2 events on the 2nd trial, ..., Kn events on the nth trial, the number of possible outcomes is K1*K2*...*Kn
E.g. a license number has 3 letters followed by 3 digits; the number of possible combinations is 26^3*10^3 = 17,576,000
Rule 3:
Number of ways a set of n items can be arranged in order
n! = n*(n-1)*...*1, where n! is n factorial
E.g. set of 3 textbooks to be placed on shelf, the book can be arranged in 3!=3*2*1=6 ways
Rule 4: (permutation)
In how many ways can a sub-group of the entire group be arranged in order?
If x items are selected from n items and arranged IN AN ORDER, the number of arrangements is:
nPx = n!/(n-x)!
E.g. from 6 textbooks, only 4 textbooks can fit on a shelf; the books can be arranged on the shelf in
6P4 = 6!/(6-4)! = 360 ways
Rule 5: (combination)
In how many ways can a subgroup be selected from a group when we are NOT INTERESTED IN ORDER?
x items can be selected from n items, irrespective of order, in:
nCx = n!/(x!*(n-x)!)
E.g. choose 4 textbooks from 6 textbooks to place on a shelf
6C4 = 6!/(4!*(6-4)!) = 15 ways
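The five counting rules can be verified with Python's standard library (`math.perm` and `math.comb` need Python 3.8 or later). The variable names are just for illustration.

```python
import math

coin_outcomes = 2 ** 5                 # Rule 1: k^n, 5 coin tosses
license_plates = 26 ** 3 * 10 ** 3     # Rule 2: K1*K2*...*Kn, 3 letters then 3 digits
book_orders = math.factorial(3)        # Rule 3: n!, 3 textbooks on a shelf
shelf_arrangements = math.perm(6, 4)   # Rule 4: nPx, order matters
shelf_selections = math.comb(6, 4)     # Rule 5: nCx, order ignored

print(coin_outcomes, license_plates, book_orders, shelf_arrangements, shelf_selections)
# 32 17576000 6 360 15
```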
Probability distributions:
An equation that associates a particular probability of occurrence with each outcome in the sample space
Probability distribution of discrete random variables:
Is a mutually exclusive list of all possible numerical outcomes of a random variable with a probability of occurrence associated with each outcome
Expected value or mean = E(X)=μ=sum(Xi*p(Xi)) where i = 1 to N
E.g. Binomial distribution, Poisson distribution
Binomial distribution:
Scenarios with outcome of SUCCESS or failure, REPEATED over a series
E.g. manufacturing plant labels items as defective or acceptable
4 properties:
Q. A customer has a 35% probability of making a purchase; 10 customers enter the shop. What is the probability of 3 customers making a purchase?
p = 0.35, n = 10, p(3) = ?
p(x=3)=(10!/(3!(10-3)!))*0.35^3*(1-0.35)^(10-3)=0.2522
mean=np
Variance = np(1-p)
SD=sqrt(variance), n is sample size, p is probability of success
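The worked binomial example above can be reproduced with a short sketch; `binomial_pmf` is a hypothetical helper name.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.35
p_three = binomial_pmf(3, n, p)
mean = n * p
variance = n * p * (1 - p)
sd = sqrt(variance)
print(f"p(x=3)={p_three:.4f} mean={mean:.2f} variance={variance:.3f} sd={sd:.3f}")
```

Summing `binomial_pmf(x, n, p)` over x = 0..n gives 1, which confirms the probabilities form a valid distribution.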
Poisson distribution:
Number of times an event occurs in an interval of time or space
E.g. number of enquiries in an hour
Scenarios:
Probability of the event occurring in any interval is the same for all intervals of the same size
Number of occurrences of the event in one interval is independent of the number in other intervals
As the interval becomes smaller, the probability of two or more occurrences of the event in the interval approaches zero
Formula:
Mean = λ
Variance = λ
Standard deviation = sqrt(λ)
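For reference, the standard Poisson probability function is P(X = x) = e^(-λ) * λ^x / x!. The sketch below applies it to a hypothetical rate of 3 enquiries per hour; the rate and the helper name `poisson_pmf` are assumptions for illustration.

```python
from math import exp, factorial, sqrt

def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the expected number is lam."""
    return exp(-lam) * lam ** x / factorial(x)

lam = 3                          # hypothetical: 3 enquiries per hour on average
p_two = poisson_pmf(2, lam)      # probability of exactly 2 enquiries in an hour
sd = sqrt(lam)                   # mean and variance are both lam
print(f"P(X=2)={p_two:.4f} mean={lam} variance={lam} sd={sd:.4f}")
```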
Continuous probability distribution:
E.g. normal distribution, uniform distribution, exponential distribution
Since the random variable takes continuous values, the probability of any single exact point is 0, so we find the area under the curve over a range to get the probability that the random variable falls within that range
Normal distribution:
Scenarios:
Most of the values are clustered around center value
E.g. measuring the weight of packages on a manufacturing belt, investigating the length of calls in a call center, etc.
Most of the values are fairly constant with only a few outliers
Histogram: symmetrical, bell shaped
Properties:
Mean, median and mode are equal
Center location is mean = μ
Spread is standard deviation = σ
X has infinite theoretical range of -infinity to +infinity
Standardized normal distribution:
Mean = 0, SD = 1; values above the mean have +Z and values below have -Z
To standardize X into Z, Z=(X-μ)/σ
Applying formula doesn’t change distribution but just the scale
Finding normal probabilities:
Probability is measured by area under the curve
Total area under the curve has P=1
Empirical rule:
μ ± 1σ contains about 68.26% of observations.
μ ± 2σ contains about 95.44% of observations.
μ ± 3σ contains about 99.73% of observations
The z table doesn’t give the probability at a single point but the cumulative probability below the specified Z value
E.g. find probability of p(X<8.6) with mean 8 and SD 5
Z=(8.6-8)/5 = 0.12
From z table, P(Z<0.12) = 0.5478
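Instead of a printed z table, the same lookup can be done with `statistics.NormalDist` from the standard library (Python 3.8+). This sketch reuses the numbers from the example above.

```python
from statistics import NormalDist

mu, sigma, x = 8, 5, 8.6
z = (x - mu) / sigma                       # standardize X into Z
p_from_z = NormalDist().cdf(z)             # P(Z < z) on the standard normal
p_direct = NormalDist(mu, sigma).cdf(x)    # same probability without standardizing
print(f"Z={z:.2f} P(X<{x})={p_from_z:.4f}")

# Share of values within 1, 2 and 3 SD of the mean (empirical rule).
within = [NormalDist().cdf(k) - NormalDist().cdf(-k) for k in (1, 2, 3)]
print([round(w, 4) for w in within])
```

The exact areas differ from the table-based percentages in the last decimal place because printed tables round intermediate values.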
How to know if data is normally distributed?
Compare with characteristics of normal distribution
Symmetrical so mean and median are equal (do 2/3 of the values lie between the mean ±1 standard deviation?)
Bell shaped so empirical rule applies
Interquartile range is approximately 1.33 SD
Range is theoretically infinite (in practice, range is approx. 6*SD)
Construct a normality plot
Box and whisker plot, histogram, polygon
Quantile-quantile plot (refer slides), line drawn is straight
Uniform distribution/rectangular distribution:
Has equal probabilities for all outcomes
Density function, f(X) = 1/(b-a) for a <= X <= b, where a = minimum value of X, b = maximum value of X
Mean = (a+b)/2
SD = sqrt((b-a)^2/12)
p(X<=c) = (c-a)/(b-a)
To find a probability, calculate the area of the rectangle
E.g. with a = 0 and b = 1, p(0.1<X<0.3) = base*height = 0.2*1 = 0.2
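The rectangle-area calculation can be sketched as follows, assuming a uniform distribution on the interval from a = 0 to b = 1 as in the example above. The function name `uniform_prob` is hypothetical.

```python
from math import sqrt

def uniform_prob(c, d, a, b):
    """P(c < X < d) for X uniform on [a, b]: base * height of the rectangle."""
    height = 1 / (b - a)                  # density f(X) = 1/(b - a)
    base = min(d, b) - max(c, a)          # clip the range to [a, b]
    return max(base, 0) * height

a, b = 0, 1
p = uniform_prob(0.1, 0.3, a, b)
mean = (a + b) / 2
sd = sqrt((b - a) ** 2 / 12)
print(f"P(0.1<X<0.3)={p:.2f} mean={mean} sd={sd:.4f}")
```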
Exponential distribution:
Continuous right-skewed distribution ranging from 0 to +ve infinity
Mean > median
Scenarios: used to model time between random, independent events
E.g: time between arrival of customers in a restaurant
λ=expected number of events per interval
mean=Sd=1/λ
Q. Customers arrive at the rate of 15 per hour; find the probability that the time between arrivals is less than 3 minutes.
Lambda = 15 per hour = 0.25 per minute
p(X<3)= 1-e^(-0.25*3) = 0.5276
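The arrival-time example above uses the cumulative form P(X < x) = 1 - e^(-λx). A minimal sketch, with `exponential_cdf` as a hypothetical helper name:

```python
from math import exp

def exponential_cdf(x, lam):
    """P(X < x) when events occur at an average rate of lam per unit of time."""
    return 1 - exp(-lam * x)

rate_per_minute = 15 / 60          # 15 arrivals per hour = 0.25 per minute
p = exponential_cdf(3, rate_per_minute)
mean_gap = 1 / rate_per_minute     # mean (and SD) of the time between arrivals
print(f"P(X<3)={p:.4f} mean time between arrivals={mean_gap} minutes")
```

The rate and the time must use the same unit, which is why the hourly rate is converted to a per-minute rate first.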
Sampling distribution:
Distribution of results if we actually selected all possible samples
Helps to know how well the sample describes the population
Central limit theorem:
As the sample size (i.e. the number of values in each sample) gets large enough (generally n ≥ 30), the sampling distribution of the mean is approximately normally distributed, regardless of the shape of the distribution of the individual values in the population.
Z formula for applying sampling distribution:
Z = (X̄ - μ)/(σ/sqrt(n))
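As an illustration of the Z formula for a sample mean, the sketch below uses hypothetical values (population mean 368, population SD 15, sample size 25) and asks how likely a sample mean below 365 is. The function name is also hypothetical.

```python
import math
from statistics import NormalDist

def z_for_sample_mean(x_bar, mu, sigma, n):
    """Standardize a sample mean using the standard error sigma / sqrt(n)."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

# Hypothetical population and sample values.
z = z_for_sample_mean(x_bar=365, mu=368, sigma=15, n=25)
p_below = NormalDist().cdf(z)
print(f"Z = {z:.2f}, P(sample mean < 365) = {p_below:.4f}")
```

The only difference from the Z score of a single value is the denominator: the standard error shrinks as n grows, so sample means cluster more tightly around the population mean.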
Sampling distribution of the mean
Sampling distribution of the proportion (for categorical variables)
π is the proportion of items of interest in the population, n is the sample size
If n*π >= 5 and n*(1-π) >= 5, we can use the normal distribution to approximate the sampling distribution of the proportion
Inferential techniques:
Techniques for drawing conclusions about a population based on a sample
Parametric technique: assumes the data follows a specific probability distribution
E.g. confidence interval, hypothesis testing
Non-parametric: doesn’t assume any distribution, suitable if the sample is small and the data does not follow a normal distribution
Confidence Interval:
Range of values around a point estimate
Based on observation from one sample but takes into consideration variation from sample to sample
Gives information about the closeness to unknown population parameters
Never 100% confident
Used when we have no idea about value of population parameter being investigated
Confidence interval = point estimate ± critical value * standard error
Level of confidence:
Most common are 90%, 95%, and 99%, with respective alpha = 0.1, 0.05, 0.01
Level of confidence = (1-alpha)
Alpha = significance level, probability of making a type 1 error
There is a tradeoff between confidence and precision (a wider interval has more confidence but the value becomes less precise and less useful)
E.g. I am 99% confident that my salary is between 50k and 100k
I am 95% confident that my salary is between 50k and 60k
Confidence interval estimate of mean (sigma known)
Use normal distribution
Sigma is population Sd
Z is the critical value for alpha/2
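A sketch of the sigma-known interval, assuming the usual form X̄ ± Z * σ/sqrt(n) (the formula itself is not shown above, so treat it as an assumption). The sample values and the function name `z_confidence_interval` are hypothetical; the critical value comes from `statistics.NormalDist` (Python 3.8+).

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population SD is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha/2 in each tail
    margin = z * sigma / math.sqrt(n)         # critical value * standard error
    return x_bar - margin, x_bar + margin

# Hypothetical sample: mean 50, known population SD 10, n = 25.
lower, upper = z_confidence_interval(x_bar=50, sigma=10, n=25)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")

# A higher confidence level gives a wider, less precise interval.
lower99, upper99 = z_confidence_interval(x_bar=50, sigma=10, n=25, confidence=0.99)
```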
Confidence interval estimate of mean (sigma unknown)
Use student-t distribution
Use sample SD, S
Degree of freedom=n-1
T table gives upper tail area of alpha
Confidence interval estimate of proportion:
Z is the critical value from the standardized normal distribution with alpha/2 in each tail (degrees of freedom, n-1, apply only to the t distribution)
Determining sample size for mean:
Appropriate sample size is required to balance the confidence and precision
Determining sample size for proportion:
Always ROUND UP!
When determining the sample size for a population proportion for a given level of confidence and sampling error, the closer to 0.50 the population proportion is estimated to be, the larger the sample size required
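The sample-size points above can be illustrated with the commonly used formulas n = Z²σ²/e² for a mean and n = Z²π(1-π)/e² for a proportion, where e is the acceptable sampling error. These formulas are not spelled out above, so take them as assumptions; the inputs and function names are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, error, confidence=0.95):
    """Smallest n so that the margin of error for the mean is at most `error`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / error) ** 2)             # always round up

def sample_size_for_proportion(pi, error, confidence=0.95):
    """Smallest n so that the margin of error for the proportion is at most `error`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * pi * (1 - pi) / error ** 2)  # always round up

n_mean = sample_size_for_mean(sigma=45, error=5)
n_half = sample_size_for_proportion(pi=0.5, error=0.03)
n_fifth = sample_size_for_proportion(pi=0.2, error=0.03)
print(n_mean, n_half, n_fifth)  # 312 1068 683
```

The proportion estimated at 0.5 needs the larger sample, because π(1-π) is at its maximum there.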
Ethical issue for confidence interval:
A confidence interval estimate (reflecting sampling error) should always be included when reporting a point estimate.
The level of confidence should always be reported.
The sample size should be disclosed.
An interpretation of the confidence interval estimate should also be provided.
Hypothesis testing
Used when we have some idea of population parameter being evaluated
If we have: prior knowledge, prior experience, a standard, a claim
Checks if the sample statistics is consistent with assumed population parameter or not
Identify hypothesis:
Null hypothesis (Ho): has =, >=, <= sign, supports status quo
Alternative hypothesis (H1): challenges status quo, has >, < , ≠
Is the claim a researcher is trying to prove
Non-rejection region: between the critical values on either side of the mean, cannot reject null hypothesis
Rejection region: remaining region, reject null hypothesis
Errors in decision making:
Type 1 error: rejecting Ho when it is actually true
Its probability is the level of significance (alpha)
Type 2 error: not rejecting Ho when it is actually false
Its probability is represented by beta; we do not control it directly, and it depends on the difference between the hypothesized and real value
Large difference = small beta
Hence, we set up our hypothesis test to limit the more serious error: the type 1 error is usually treated as more serious, so we control it by choosing alpha; for a given alpha, beta can be reduced by increasing the sample size
Conservative approach: accept Ho if only there are sufficient proofs to validate it
The p-value approach:
If p >=alpha, do not reject Ho
If p<alpha, reject Ho
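To make the p-value decision rule concrete, here is a sketch of a two-tailed Z test for a mean with known population SD. The hypothesized mean, sample values, and function name are all hypothetical.

```python
import math
from statistics import NormalDist

def two_tailed_z_test(x_bar, mu0, sigma, n, alpha=0.05):
    """Return (Z statistic, p-value, reject Ho?) for Ho: mu = mu0 vs H1: mu != mu0."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    # Two-tailed p-value: area in both tails beyond |Z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Hypothetical test: Ho: mu = 368, sample mean 372.5, sigma 15, n = 25.
z, p_value, reject = two_tailed_z_test(x_bar=372.5, mu0=368, sigma=15, n=25)
print(f"Z={z:.2f} p-value={p_value:.4f} reject Ho: {reject}")
```

Here the p-value is larger than alpha = 0.05, so Ho is not rejected.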
Ethical considerations for hypothesis test:
Sample data must be selected randomly to reduce selection bias and coverage error
Humans should be informed before being surveyed
Choose level of significance and type of test before data collection
Do not cleanse data to hide observations that do not support stated hypothesis
Distinguish between statistical and practical significance
Simple linear regression:
Regression analysis is used to:
Predict the value of a dependent variable (Y) based on at least one independent variable (X)
Explains impact of change in independent variable on the dependent variable
Simple linear regression, Yi=Bo+B1Xi+Ei
B1= mean amount of Y change for one-unit X change
Bo = mean value of Y when X=0
Ei= random error that explains difference between predicted and observed value
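Interpolation, Extrapolation: predictions are valid only within the range of X values used to fit the model (interpolation); predicting outside that range (extrapolation) is unreliable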
Measure of variation to find how well model fits in our data:
SST, total sum of squares: variation of the Y values around their mean
SSR, regression sum of squares: variation attributed to the relationship between X and Y
SSE, error sum of squares: variation attributed to factors other than the relationship between X and Y
SST= SSR + SSE
Least square method minimizes SSE.
Coefficient of determination, R-square
R-squared = SSR/SST, value ranges between 0 and 1
Standard error of estimate
Standard deviation of the variability of Y around the prediction line
Sxy = sqrt(SSE/(n-2))
The larger the standard error, the more the data deviates from the regression line
Should be judged relative to the size of Y values
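The measures of variation above can be computed directly from a least-squares fit. The sketch below uses made-up data and a hypothetical function name; it is meant to show how SST, SSR, SSE, R-squared, and the standard error of estimate relate, not to replace a statistics package.

```python
import math

def simple_linear_regression(x, y):
    """Least-squares fit of Y = b0 + b1*X plus the measures of variation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    predicted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                     # total variation
    ssr = sum((pi - y_bar) ** 2 for pi in predicted)             # explained by X
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))    # unexplained
    return {"b0": b0, "b1": b1, "sst": sst, "ssr": ssr, "sse": sse,
            "r_squared": ssr / sst, "std_error": math.sqrt(sse / (n - 2))}

# Hypothetical data, for illustration only.
fit = simple_linear_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print({name: round(value, 4) for name, value in fit.items()})
```

For this data the fit is Y = 2.2 + 0.6X, with SST = 6, SSR = 3.6, SSE = 2.4, so R-squared = 0.6.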
Residual analysis
Graphical technique to find if simple linear regression is a good fit for the data or not
4 key assumptions called LINE:
Linearity: X and Y have a linear relationship
Independence of errors: the errors are not related to one another
Normality of error: for any given value of X, the error is normally distributed
Equal variance (homoscedasticity): the probability distribution of the error has constant variance, i.e. the variability of Y will be the same for low or high values of X
Inferential techniques to test if regression is good fit or not
T test for population slope
F test for significance
Confidence interval estimate for the slope
T test for the correlation coefficient
T test for population slope:
Ho: B1 = 0 (no linear relationship)
H1: B1 ≠ 0 (linear relationship exists)
Degree of freedom n-k-1 = n-2
T = (b1-B1)/Sb1, where Sb1 is the standard error of the slope and b1 is the sample regression slope coefficient
If |T| > critical t value (equivalently, p-value < alpha), reject Ho
F test for significance:
F = MSR/MSE where MSR = SSR/k and MSE = SSE/(n-k-1)
Confidence interval estimate for the slope:
T test for the correlation coefficient
Rho is correlation coeff, r is sample correlation coeff
Multiple linear regression:
Adjusted R-square:
R-square never decreases when another X variable is added
Hence, adjusted R-square is used
It penalizes the excessive use of unimportant independent variables
Is smaller than R-square
Is useful in comparing among the models
Takes in account: sample size, number of independent variables
F test for overall significance:
Significance of individual variable:
Indications of strong collinearity
incorrect signs on the coefficients
large change in the value of a previous coefficient when a new variable is added to the model
a previously significant variable becomes non-significant when a new independent variable is added
the estimate of the standard deviation of the model increases when a variable is added to the model.
VIF (variance inflation factor)
VIFj = 1/(1 - Rj-squared), where Rj-squared is the R-square from regressing Xj on all the other X variables
VIF = 1: uncorrelated with other X; VIF > 10: highly correlated with other X
Classical multiplicative time series model components:
4 components: trend, seasonal, cyclical, irregular
Trend, T: data taken over long time, persistent increase or decrease
Seasonal, S: regular short-term fluctuations
Cyclical, C: long-term wave-like pattern, repeats every 2-10 years, has phases: peak > contraction > depression > expansion
Irregular, I: random
multiplicative time-series model for annual data: Yi = Ti*Ci*Ii
multiplicative time-series model with seasonal component: Yi = Ti*Ci*Ii*Si
Smoothing technique:
Moving average
Exponential smoothing: weighted moving average in which the weights go on decreasing for older data, so more weight is given to recent data
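Both smoothing techniques can be sketched briefly. The exponential version assumes the common recursive form E1 = Y1, Ei = W*Yi + (1-W)*E(i-1), where W is a smoothing weight between 0 and 1; that form, the series values, and the function names are assumptions for illustration.

```python
def moving_average(series, window):
    """Simple moving average: the mean of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def exponential_smoothing(series, weight):
    """Each smoothed value blends the new observation with the previous smoothed value."""
    smoothed = [series[0]]                       # E1 = Y1
    for value in series[1:]:
        # The weight on older observations shrinks by (1 - weight) at every step.
        smoothed.append(weight * value + (1 - weight) * smoothed[-1])
    return smoothed

# Hypothetical series, for illustration only.
series = [10, 12, 14, 13, 15]
ma3 = moving_average(series, window=3)
es = exponential_smoothing(series, weight=0.5)
print(ma3)  # [12.0, 13.0, 14.0]
print(es)   # [10, 11.0, 12.5, 12.75, 13.875]
```

A weight close to 1 tracks recent data closely; a weight close to 0 smooths more heavily.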
Model selection tips:
Linear: if the first differences are approximately equal
Quadratic: if the second differences are approximately equal
Exponential: if the percentage differences are approximately equal
PhD Student at SimulaMet | Enrichment student at the Alan Turing Institute