A Beginner's Blueprint to Understanding Data


Basic Definitions:

Population: all the members of the group about which we want to draw a conclusion; numerical measure: parameter; data-collection technique: census

Sample: the portion of the population selected for analysis; numerical measure: statistic; data-collection technique: sampling

Statistics: collection, processing, presentation, analysis, and interpretation of data, giving the ability to use the appropriate tool correctly

Descriptive Statistics: summary of sample data

Inferential Statistics: conclusion is made about the population based on the sample

Variable: any characteristic, number, or quality that can be measured or counted

Categorical: describes groups or categories

Numerical: describes quantities

  • Discrete: takes countable, distinct values (e.g. number of defects)
  • Continuous: can take any value within a range, so infinitely many possible values (e.g. weight)

Measurement Scale: used to measure data, depending on data types:

For categorical:

  • Nominal: uses labels or names of identified attributes
  • Ordinal: nominal characteristics + possess a meaningful order or rank

For numeric:

  • Interval: Ordinal properties + have fixed unit of measurement
  • Ratio: Interval properties + the ratio of two values is meaningful, has a true zero, i.e. zero means the quantity is absent

Sources of data:

External/Secondary Source: prepared by others, time and cost-efficient, might not contain exact information

Own/Primary Source: collected oneself, time and cost issues

Techniques of data collection:

Census: investigate the whole population, time-consuming and expensive

Sample: investigate a subset of the population and draw conclusions about the whole population

Sampling methods: classified by whether or not they take probability into account

Non-probability samples: items included are chosen without considering their probability of occurrence, might be inaccurate and prone to biases, used when a representative sample is not essential

  • Convenience sampling: A sample is selected if it is easy and inexpensive
  • Judgment sampling: The sample is selected based on the opinions of preselected experts
  • Quota sampling: a predetermined number of individuals from each subgroup is selected
  • Purposive sampling: individuals are selected based on their specific characteristics or knowledge
  • Snowball sampling: participants are asked to recommend others who might be suitable for the study

Probability samples: Samples are chosen based on known probabilities, a more reliable, generalized representation of the population, preferred when making inferences about a larger population

  • Simple random sampling: every individual has an equal chance of being selected
  • Systematic sampling: a starting point is selected and every nth individual is chosen
  • Stratified sampling: the population is divided into strata (homogeneous, specific characteristics) and a random sample is drawn from each stratum, with sample size proportional to stratum size
  • Cluster sampling: the population is divided into clusters (heterogeneous, have all characteristics) and one cluster is selected with all individuals from the cluster, which requires a larger sample size compared to stratified for the same level of precision

Survey Errors:

  • Coverage error: key members of the population are excluded
  • Nonresponse error: the individual doesn’t respond to the survey
  • Sampling error: random difference between samples
  • Measurement error: respondent error, ambiguously worded questions, halo effect, errors in measuring or recording data

Data Visualization

Single Categorical data:

  • Summary tables: show both counts and percentages
  • Bar chart: nominal data, used to compare frequency, percentages, proportions
  • Pie chart: nominal data, shows frequency or percentage, a portion of the total

Single Numerical data:

  • Ordered array: arranges data, signals variability within range, shows outliers, suitable for small data
  • Stem and leaf displays: sort data into groups (stem) and values within each group (leaf), not suitable for large data
  • Frequency distributions: data are arranged into intervals and we count how many observations fall into each, which helps to quickly review the data; each interval has the same width

Width ~= range / number of desired classes

Usually, 5 to 15 classes are used

Class boundaries must be mutually exclusive (intervals should not overlap)

Classes must be collectively exhaustive (all observations should fall into one and only one interval)

Round up the interval width to get desirable endpoints

  • Relative frequency distribution, percentage distribution, cumulative distribution: good for showing the proportion within each class (relative frequency = frequency in each class / total frequency); used to compare samples with different sample sizes (a short sketch follows below)
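As a rough sketch of the steps above (hypothetical data; numpy is assumed to be available), a frequency, relative frequency and cumulative distribution can be built like this:

  import numpy as np

  # hypothetical sample of 20 measurements
  data = [12, 15, 21, 22, 25, 27, 28, 31, 33, 34,
          36, 38, 41, 43, 45, 47, 52, 55, 58, 60]

  k = 5                                           # desired number of classes
  width = np.ceil((max(data) - min(data)) / k)    # round the width up
  edges = min(data) + width * np.arange(k + 1)    # mutually exclusive, collectively exhaustive bins

  freq, _ = np.histogram(data, bins=edges)
  rel_freq = freq / len(data)                     # relative frequency per class
  cum_freq = np.cumsum(freq)                      # cumulative distribution

  for lo, hi, f, rf, cf in zip(edges[:-1], edges[1:], freq, rel_freq, cum_freq):
      print(f"[{lo:.0f}, {hi:.0f}): freq={f}, rel={rf:.2f}, cum={cf}")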

Graphing frequency distribution:

  • Histograms
  • Polygons
  • Ogives

Two categorical variables:

  • Contingency table/cross-tabulation: one categorical variable in rows, one in columns; the intersections hold joint counts or percentages, with row totals and column totals

  • Side-by-side bar charts: two categorical variables plotted side by side
  • Stacked bar charts: each bar represents one categorical variable, and within each bar there is a division by another categorical variable

Two numerical data:

  • Scatter diagrams
  • Time series plot

Numerical descriptive measures:

Measure of central tendency:

  • Arithmetic mean: aka average, one very high or very low value can skew the mean making it less representative

Population mean (μ)

Sample mean (x̄)

  • Median: less affected by extreme values; median position = (n+1)/2 in the ordered data
  • Mode: highest repeating value, not affected by extreme values, dataset can be bimodal or multimodal i.e. two modes or multiple modes respectively
  • Percentiles:

First quartile position (Q1) = (n+1)/4; Q1 is the 25th percentile, i.e. 25% of the data sits below this point

Second quartile position (Q2) = (n+1)/2, aka the median; Q2 is the 50th percentile, i.e. 50% of the data sits below this value

Third quartile position (Q3) = 3(n+1)/4; Q3 is the 75th percentile, i.e. 75% of the data sits below this value

If the position is a whole number, use that value; if it ends in .5, take the midpoint of the two neighbouring values (or round off to the nearest position)
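A small illustrative sketch of the (n+1)-based positions above (the sorted dataset is hypothetical, and the midpoint is taken whenever a position falls between two values):

  def quartile(sorted_data, which):
      """Quartile by the (n+1) position rule; midpoint used for in-between positions."""
      n = len(sorted_data)
      pos = which * (n + 1) / 4          # which = 1, 2 or 3
      lower = int(pos)                   # whole-number part of the position (1-based)
      if pos == lower:
          return sorted_data[lower - 1]
      # position falls between two observations: take their midpoint
      return (sorted_data[lower - 1] + sorted_data[lower]) / 2

  data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])    # n = 9
  print(quartile(data, 1), quartile(data, 2), quartile(data, 3))
  # positions 2.5, 5 and 7.5 -> Q1 = 12.5, Q2 = 16, Q3 = 19.5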

Measure of Variation or Dispersion:

  • Range: largest value - smallest value; tells little about how the data is spread; highly affected by extreme values

  • Interquartile range: Q3 - Q1, resistant to the effect of extreme values, captures the region where most of the data is located
  • Variance: how data is distributed around the mean, affected by extreme outliers, is an absolute measure

Sample variance = sum of the squared distances between every data point and the mean / (n-1)

Standard deviation: square root of the sample variance, affected by extreme outliers, is an absolute measure

Coefficient of variation: a relative measure of variance, CV= (SD/mean)*100%

Z scores: how far a value sits from the mean, measured in standard deviations, Z=(X-mean)/SD
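A quick sketch (hypothetical sample; Python's statistics module) tying together the sample variance, standard deviation, coefficient of variation and z-score formulas above:

  import statistics

  sample = [4, 8, 6, 5, 3, 7, 9, 5]        # hypothetical data
  mean = statistics.mean(sample)
  var = statistics.variance(sample)        # divides by (n - 1)
  sd = statistics.stdev(sample)            # square root of the sample variance

  cv = sd / mean * 100                     # relative measure, in percent
  z_scores = [(x - mean) / sd for x in sample]

  print(f"mean={mean:.2f}, variance={var:.2f}, sd={sd:.2f}, CV={cv:.1f}%")
  print([round(z, 2) for z in z_scores])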

Measure of shape:

Skewness: measures the degree of asymmetry in a distribution

  • Symmetrical: bell-shaped, lower and upper halves mirror each other, mean = median
  • Negative or left-skewed: long tail to the left, indicates a few low values and mostly high values, mean < median

  • Positive or right-skewed: long tail to the right, indicates mostly low values and a few high values, mean > median

Kurtosis: measures the degree of peakedness or flatness of a distribution

  • Leptokurtic: peaked distribution
  • Platykurtic: flat distribution

How do we identify the shape of a distribution?

  • Five-number summary: Xsmallest, Q1, Median, Q3, Xlargest
  • Box and whisker plot: graphical representation of five-number summary

Where does the data fit within the distribution relative to mean and SD?

  • Empirical rule: if the data distribution is approximately bell-shaped or normal

> μ ± 1σ contains about 68.26% of the values of the population.

> μ ± 2σ contains about 95.44% of the values of the population.

> μ ± 3σ contains about 99.73% of the values of the population.

Z score: measures how many SD the data is from the mean

  • Chebyshev’s rule: if the data distribution is not bell-shaped or not normally distributed, at least (1 - 1/k^2) of the values lie within k standard deviations of the mean (for k > 1)

Detecting outliers:

  • Symmetrical distribution: a value is an outlier if it lies outside the range μ ± 3σ
  • Skewed distribution: Tukey’s 1.5 IQR rule: a value is an outlier if it lies above the upper fence = Q3 + 1.5 x IQR or below the lower fence = Q1 - 1.5 x IQR (see the sketch below)
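An illustrative outlier check with Tukey's fences (hypothetical data; numpy's default percentile method is used here, which can differ slightly from the (n+1) rule above):

  import numpy as np

  data = np.array([10, 12, 12, 13, 14, 15, 15, 16, 18, 45])   # 45 looks suspicious
  q1, q3 = np.percentile(data, [25, 75])
  iqr = q3 - q1

  lower_fence = q1 - 1.5 * iqr
  upper_fence = q3 + 1.5 * iqr
  outliers = data[(data < lower_fence) | (data > upper_fence)]

  print(f"Q1={q1}, Q3={q3}, fences=({lower_fence}, {upper_fence}), outliers={outliers}")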

Numerical descriptive measures from frequency distribution:

  • Mean: sum (midpoint * frequency) in each interval/sample size
  • Variance: refer formula
  • SD: refer formula
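A sketch of the grouped-data calculations (the standard grouped formulas are assumed; the class midpoints and frequencies below are made up):

  # class midpoints and how many observations fall in each class
  midpoints   = [15, 25, 35, 45, 55]
  frequencies = [ 3,  5,  5,  3,  4]

  n = sum(frequencies)
  mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

  # sample variance from grouped data: sum of f * (midpoint - mean)^2 / (n - 1)
  variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies)) / (n - 1)
  sd = variance ** 0.5

  print(f"n={n}, mean={mean:.2f}, variance={variance:.2f}, sd={sd:.2f}")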

Descriptive measures for the relationship between two variables:

  • Covariance, Cov(X, Y): measures strength and direction of the linear relationship between X and Y, scale-dependent so hard to use across different samples for comparison, has no range, and can be any real number

The strength of the relationship: how strongly one influences other

The direction of the relationship: is the influence positive or negative

  • The coefficient of correlation ( r ): measures the relative strength of the linear relationship, range -1 to 1; the closer to ±1, the stronger the relationship, and 0 means no correlation; a standardized value making it more suitable for comparison
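A minimal comparison of the two measures on hypothetical paired data (statistics.covariance and statistics.correlation require Python 3.10+):

  import statistics

  x = [2, 4, 6, 8, 10]
  y = [3, 7, 5, 11, 14]

  cov_xy = statistics.covariance(x, y)   # scale-dependent, can be any real number
  r = statistics.correlation(x, y)       # standardized, always between -1 and +1

  print(f"cov(X, Y) = {cov_xy:.2f}, r = {r:.3f}")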

Probability and discrete probability distribution:

Probability: the chance that an event will occur, ranging between 0 and 1; three approaches to assigning a probability to an event:

A priori classical probability: based on prior knowledge

Empirical classical probability: based on observed data or study

Subjective probability: individual judgment or opinion, less precise than the above two

Event: a particular outcome of a random experiment, e.g. getting a 3 when rolling a die

Simple event (A): outcome with a single characteristic A, e.g. Rolling a 6 on a single die.

Complement of an event (A’): all outcomes that do not include A, e.g. If event A is "rolling a 6 on a die," then A' is "rolling any number other than 6."

Joint event (AnB): two or more characteristics simultaneously (AnB), e.g. Rolling a 6 and flipping a coin that lands on heads

Mutually exclusive events: events that cannot occur together, e.g. getting both a head and a tail in a single coin toss; P(A and B) = 0, so P(A or B) = P(A) + P(B)

Collectively exhaustive events: events cover or exhaust all possible outcomes, so at least one of them must occur, e.g. when rolling a die, the events of rolling a 1, 2, 3, 4, 5, or 6 are collectively exhaustive, P(A) + P(B) + P(C) + ... = 1

Sample space: the collection of all possible outcomes, e.g. the 6 faces of a die

Visualizing events:

  • Contingency tables: compare two categorical variables; rows hold one event and its complement, columns hold another event and its complement; each cell represents a joint event, the grand total is the sample space, marginal probabilities sit on the row and column totals, joint probabilities on the intersections
  • Venn diagrams: an intersection is a joint event, the outside area is the complement of the event, and the bubble size shows the number of outcomes

Marginal, Joint, and Conditional probability:

Marginal probability p(A): the probability of an event occurring on its own (unconditional), e.g. the probability of drawing a red card, p(red) = 0.5

p(A) = p(A|B1)*p(B1) + p(A|B2)*p(B2) + ... + p(A|Bn)*p(Bn)

Joint probability p(AnB): probability of events A and B occurring together, e.g. the drawn card is both red and a four

P(A n B) = P(A|B) * P(B) (if statistically dependent)

Or

P(A n B) = P(A) * P(B) (if statistically independent)

Conditional probability p(A|B): probability of event A occurring given that B has occurred, e.g. given that you drew a red card, what is the probability that it is a four?

p(A|B)=p(A n B)/p(B) (probability of A given B has occurred)

Probability rules:

General addition rule: p(A or B)=p(A)+p(B)-p(A and B) (either A or B occurs)

If events are mutually exclusive, p(A or B)=p(A)+p(B)

Conditional probability: p(A|B)=p(A and B)/p(B) (probability of A given B has occurred)

Decision tree technique: the first branches carry marginal probabilities, the sub-branches carry conditional probabilities, and multiplying along a path gives the joint probability

Statistical independence: two events are statistically independent if:

p(A|B) = p(A) or p(B|A)= p(B)

I.e. probability of one event is not affected by the other

E.g. result of first coin toss doesn’t affect the result of second coin toss

Multiplication rule:

P(A n B) = P(A|B) * P(B) (if statistically dependent)

Or

P(A n B) = P(A) * P(B) (if statistically independent)

Bayes’ theorem:

An extension of conditional probability

A technique used to revise previously calculated probability if new information is added

p(Bi|A) = (p(A|Bi)*p(Bi))/p(A)

Used when given p(A|B) and we need to calculate p(B|A)
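A hypothetical worked example of Bayes' theorem (all numbers below are made up for illustration):

  # Suppose 1% of items are defective (B); a test flags 95% of defective items
  # and also flags 8% of good items. A = "the test flags the item".
  p_B = 0.01
  p_A_given_B = 0.95
  p_A_given_notB = 0.08

  # total probability of a flag, then Bayes' theorem for P(B | A)
  p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)
  p_B_given_A = p_A_given_B * p_B / p_A

  print(f"P(flag) = {p_A:.4f}, P(defective | flag) = {p_B_given_A:.3f}")   # ~0.107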

Counting rules:

Rule 1:

If any one of K mutually exclusive and collectively exhaustive events can occur on each of n trials, the number of possible outcomes is K^n

E.g. tossing a coin 5 times, the number of possible outcomes is 2^5=32

Rule 2:

If there are K1 events on the 1st trial, K2 events on the 2nd trial, ..., Kn events on the nth trial, the number of possible outcomes is K1*K2*...*Kn

E.g. a license number has 3 letters followed by 3 digits, so the number of possible combinations is 26^3 * 10^3 = 17,576,000

Rule 3:

Number of ways a set of n items can be arranged in order

n! = n*(n-1)*...*1, where n! is n factorial

E.g. set of 3 textbooks to be placed on shelf, the book can be arranged in 3!=3*2*1=6 ways

Rule 4: (permutation)

In how many ways a sub-group of the entire group can be arranged in order?

If x items are selected from n items and arranged IN ORDER, the number of arrangements is:

nPx = n!/(n-x)!

E.g. from 6 textbooks, only 4 fit on a shelf; the books can be arranged on the shelf in 6P4 = 6!/(6-4)! = 360 ways

Rule 5: (combination)

In how many ways can a subgroup be selected from a group when we are NOT INTERESTED IN ORDER?

X items selected from n items irrespective of order can be arranged in:

nCx=n!/(x!*(n-x)!)

E.g. choose 4 textbooks from 6 textbooks to place on a shelf

6C4 = 6!/(4!*(6-4)!) = 15 ways
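The five counting-rule examples above can be checked with Python's math module (math.perm and math.comb require Python 3.8+):

  import math

  print(2 ** 5)                 # Rule 1: 5 coin tosses -> 32 outcomes
  print(26 ** 3 * 10 ** 3)      # Rule 2: 3 letters + 3 digits -> 17,576,000 plates
  print(math.factorial(3))      # Rule 3: 3 books on a shelf -> 6 orderings
  print(math.perm(6, 4))        # Rule 4: arrange 4 of 6 books in order -> 360
  print(math.comb(6, 4))        # Rule 5: choose 4 of 6 books, order ignored -> 15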

Probability distributions:

An equation that associates a particular probability of occurrence with each outcome in the sample space

Probability distribution of discrete random variables:

Is a mutually exclusive list of all possible numerical outcomes of a random variable, with a probability of occurrence associated with each outcome

Expected value or mean = E(X)=μ=sum(Xi*p(Xi)) where i = 1 to N

E.g. Binomial distribution, Poisson distribution

Binomial distribution:

Scenarios with outcome of SUCCESS or failure, REPEATED over a series

E.g. manufacturing plant labels items as defective or acceptable

4 properties:

  • Has fixed number of observations: 15 coin tosses
  • Two mutually exclusive and collectively exhaustive categories: head or tail in a coin flip
  • Constant probability for each observation: p(head) is the same every time we toss a coin
  • Observations are independent: the outcome of one doesn’t affect the outcome of another

Q. A customer has a 35% probability of making a purchase; 10 customers enter the shop. What is the probability of exactly 3 customers making a purchase?

p = 0.35, n = 10, find p(X = 3)

p(x=3)=(10!/(3!(10-3)!))*0.35^3*(1-0.35)^(10-3)=0.2522

mean=np

Variance = np(1-p)

SD=sqrt(variance), n is sample size, p is probability of success
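The purchase example above, reproduced as a short sketch with math.comb:

  from math import comb

  n, p = 10, 0.35                          # 10 customers, 35% purchase probability
  x = 3
  prob = comb(n, x) * p ** x * (1 - p) ** (n - x)

  mean = n * p
  variance = n * p * (1 - p)

  print(f"P(X = 3) = {prob:.4f}")          # about 0.2522
  print(f"mean = {mean}, variance = {variance:.3f}, sd = {variance ** 0.5:.3f}")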

Poisson distribution:

Number of times an event occurs in an interval of time or space

E.g. number of enquiries in an hour

Scenarios:

Probability of an event occurring in any interval is the same for all intervals of the same size

The number of occurrences of the event in one interval is independent of other intervals

As the interval becomes smaller, the probability of two or more occurrences of the event in that interval approaches zero

Formula: p(X = x) = (e^(-λ) * λ^x) / x!, where x is the number of occurrences in the interval

Mean = λ

Variance = λ

Standard deviation = sqrt(λ)
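A short sketch of the Poisson formula (the rate of 4 enquiries per hour is a made-up value):

  from math import exp, factorial

  lam = 4                                      # hypothetical: 4 enquiries per hour on average
  x = 2
  prob = exp(-lam) * lam ** x / factorial(x)   # P(X = 2)

  print(f"P(X = 2) = {prob:.4f}")              # about 0.1465
  print(f"mean = {lam}, variance = {lam}, sd = {lam ** 0.5:.3f}")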

Continuous probability distribution:

E.g. normal distribution, uniform distribution, exponential distribution

Since the random variable takes continuous values, the probability of any particular point is essentially 0, so we find the area under the curve to determine the probability that the random variable falls within a range

Normal distribution:

Scenarios:

Most of the values are clustered around center value

E.g. measuring the weight of packages on a manufacturing belt, investigating the length of calls in a call center, etc

Most of the values are fairly constant with only little outliers

Histogram: symmetrical, bell shaped

Properties:

Mean, median and mode are equal

Center location is mean = μ

Spread is standard deviation = σ

X has infinite theoretical range of -infinity to +infinity

Standardizing the normal distribution:

Mean = 0, SD = 1; values above the mean have +Z and values below have -Z

To standardize X into Z, Z=(X-μ)/σ

Applying formula doesn’t change distribution but just the scale

Finding normal probabilities:

Probability is measured by area under the curve

Total area under the curve has P=1

Empirical rule:?

μ ± 1σ contains about 68.26% of observations.

μ ± 2σ contains about 95.44% of observations.

μ ± 3σ contains about 99.73% of observations

The Z table doesn’t pinpoint the probability at a value but gives the probability of being less than a specified Z value

E.g. find probability of p(X<8.6) with mean 8 and SD 5

Z=(8.6-8)/5 = 0.12

From z table, P(Z<0.12) = 0.5478
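The same P(X < 8.6) example, reproduced without a Z table by computing the standard normal CDF from math.erf:

  from math import erf, sqrt

  def normal_cdf(z):
      """Standard normal cumulative probability P(Z < z)."""
      return 0.5 * (1 + erf(z / sqrt(2)))

  mu, sigma = 8, 5
  x = 8.6
  z = (x - mu) / sigma
  print(f"z = {z:.2f}, P(X < 8.6) = {normal_cdf(z):.4f}")   # about 0.5478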

How to know if data is normally distributed?

Compare with characteristics of normal distribution

Symmetrical so mean and median are equal (do 2/3 of the values lie between the mean ± 1 standard deviation?)

Bell shaped so empirical rule applies

Interquartile range approximates to 4/3 SD

Theoretical range is infinite, but the practical range is approximately 6*SD

Construct a normality plot

Box and whisker plot, histogram, polygon

Quantile-quantile plot (refer slides): points lie along a straight line if the data are normal

Uniform distribution/rectangular distribution:

Has equal probabilities for all outcomes

Density function, f(X) = 1/(b-a), a = minimum value of X, b = maximum value of X

Mean = (a+b)/2

Sd=sqrt((b-a)^2/12)

p(X<=c) = (c-a)/(b-a)

To find a probability, calculate the area of the rectangle

E.g. for a = 0, b = 1: p(0.1<X<0.3) = base*height = 0.2*1 = 0.2

Exponential distribution:

Continuous right-skewed distribution ranging from 0 to positive infinity

Mean > median

Scenarios: used to model time between random, independent events

E.g: time between arrival of customers in a restaurant

λ=expected number of events per interval

mean=Sd=1/λ

Q. Customers arrive at the rate of 15 per hour; find the probability that the arrival time is less than 3 minutes

Lambda = 15 per hour = 0.25 per minute

p(X<3)= 1-e^(-0.25*3) = 0.5276
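The arrival-time example as a short sketch:

  from math import exp

  rate_per_hour = 15
  lam = rate_per_hour / 60           # 0.25 arrivals per minute

  t = 3                              # minutes
  p_less_than_t = 1 - exp(-lam * t)  # P(X < 3) for an exponential distribution

  print(f"P(arrival time < 3 min) = {p_less_than_t:.4f}")   # about 0.5276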

Sampling distribution:

Distribution of results if we actually selected all possible samples

Helps to know how well the sample describes the population

Central limit theorem:

as the sample size (i.e. the number of values in each sample) gets large enough (generally n ≥ 30), the sampling distribution of the mean is approximately normally distributed, regardless of the shape of the distribution of the individual values in the population.

Z formula for applying sampling distribution:

Z = (Xbar - μ) / (σ/sqrt(n))

Sampling distribution of the mean

Sampling distribution of the proportion (for categorical variables)

π is the proportion of items in the population, n is the sample size

If n*π >= 5 and n*(1-π) >= 5, we can apply the normal distribution to the proportion

Inferential techniques:

Techniques to draw conclusions about a population based on a sample

Parametric technique: assumes the data follows a specific probability distribution

E.g. confidence interval, hypothesis testing

Non-parametric: doesn’t assume any distribution, suitable when the sample is small and the data do not follow a normal distribution

Confidence Interval:

Range of values around a point estimate

Based on observations from one sample but takes into consideration the variation from sample to sample

Gives information about the closeness to unknown population parameters

Never 100% confident

Used when we have no idea about value of population parameter being investigated

Confidence interval = Point estimate ± (critical value * standard error)

Level of confidence:

Most common are: 90%, 95%, 99% or respective alpha = 0.1, 0.05, 0.01

The confidence level equals (1 - alpha)

Alpha = significance level, probability of making type 1 error

There is a tradeoff between confidence and precision (a wider interval gives more confidence, but the estimate becomes less precise and less useful)

E.g. I am 99% confident that my salary is between 50k and 100k

I am 95% confident that my salary is between 50k to 60k

Confidence interval estimate of mean (sigma known)

Use normal distribution

Sigma is population Sd

Z is the critical value for alpha/2

Confidence interval estimate of mean (sigma unknown)

Use student-t distribution

Use sample SD, S

Degree of freedom=n-1

T table gives upper tail area of alpha
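A sketch of a t-based confidence interval for the mean (the sample is hypothetical and SciPy is assumed to be available for the t critical value):

  import statistics
  from scipy import stats

  sample = [48, 52, 51, 47, 50, 53, 49, 52]   # hypothetical measurements
  n = len(sample)
  x_bar = statistics.mean(sample)
  s = statistics.stdev(sample)                # sigma unknown, so use the sample SD

  alpha = 0.05                                # 95% confidence
  t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
  margin = t_crit * s / n ** 0.5

  print(f"{x_bar:.2f} +/- {margin:.2f} -> ({x_bar - margin:.2f}, {x_bar + margin:.2f})")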

Confidence interval estimate of proportion:

Z is the critical value with alpha/2 in each tail

Determining sample size for mean:

Appropriate sample size is required to balance the confidence and precision

Determining sample size for proportion:

Always ROUND UP!

When determining the sample size for a population proportion for a given level of confidence and sampling error, the closer the estimated population proportion is to 0.50, the larger the required sample size

Ethical issue for confidence interval:

A confidence interval estimate (reflecting sampling error) should always be included when reporting a point estimate.

The level of confidence should always be reported.

The sample size should be disclosed.

An interpretation of the confidence interval estimate should also be provided.

Hypothesis testing

Used when we have some idea of population parameter being evaluated

If we have: prior knowledge, prior experience, a standard, a claim

Checks whether the sample statistic is consistent with the assumed population parameter

Identify hypothesis:

Null hypothesis (Ho): has =, >=, or <= sign, supports the status quo

Alternative hypothesis (H1): challenges the status quo, has >, <, or ≠; is the claim the researcher is trying to prove

Non-rejection region: between the critical values on either side of the mean; here we cannot reject the null hypothesis

Rejection region: remaining region, reject null hypothesis

Errors in decision making:

Type 1 error: rejecting Ho when it is actually correct

Also called level of significance (alpha)

Type 2 error: not rejecting Ho when it is actually false

We do not control this error directly; it depends on the difference between the hypothesized and real value and is represented by beta

The larger that difference, the smaller the beta error

Hence, we set up the hypotheses so that the more serious error is the type 1 error, which we control by choosing alpha; for a given alpha, both errors can be reduced by increasing the sample size

Conservative approach: do not reject Ho unless there is sufficient evidence against it

The p-value approach:

If p >=alpha, do not reject Ho

If p < alpha, reject Ho

Ethical considerations for hypothesis test:

Sample data must be selected randomly to reduce selection bias and coverage error

Human subjects should be informed before being surveyed

Choose level of significance and type of test before data collection

Do not cleanse data to hide observations that do not support stated hypothesis

Distinguish between statistical and practical significance

Simple linear regression:

Regression analysis is used to:

Predict the value of the dependent variable (Y) from at least one independent variable (X)

Explains impact of change in independent variable on the dependent variable

Simple linear regression, Yi=Bo+B1Xi+Ei

B1= mean amount of Y change for one-unit X change

Bo = mean value of Y when X=0

Ei= random error that explains difference between predicted and observed value

  • If X = 0 lies outside the range of observed values, Bo has no practical meaning; it simply captures the portion of Y not explained by X
  • r and b1 have the same sign

Interpolation vs. extrapolation: predictions are valid only within the observed range of X values

Measures of variation to find how well the model fits our data:

SST, total sum of squares: variation of the Y values around their mean

SSR, regression sum of squares: variation attributed to the relationship between X and Y

SSE, error sum of squares: variation attributed to factors other than the relationship between X and Y

SST= SSR + SSE

Least square method minimizes SSE.

Coefficient of determination, R-square

R-squared = SSR/SST, value ranges between 0 and 1

Standard error of estimate

Standard deviation of the variability of Y around the prediction line

Sxy = sqrt(SSE/(n-2))

The larger the standard error, the more the data deviate from the regression line

Should be judged relative to the size of Y values
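A rough sketch (hypothetical data) that fits a least-squares line with numpy and computes SST, SSE, SSR, R-square and the standard error of the estimate:

  import numpy as np

  # hypothetical paired observations
  x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
  y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

  b1, b0 = np.polyfit(x, y, 1)             # least-squares slope and intercept
  y_hat = b0 + b1 * x

  sst = np.sum((y - y.mean()) ** 2)        # total variation
  sse = np.sum((y - y_hat) ** 2)           # unexplained variation
  ssr = sst - sse                          # variation explained by the line
  r_squared = ssr / sst
  s_yx = np.sqrt(sse / (len(x) - 2))       # standard error of the estimate

  print(f"y_hat = {b0:.2f} + {b1:.2f}x, R^2 = {r_squared:.3f}, S_yx = {s_yx:.3f}")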

Residual analysis

Graphical technique to check whether simple linear regression is a good fit for the data

4 key assumptions called LINE:

Linearity: X and Y have linear relationship

Independence of errors: no errors are related to one another

Normality of error: for any given value of X, the error is normally distributed

Equal variance (homoscedasticity): the probability distribution of the error has constant variance, i.e. the variability of Y is the same for low and high values of X

Inferential techniques to test if regression is good fit or not

T test for population slope

F test for significance

Confidence interval estimate for the slope

T test for the correlation coefficient

T test for population slope:

Ho: B1 = 0 (no linear relationship)

H1: B1 ≠ 0 (linear relationship exists)

Degree of freedom n-k-1 = n-2

t = (b1 - B1)/Sb1, where Sb1 is the standard error of the slope and b1 is the regression slope coefficient

If |t| exceeds the critical t value (equivalently, if p < alpha), reject Ho

F test for significance:

F = MSR/MSE where MSR = SSR/k and MSE = SSE/(n-k-1)

Confidence interval estimate for the slope:

T test for the correlation coefficient

ρ is the population correlation coefficient, r is the sample correlation coefficient

Multiple linear regression:

Adjusted R-square:

R-square never decreases when another X variable is added

Hence, adjusted R-square is used

It penalizes the excessive use of unimportant independent variables

Is smaller than R-square

Is useful in comparing among the models

Takes into account the sample size and the number of independent variables

F test for overall significance:

Significance of individual variable:

Indications of strong collinearity

  • incorrect signs on the coefficients
  • large change in the value of a previous coefficient when a new variable is added to the model
  • a previously significant variable becomes non-significant when a new independent variable is added
  • the estimate of the standard deviation of the model increases when a variable is added to the model

VIF (variance inflation factor)

VIFj = 1/(1 - Rj^2), where Rj^2 comes from regressing Xj on the other independent variables

VIF = 1: uncorrelated with the other X variables; VIF > 10: highly correlated with the other X variables
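An illustrative VIF computation (the predictors are hypothetical; the third is built to be almost a copy of the first, so its VIF should come out large):

  import numpy as np

  def vif(X):
      """Variance inflation factor for each column of a predictor matrix X."""
      vifs = []
      for j in range(X.shape[1]):
          target = X[:, j]
          others = np.delete(X, j, axis=1)
          # regress X_j on the remaining predictors (with an intercept)
          A = np.column_stack([np.ones(len(target)), others])
          coef, *_ = np.linalg.lstsq(A, target, rcond=None)
          fitted = A @ coef
          r2 = 1 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
          vifs.append(1 / (1 - r2))
      return vifs

  rng = np.random.default_rng(0)
  x1 = rng.normal(size=50)
  x2 = rng.normal(size=50)
  x3 = x1 + rng.normal(scale=0.05, size=50)   # nearly collinear with x1
  print(vif(np.column_stack([x1, x2, x3])))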

Classical multiplicative time series model components:

4 components: trend, seasonal, cyclical, irregular

Trend, T: data taken over long time, persistent increase or decrease

Seasonal, S: regular short-term fluctuations

Cyclical, C: long-term wave-like pattern, repeats every 2-10 years, has phases: peak > contraction > depression > expansion

Irregular, I: random

multiplicative time-series model for annual data: Y=Ti*Ci*Ii

multiplicative time-series model with seasonal component: Y= Ti*Ci*Ii*Si

Smoothing technique:

Moving average

Exponential smoothing: a weighted moving average in which the weights keep decreasing for older data, so more weight is given to recent data

Model selection tips:

Linear: if the first differences are approximately equal

Quadratic: if the second differences are approximately equal

Exponential: if the percentage differences are approximately equal
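A tiny sketch of the differences check on a made-up series:

  values = [5, 8, 11, 14, 17]                  # hypothetical yearly series

  first_diff  = [b - a for a, b in zip(values, values[1:])]
  second_diff = [b - a for a, b in zip(first_diff, first_diff[1:])]
  pct_diff    = [(b - a) / a * 100 for a, b in zip(values, values[1:])]

  print(first_diff)    # roughly constant -> linear trend
  print(second_diff)   # roughly constant -> quadratic trend
  print(pct_diff)      # roughly constant -> exponential trend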
