Every Basic Statistics Concept You Need to Know

1. What is Statistics?

Statistics is the science of collecting, summarizing, and interpreting data. It provides tools to summarize data and draw meaningful conclusions from it, enabling data-driven decisions. Key components include:

  • Descriptive measures to summarize data.
  • The science of uncertainty to understand variability.
  • Techniques to extract insights from data.

1.1 Types of Statistics

  1. Descriptive Statistics: Summarizes and organizes data using measures like mean, median, mode, and visualizations (e.g., histograms). Example: Summarizing website visitor numbers and average time on site.
  2. Inferential Statistics: Uses samples to draw conclusions about a population through methods like hypothesis testing and regression analysis. Example: Assessing whether a new marketing strategy increased traffic.
  3. Predictive Statistics: Employs historical data to forecast future outcomes using models or machine learning. Example: Predicting sales trends based on past data.

1.2 Key Definitions and Notations

  • Population (N): The entire group of people or objects (observations) with a common theme (e.g., all nurses in Ohio hospitals). Collecting data from the entire population is called a census.
  • Sample (n): A subset of the population (e.g., a sample of nurses in Ohio hospitals). It is crucial that the sample is representative, randomly selected, and unbiased, and that its margin of error can be calculated.
  • Sampling Frame: The list of population members from which the sample is actually drawn (e.g., the staff registers of the Ohio hospitals that can be reached). The closer the frame matches the population, the more representative the sample can be.
  • Observation (Individual): A single record or case in the study.
  • Population Parameter: A measure that describes the entire population, not just the sample. The corresponding measure calculated from a sample is called a statistic.
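To make the relationship between sampling frame and sample concrete, here is a minimal sketch of a simple random sample. The frame of 500 ID numbers, the sample size of 50, and the seed are all made up for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers of 500 nurses (made-up data).
sampling_frame = list(range(1, 501))

# A fixed seed makes the random draw reproducible.
rng = random.Random(42)

# Simple random sample without replacement: every ID has the same chance.
sample = rng.sample(sampling_frame, k=50)

frame_size = len(sampling_frame)
n = len(sample)
print(f"frame size = {frame_size}, n = {n}, sampling fraction = {n / frame_size:.0%}")
```

Random selection alone does not guarantee a representative sample, but it removes the selection bias that comes from picking cases by hand.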

1.3 Types of Data

  • Variable: A characteristic of the observation (e.g., age, gender, body size).
  • Categorical/Qualitative Data: - Nominal: No order (e.g., gender, favorite brand). - Ordinal: Ranking or order (e.g., smoking frequency, age range).
  • Numerical/Quantitative Data: - Continuous: Can take any value within a range (e.g., age, height, body weight). - Discrete: Countable values that cannot be meaningfully subdivided (e.g., number of pets, number of rooms). Numerical data is also classified by measurement scale: interval (no true zero, e.g., temperature in °C) or ratio (true zero, e.g., height).

2. Designing a Statistical Study

2.1 Guide

  1. State a hypothesis.
  2. Identify the observations (individuals) of interest.
  3. Specify the variables to measure that relate to the hypothesis.
  4. Determine whether to use the population or a sample.
  5. Ensure data is legally compliant.
  6. Collect the data.
  7. Use descriptive or inferential statistics to test the hypothesis.
  8. Note any concerns about your data collection or analysis and make recommendations for future studies.

2.2 Lurking variable

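Be aware of lurking variables: variables that are not part of the analysis but are associated with both the suspected cause and the condition, so that an association appears even though one does not cause the other. Identify likely lurking variables and include them in your analysis.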

3. Distribution

3.1 Frequency Histogram

A frequency histogram is based on an aggregated frequency chart and looks like a bar chart. It displays how the values of a variable are distributed (e.g., the distribution of body weight across weight groups). It is applicable to ordinal data (e.g., height groups), discrete data, and continuous data (e.g., height); for nominal data such as hair color, use a bar chart instead.

The y-axis shows the frequency (count of cases) and the x-axis shows the variable (e.g., weight in groups of 56–60 kg, 61–65 kg). Make sure that the intervals have the same width.

Relative frequency histogram: Displays the data as percentages (all observations together equal 100%). Useful for comparing groups within the study.

3.2 Types of Distribution

Distribution is the shape that is made if you draw a line along the edges of a histogram’s bars. The shape of the distribution can affect the statistical model to use for analysis.

  • Normal Distribution: Symmetric, with most cases in the middle, resembling a mountain (mode, mean, and median are close to each other).
  • Uniform Distribution: Bars of similar height.
  • Skewed Left Distribution: A long tail of low bars on the left; most values sit on the right.
  • Skewed Right Distribution: A long tail of low bars on the right; most values sit on the left.
  • Bimodal Distribution: Two high points, resembling a camel shape.
  • Multimodal Distribution: Multiple high points.

3.3 Additional Charts and Aspects

  • Bar Charts or Pie Charts: Suitable for categorical variables (e.g., nationality, gender).
  • Time Series: Analyze the evolution of the frequency of your variables to see if specific time periods impact the data.

Outliers: Data values that are very different from the other measurements in the dataset. Investigate them and exclude them only if necessary (e.g., when they turn out to be measurement or entry errors).

3.4 Frequency Table

A frequency table is an aggregated table where each value or class of the variable is shown as a row with its frequency and relative frequency.

Classes (groups of values, e.g., age ranges) are determined empirically and should be based on the scientific literature. A formula can suggest a starting number of classes, but it is not binding.

Usually the number of classes ranges from 5 to 20.
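As an illustration, a frequency table with equal-width classes can be sketched in a few lines. The weights, the starting value of 55 kg, the class width of 5 kg, and the helper name `frequency_table` are assumptions chosen for this example; five classes keeps the table at the lower end of the usual range.

```python
from collections import Counter

# Made-up body weights in kg, used only for illustration.
weights = [56, 58, 59, 61, 62, 62, 64, 66, 67, 68, 69, 71, 72, 74, 75]


def frequency_table(values, start, width, n_classes):
    """Count values in equal-width classes [low, low + width)."""
    counts = Counter()
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_classes:  # values outside all classes are ignored
            counts[index] += 1
    total = sum(counts.values())
    table = []
    for i in range(n_classes):
        low = start + i * width
        freq = counts[i]
        relative = freq / total if total else 0.0
        table.append((low, low + width, freq, relative))
    return table


table = frequency_table(weights, start=55, width=5, n_classes=5)
for low, high, freq, relative in table:
    print(f"[{low}, {high}) kg: frequency {freq:2d}, relative {relative:.1%}")
```

Each row holds the class limits, the frequency, and the relative frequency; plotting the frequencies as adjacent bars gives the frequency histogram, and plotting the relative frequencies gives the relative frequency histogram.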

4. Measures of Central Distribution

4.1 Central Tendencies

  • Mode: The most frequent value (e.g., if 29 is the most common age among football players in Spain, then 29 is the mode).
  • Median: The value in the middle of the sorted data (the 50th percentile).
  • Mean (average): Sum of all values divided by the number of values. - Sample Mean (x̄): Σx / n - Population Mean (μ): Σx / N

The mean is commonly used, but the median is more appropriate when outliers or skew distort the average.

To stabilize the mean against outliers, use the trimmed mean; when values contribute unequally to a result, use a weighted average.

  • Trimmed mean: Remove the top 5% and the bottom 5% of the values, then calculate the mean of the remaining data.
  • Weighted average: Applicable to values such as indexes or grades, where several factors contribute unequally to the result (e.g., homework counts 40% toward the grade and exams count 60%).

Normal distribution: The mean, median, and mode are usually close to each other.

Skewed distribution: The mean, median, and mode are usually further apart, with the mean pulled toward the long tail.
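The measures above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up exam scores; the helpers `trimmed_mean` and `weighted_average` are written for this example and are not library functions. Because the list has only 10 values, the example trims 10% per side (a 5% trim needs at least 20 values before it removes anything).

```python
import statistics

# Made-up exam scores; 100 is a deliberate outlier.
scores = [52, 55, 55, 58, 60, 61, 63, 65, 66, 100]

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)


def trimmed_mean(values, proportion=0.05):
    """Drop `proportion` of the values from each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped per side
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)


def weighted_average(values, weights):
    """Average in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


trimmed = trimmed_mean(scores, proportion=0.10)

# Homework (80 points) counts 40%, exams (90 points) count 60%.
grade = weighted_average([80, 90], [0.4, 0.6])

print(mean, median, mode, trimmed, grade)
```

The outlier pulls the mean above the median, while the trimmed mean lands close to the median again, which is the stabilizing effect described above.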

4.2 Measures of Variation (Dispersion)

Measures of central tendency alone are sometimes not enough to summarize a dataset, because they do not show how spread out the data is. Measures of variation fill that gap; apart from the range, they take all observations into consideration.

What is variation? It describes how much the data varies.

  • Range: Difference between the highest and lowest values.
  • Standard Deviation (s for a sample, σ for a population): Roughly the typical distance of an observation from the mean; it is the square root of the variance. A larger standard deviation indicates greater variability.
  • Variance (s² or σ²): The average squared deviation from the mean; it shows how much the data varies and how well the mean represents the data.
  • Coefficient of Variation (CV): The standard deviation divided by the mean, expressed as a percentage; it shows how much the data varies relative to the mean.
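A short sketch of these measures with the standard `statistics` module, on made-up data. Note the distinction the notation implies: the sample versions (s, s²) divide by n − 1, the population versions (σ, σ²) divide by N.

```python
import statistics

# Made-up observations, used only for illustration.
data = [4, 8, 6, 5, 3, 7, 9]

data_range = max(data) - min(data)

# Sample statistics (divide by n - 1).
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Population parameters (divide by N), if `data` were the whole population.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv_percent = sample_sd / statistics.mean(data) * 100

print(data_range, sample_variance, sample_sd, population_variance, cv_percent)
```

The CV is only meaningful for data measured on a ratio scale with a positive mean.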

4.3 Percentiles and Box-Plots

  • Box Plot: Chart for displaying the spread of the observations.
  • Percentile: The value at or below which a given percentage of the data falls (e.g., 75% of the values lie at or below the 75th percentile).
  • Quartile: Specific set of percentiles. - 1st Quartile: 25th percentile - 2nd Quartile: 50th percentile (median) - 3rd Quartile: 75th percentile
  • Interquartile Range (IQR): Difference between the 3rd and 1st quartiles; it covers the middle 50% of the data and forms the box of the box plot.
  • Outliers: Values lower than Q1 − 1.5 × IQR or higher than Q3 + 1.5 × IQR.
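The quartile and outlier rules above can be sketched as follows, using made-up data. Several conventions exist for computing quartiles; this example uses the default ("exclusive") method of `statistics.quantiles`, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

# Made-up observations; 30 is a deliberate extreme value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 12, 30]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences of the 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

In a box plot, the box spans Q1 to Q3 and the flagged values are typically drawn as individual points beyond the whiskers.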

4.4 Scatter Plot and Linear Correlation

A scatter plot is suitable for two quantitative variables plotted on an x/y-axis (x = independent variable; y = dependent variable). When a relationship exists, the regression line can be used to predict y from x.

It is important to check for outliers.

Direction:

  • Positive correlation: line going up from left to right
  • Negative correlation: line going down from left to right
  • No correlation: no clear upward or downward trend (the fitted line is roughly horizontal)

Strength:

  • How close the dots (observations) are to the line

Direction and Strength: Correlation Coefficient r (Pearson’s r):

  • Only meaningful for linear relationships.
  • Summarizes the direction and strength of a linear correlation in one number; it quantifies how strongly the x,y-pairs are correlated.
  • Ranges from -1.0 to +1.0: -1.0 is a perfect negative correlation, +1.0 is a perfect positive correlation, and 0 means no linear correlation.
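For readers who want to see the calculation, here is a minimal sketch of Pearson's r written out by hand. The study-hours data and the function name `pearson_r` are made up for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r_positive = pearson_r(hours, scores)          # close to +1
r_negative = pearson_r(hours, [10, 8, 6, 4, 2])  # points lie exactly on a falling line

print(round(r_positive, 3), round(r_negative, 3))
```

The sign of r gives the direction and its absolute value gives the strength; a value near 0 only rules out a linear relationship, not a curved one.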

Be aware of lurking variables (take other variables into consideration): just because y correlates with x does not mean that x causes y.

Outliers: Be aware of outliers; investigate them and, if justified, delete the case.

The Coefficient of Determination (r²):

  • How good is the fit of the regression line, i.e., how accurate is the prediction? r² tells you how much better the regression line predicts the value of the dependent variable than the mean of that variable alone does. If the residuals around the regression line (two variables) are smaller than the residuals around the mean (one variable), the regression is the better predictor. The value is read as a percentage: the share of the variation in y that is explained (e.g., 90% explained and 10% unexplained).
  • r vs. r²: r tells the direction and strength of the relationship; r² tells how much better the regression is than the mean, and how much of the variation in the dependent variable (y) is explained by the independent variable (x).
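The comparison between "residuals around the regression line" and "residuals around the mean" can be written out directly. This sketch fits a least-squares line by hand on made-up data; `fit_line` and `r_squared` are illustrative helper names, not library functions.

```python
def fit_line(xs, ys):
    """Least-squares regression line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in y explained by the regression line."""
    mean_y = sum(ys) / len(ys)
    # Squared residuals around the regression line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Squared residuals around the mean of y alone.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

slope, intercept = fit_line(hours, scores)
r2 = r_squared(hours, scores, slope, intercept)
predicted_for_6_hours = intercept + slope * 6

print(f"y = {intercept:.1f} + {slope:.1f}x, r2 = {r2:.2f}")
```

For a simple linear regression with one independent variable, this r² equals the square of Pearson's r for the same data.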

4.5 Empirical Rule

The empirical rule describes how much of the data lies within one, two, or three standard deviations of the mean. It applies only to (approximately) normal distributions.

  • About 68% of the data: within 1 standard deviation of the mean
  • About 95% of the data: within 2 standard deviations of the mean
  • About 99.7% of the data: within 3 standard deviations of the mean
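The three percentages can be checked against the standard normal distribution with `statistics.NormalDist`. This is a small verification sketch; `share_within` is a helper name chosen for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```

The exact shares are slightly different from the rounded 68%, 95%, and 99.7%, which is why the rule is stated as an approximation.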

4.6 Probabilities and Z-Scores

  • Z-Score: Indicates how many standard deviations an observation lies above or below the mean: z = (x − mean) / standard deviation.
  • Probability: The area under the normal curve above or below the z-score (e.g., the probability that a student scores above or below 77 points).
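As a worked example of the 77-point case, assume (purely for illustration) that exam scores are normally distributed with a mean of 70 and a standard deviation of 5. Neither number comes from real data.

```python
from statistics import NormalDist

# Assumed distribution of exam scores (hypothetical values).
mean_score = 70
sd_score = 5
observed = 77

# Z-score: distance from the mean in units of standard deviations.
z = (observed - mean_score) / sd_score

# Area under the standard normal curve below and above the z-score.
p_below = NormalDist().cdf(z)
p_above = 1 - p_below

print(f"z = {z:.2f}, P(score < 77) = {p_below:.3f}, P(score > 77) = {p_above:.3f}")
```

Under these assumptions a score of 77 lies 1.4 standard deviations above the mean, so most students score below it.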

4.7 Inferential statistics

Inferential statistics analyze a sample to draw conclusions about the entire population. A null hypothesis about the population is stated, and the sample data is used to test whether it can be rejected.

P-value: The probability of getting a sample result at least as extreme as the one observed, assuming the null hypothesis is true.

Statistical significance: If the p-value is less than the chosen significance level (commonly 0.05), the result is statistically significant and the null hypothesis is rejected. This means the observed result would be unlikely if the null hypothesis were true; it does not prove that the null hypothesis is false.
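A minimal sketch of how a p-value is obtained, using a two-sided one-sample z-test. This particular test assumes the population standard deviation is known, which is rarely the case in practice (a t-test is the usual alternative); all numbers and the function name `two_sided_z_test` are made up for illustration.

```python
import math
from statistics import NormalDist


def two_sided_z_test(sample_mean, mu0, sigma, n):
    """Z statistic and two-sided p-value for H0: population mean == mu0."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / standard_error
    # Probability of a result at least this extreme in either direction,
    # assuming the null hypothesis is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical study: H0 says the mean score is 70; a sample of 36 averages 72.
z, p_value = two_sided_z_test(sample_mean=72, mu0=70, sigma=5, n=36)

alpha = 0.05  # chosen significance level
reject_null = p_value < alpha

print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

Rejecting the null hypothesis here means only that a sample mean this far from 70 would be unlikely if the true mean were 70.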

4.8 Confidence Intervals

Confidence Interval: An estimated range of values that is likely to contain an unknown population parameter.

Example: A 95% confidence interval is produced by a method that, over many repeated samples, yields intervals containing the population parameter about 95% of the time. In everyday terms, we are 95% confident that the interval contains the parameter.
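The following sketch computes a 95% confidence interval for a mean using the normal approximation (sample mean ± z × standard error). The data is simulated with a fixed seed, and `mean_confidence_interval` is a helper written for this example; for small samples, a t-based interval is the more common choice.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(values)
    center = mean(values)
    standard_error = stdev(values) / math.sqrt(n)
    # Critical z value, about 1.96 for 95% confidence.
    z_critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_critical * standard_error
    return center - margin, center + margin


# Simulated sample of 40 scores (made-up data, fixed seed for reproducibility).
rng = random.Random(0)
sample = [rng.gauss(70, 5) for _ in range(40)]

low, high = mean_confidence_interval(sample, confidence=0.95)
print(f"sample mean = {mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

A higher confidence level or a smaller sample widens the interval; a larger sample narrows it.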

5. Regression Analysis

5.1 Types of Regression

  • Linear Regression: Models the relationship between two variables by fitting a linear equation to observed data.
  • Multiple Regression: Models the relationship between one dependent variable and two or more independent variables.
  • Logistic Regression: Models the probability of a binary outcome; commonly used for binary classification problems.

6. Step-by-Step Checklist for Analysis

6.1 Planning and Preparation

  1. Define the Objective: Clearly state the purpose of the analysis.
  2. State a Hypothesis: Formulate a null hypothesis and an alternative hypothesis.
  3. Identify the Population: Define the population of interest.
  4. Determine the Sampling Frame: Identify the part of the population from which you will draw your sample.
  5. Select the Sample: Ensure the sample is representative, randomly selected, and unbiased.
  6. Specify Variables: Identify the variables to measure that relate to the hypothesis.
  7. Ensure Legal Compliance: Verify that data collection and analysis comply with legal and ethical standards.

6.2 Data Collection

  1. Collect the Data: Gather data from primary or secondary sources.
  2. Document Data Sources: Record the sources of your data for transparency and reproducibility.

6.3 Data Cleaning and Preparation

  1. Remove Duplicates: Ensure there are no duplicate records in the dataset.
  2. Handle Missing Values: Decide how to handle missing data (e.g., imputation, removal).
  3. Correct Errors: Identify and correct any errors or inconsistencies in the data.
  4. Normalize or Standardize Data: Apply necessary transformations to ensure data consistency.
  5. Encode Categorical Variables: Convert categorical variables into numerical format if needed.
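The cleaning steps above can be sketched on a tiny made-up dataset with the standard library alone. The records, the choice of mean imputation, and the one-hot encoding are assumptions for this example; each step has alternatives (e.g., dropping rows with missing values), and error correction (step 3) is omitted because it depends on the data.

```python
from statistics import mean, pstdev

# Hypothetical raw records: (id, age, group); None marks a missing age.
raw = [
    (1, 34, "A"),
    (2, None, "B"),
    (2, None, "B"),  # exact duplicate of the previous record
    (3, 29, "A"),
    (4, 41, "C"),
]

# 1. Remove duplicates, keeping the first occurrence and the original order.
unique = list(dict.fromkeys(raw))

# 2. Handle missing values: impute the mean of the observed ages.
observed_ages = [age for _, age, _ in unique if age is not None]
fill_value = mean(observed_ages)
filled = [(i, age if age is not None else fill_value, g) for i, age, g in unique]

# 4. Standardize ages to z-scores (mean 0, standard deviation 1).
ages = [age for _, age, _ in filled]
mu, sd = mean(ages), pstdev(ages)
z_ages = [(age - mu) / sd for age in ages]

# 5. Encode the categorical variable as one-hot vectors.
categories = sorted({g for _, _, g in filled})
encoded = [[1 if g == c else 0 for c in categories] for _, _, g in filled]

print(unique, z_ages, encoded, sep="\n")
```

In practice a data-frame library is usually used for these steps; the plain-Python version is shown only to make each step explicit.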

6.4 Exploratory Data Analysis (EDA)

  1. Descriptive Statistics: Calculate measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation).
  2. Visualize Data: Create visualizations such as histograms, bar charts, and box plots to understand data distribution.
  3. Identify Outliers: Detect and decide how to handle outliers.

6.5 Statistical Analysis

  1. Determine if There is a Linear Relationship: Use scatter plots to visualize relationships between variables.
  2. Calculate Pearson’s r: Measure the strength and direction of the linear relationship between variables.
  3. Perform Regression Analysis: Determine the regression equation for the regression line.
  4. Calculate the Value of r²: Assess the goodness of fit for the regression model.
  5. Conduct Hypothesis Testing: Perform appropriate tests (e.g., t-test, chi-square test, ANOVA) to test your hypothesis.
  6. Calculate Confidence Intervals: Determine the range within which the population parameter is likely to fall.
  7. Compute P-Values: Assess the statistical significance of your results.

6.6 Advanced Analysis (if applicable)

  1. Perform Advanced Statistical Tests: Conduct additional tests as needed (e.g., logistic regression, multiple regression).
  2. Apply Machine Learning Techniques: Use clustering, classification, or other machine learning methods if relevant.

6.7 Interpretation and Reporting

  1. Interpret Results: Draw conclusions based on your analysis.
  2. Document Findings: Record your findings and any limitations of the study.
  3. Make Recommendations: Provide actionable insights and recommendations based on your analysis.

6.8 Review and Validation

  1. Review Analysis: Double-check your analysis for accuracy and completeness.
  2. Validate Results: Validate your findings with additional data or methods if possible.

6.9 Presentation

  1. Prepare a Report: Create a comprehensive report detailing your methodology, analysis, and findings.
  2. Create Visualizations: Develop clear and informative visualizations to support your report.
  3. Present Findings: Present your findings to stakeholders in a clear and concise manner.

6.10 Post-Analysis

  1. Store Data Securely: Ensure that all data and analysis results are stored securely.
  2. Reflect on the Process: Evaluate the analysis process and identify areas for improvement in future projects.
