Data quality of a variable: Garbage in, garbage out

Dealing with incomplete and missing data is part and parcel of any data project. A data analyst should have a solid understanding of this type of data quality issue and make sure incomplete data is dealt with before commencing any analysis.

Some common issues and challenges

● Administrative errors while sourcing the data, e.g. data truncation or merge/join issues

● Not being able to differentiate erroneous from legitimate outliers, e.g. a salary of $3,000,000 when the median is $80,000

● Not being able to differentiate between missing values and legitimate NULLs/blanks

● Presence of missing values

● Not checking whether the statistics generated from a variable are reliable

● Variables having erroneous values (e.g. some students' subject marks are above 100 even though the maximum achievable mark is 100)

● Variables are censored or not published (e.g. salary of an individual is not published if there are other identifiers in the dataset)

● Variables having unreliable values, e.g. an inherently non-negative variable like height taking negative values

A) How do you resolve these data quality challenges?

● How do you differentiate between missing values and legitimate NULLs/blanks?

  1. Check whether the value is missing for the majority of observations
  2. Check whether the missing values appear as zeros or as blanks

● Discuss the blanks and NULLs with the data custodian – check whether they are a genuine occurrence

● How do you check the quality of a variable? (a pandas sketch of these checks follows the list)

  1. Descriptive statistics with full details such as the minimum, maximum, mean, median, quartiles, standard deviation and variance will give you an idea of whether there is an underlying issue with a variable
  2. Get counts of missing versus non-missing values
  3. If there are labels/classifications present, get a count of each level
  4. If it is a numeric variable, construct a histogram and check the distribution; outliers always warrant investigation
  5. Check whether numerical values fall outside boundary conditions
  6. Check for duplicate records
  7. Check that supposedly unique values (e.g. IDs) really are unique
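As a starting point, here is a minimal pandas sketch of these checks. The input file and the column names (salary, department, marks, student_id) are hypothetical stand-ins:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical input file

# 1. Descriptive statistics: min, max, mean, quartiles, std in one call
print(df["salary"].describe())
print("variance:", df["salary"].var())

# 2. Missing vs non-missing counts (blank strings are NOT the same as NaN)
print(df["salary"].isna().sum(), df["salary"].notna().sum())
print("blank strings:", (df["department"] == "").sum())

# 3. Counts per level of a categorical/label column
print(df["department"].value_counts(dropna=False))

# 4. Histogram of a numeric variable to eyeball the distribution (needs matplotlib)
df["salary"].plot.hist(bins=30)

# 5. Values outside boundary conditions (e.g. marks must lie in [0, 100])
print(df[(df["marks"] < 0) | (df["marks"] > 100)])

# 6. Duplicate records
print(df.duplicated().sum())

# 7. A column that should be unique (e.g. an ID) actually is
print(df["student_id"].is_unique)
```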

B) Can something be done to improve the quality of the variable where there are missing values? 

Missing values and missing variables are frequent problems a data analyst faces during the data cleaning/exploratory phase, and they can often be addressed with 'imputation' techniques.

Missing data is quite common, especially:

  • when analysing survey results (refusal to answer the survey as a whole or a specific question, i.e. non-response)
  • in most scientific research domains, such as biology, medicine and climate science, due to mishandled samples, a low signal-to-noise ratio, measurement error, non-response or deleted values

C) Before applying imputation, it is important to understand why data goes missing

Missing Completely at Random (MCAR)

  • The missingness is not related to any other variable in the dataset
  • The missingness is not related to the variable's own (unobserved) values
  • 'Missingness' is completely unsystematic
  • The only missing-data mechanism that can be verified
  • Example – data is lost when survey forms are lost in transit; the missingness is completely random
  • Can be tested by separating the missing and non-missing cases and comparing the characteristics of the two groups (a sketch follows this list)
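One informal way to do this in practice, sketched below with hypothetical file and column names: split the rows on whether `salary` is missing and compare another variable across the two groups. (A formal alternative is Little's MCAR test.)

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")          # hypothetical input
missing = df[df["salary"].isna()]
observed = df[df["salary"].notna()]

# Compare group characteristics on another variable; under MCAR the two
# groups should look alike (here via a two-sample t-test on age).
t, p = stats.ttest_ind(missing["age"].dropna(), observed["age"].dropna())
print(f"t = {t:.2f}, p = {p:.3f}")      # a small p-value argues against MCAR
```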

Missing at Random (MAR)

  • The missingness is related to other observed variables in the dataset
  • The missingness is not related to the variable's own (unobserved) values
  • Cannot be verified, because we cannot confirm whether the missingness is due purely to other measured variables
  • Example – missing salary values for younger people; the missingness is a function of age

Missing Not at Random (MNAR)

  • The missingness is related to the unobserved value itself (even after controlling for other variables)
  • Example – low-salaried respondents not answering the salary question
  • Impossible to verify this type of missingness without knowing the missing values themselves
  • In the first two cases it can be acceptable to drop observations with missing values, depending on how often they occur, but in the third case removing observations can introduce bias into the model you are building.

D) What are the different imputation techniques?

Below is a summary of the different kinds of imputation techniques:

Deletion

Deleting Rows

  • Removes every row that has one or more missing values (listwise deletion) – see the sketch below
  • Effective where the percentage of 'missingness' is low
  • Disadvantageous in most cases: it can silently introduce bias if the reason for the 'missingness' is not MCAR
  • It also reduces sample size and statistical power
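In pandas, listwise deletion is a one-liner; a tiny illustrative sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "salary": [50000, 60000, np.nan, 70000]})
complete = df.dropna()                             # drop every row containing any NaN
print(f"kept {len(complete)} of {len(df)} rows")   # kept 2 of 4 rows
```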

Pairwise Deletion

  • Incomplete cases are deleted on an analysis-by-analysis basis
  • It assumes MCAR
  • Sometimes better than deleting rows
  • An example follows (sketched in code below)
  • Observations 3 and 4 can be used to find the covariance between Age and Var1, but cases 2, 3 and 4 will be used to find the covariance between Var1 and Var2
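pandas behaves this way by default: `DataFrame.cov()` and `DataFrame.corr()` compute each pairwise statistic from the rows where both columns are observed. A small sketch mirroring the example above, with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":  [np.nan, np.nan, 30.0, 35.0],   # observed for cases 3 and 4 only
    "Var1": [np.nan, 1.0, 2.0, 3.0],        # observed for cases 2, 3 and 4
    "Var2": [5.0, 6.0, 7.0, 8.0],           # fully observed
})

# Each cell of the covariance matrix uses a different subset of rows:
# cov(Age, Var1) uses cases 3-4; cov(Var1, Var2) uses cases 2-4.
print(df.cov())
```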

Deleting Columns

  • Drop a variable when data is missing for a large proportion of observations – see the sketch below
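A sketch using pandas' `dropna(axis=1, thresh=...)`; the 70% threshold here is an arbitrary assumption, not a rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"keep": [1, 2, 3, 4],
                   "mostly_missing": [np.nan, np.nan, np.nan, 4]})
min_obs = int(0.7 * len(df))                   # require roughly 70% observed values
trimmed = df.dropna(axis=1, thresh=min_obs)    # drops columns below the threshold
print(trimmed.columns.tolist())                # ['keep']
```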

Imputation

Time-series

Data without trend and seasonality

Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB): this is a common statistical approach to analysing longitudinal repeated-measures data where some follow-up observations may be missing. Longitudinal data track the same sample at different points in time. Both methods can introduce bias into an analysis and perform poorly when the data has a visible trend.
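Both are one-liners in pandas; a minimal sketch on a toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])
print(s.ffill())   # LOCF: 1, 1, 1, 4, 4, 6
print(s.bfill())   # NOCB: 1, 4, 4, 4, 6, 6
```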

Data with the trend but no seasonality

Linear Interpolation: this method works well for a time series with a trend, but it is not suitable for seasonal data.
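A minimal pandas sketch of linear interpolation on a toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0])
print(s.interpolate(method="linear"))   # fills the gaps as 12, 14 and 18
```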

Data with a trend and with seasonality

Seasonal Adjustment + Linear Interpolation - This method works well for data with both trend and seasonality 
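A sketch of one way to combine the two steps, using statsmodels' `seasonal_decompose` on a synthetic monthly series; the series itself, the period of 12, the additive model and the provisional fill are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with a trend, annual seasonality and a few gaps
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
s = pd.Series(np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12), index=idx)
s.iloc[[5, 17, 30]] = np.nan

rough = s.interpolate(method="linear")   # provisional fill so the decomposition can run
seasonal = seasonal_decompose(rough, model="additive", period=12).seasonal

deseasonalised = s - seasonal            # seasonally adjust; the gaps stay NaN
filled = deseasonalised.interpolate(method="linear") + seasonal  # interpolate, re-add seasonality
```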

General Problems

Categorical

  • Mode imputation is one method, but it distorts the category distribution and will introduce bias
  • Missing values can be treated as a separate category in their own right: we simply create another level for the missing values. This is the simplest method.
  • Prediction models: here we build a predictive model to estimate values that will substitute for the missing data. We divide the dataset into two sets: one with no missing values for the variable (training) and one with missing values (test). Methods such as logistic regression and ANOVA can be used for prediction – a sketch follows this list.
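A sketch of the prediction-model approach using scikit-learn's logistic regression. The input file, the `department` target and the predictor columns are hypothetical, and the predictors are assumed to be fully observed:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("survey.csv")                 # hypothetical input
features = ["age", "salary"]                   # assumed fully-observed predictors

train = df[df["department"].notna()]           # rows where the category is known
test = df[df["department"].isna()]             # rows to impute

model = LogisticRegression(max_iter=1000).fit(train[features], train["department"])
df.loc[df["department"].isna(), "department"] = model.predict(test[features])

# Mode imputation, by contrast, is a one-liner:
# df["department"] = df["department"].fillna(df["department"].mode()[0])
```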

Continuous

  • Linear Regression: to begin, several predictors of the variable with missing values are identified using a correlation matrix. The best predictors are selected and used as independent variables in a regression equation, with the variable with missing data as the dependent variable (a sketch follows this list).
  • Mean, Median and Mode: computing the overall mean, median or mode is a very basic imputation method; it takes no advantage of time-series characteristics or of relationships between variables. It is very fast but has clear disadvantages – mean imputation, for example, shrinks the variance of the dataset.
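A sketch of regression imputation with scikit-learn. The input file and columns are hypothetical, and the predictors (chosen, say, from a correlation matrix) are assumed complete; median imputation is shown for contrast:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("survey.csv")                 # hypothetical input
predictors = ["age", "experience"]             # selected via a correlation matrix

# Fit on rows where salary is observed, then predict the missing ones
known = df[df["salary"].notna()]
model = LinearRegression().fit(known[predictors], known["salary"])
mask = df["salary"].isna()
df.loc[mask, "salary"] = model.predict(df.loc[mask, predictors])

# Median imputation, by contrast, is a one-liner (but it shrinks the variance):
# df["salary"] = df["salary"].fillna(df["salary"].median())
```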

This article is intended as a starting point for budding data analysts to think about building some rigorous cleaning mechanisms into their routine. I am sure I have missed many points here, so feel free to comment below with your thoughts.

Take care.
