Data quality of a variable: Garbage in, garbage out

Dealing with incomplete and missing data is part and parcel of any data project. A data analyst should have a solid understanding of this type of data quality issue and make sure incomplete data is dealt with before commencing any analysis.

Some common issues and challenges

● Administrative errors while sourcing the data, e.g. data truncation or merge/join issues

● Not being able to differentiate erroneous from legitimate outliers, e.g. a salary of $3,000,000 when the median is $80,000

● Not being able to differentiate between missing values and legitimate NULLs/blanks

● Presence of missing values

● Not checking whether the statistics generated from a variable are reliable

● Variables having erroneous values (e.g. some students' subject marks are above 100 even though the maximum achievable mark is 100)

● Variables are censored or not published (e.g. salary of an individual is not published if there are other identifiers in the dataset)

● Variables having unreliable values, e.g. an inherently non-negative variable like height taking negative values

A) How do you resolve these data quality challenges?

● How do you differentiate between missing values and legitimate NULLs/blanks?

  1. Check whether the value is missing for the majority of observations
  2. Check whether the missing values appear as zeros or as blanks

● Discuss the blanks and NULLs with the data custodian – check whether they are a genuine occurrence

● How do you check the quality of a variable? (a pandas sketch of these checks follows the list)

  1. Descriptive statistics with full details such as the minimum, maximum, mean, median, quartiles, standard deviation and variance will give you an idea of whether there is an underlying issue with a variable
  2. Get counts of missing versus non-missing values
  3. If there are labels/classifications present, get a count of each level
  4. If it is a numeric variable, construct a histogram and check the distribution; outliers always warrant investigation
  5. Check whether numerical values fall outside boundary conditions
  6. Check for duplicate records
  7. Check that supposedly unique values (e.g. IDs) really are unique
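As a starting point, here is a minimal pandas sketch of these checks. The input file and the column names (salary, department, marks, student_id) are hypothetical stand-ins:

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical input file

# 1. Descriptive statistics: min, max, mean, quartiles, std in one call
print(df["salary"].describe())
print("variance:", df["salary"].var())

# 2. Missing vs non-missing counts (blank strings are NOT the same as NaN)
print(df["salary"].isna().sum(), df["salary"].notna().sum())
print("blank strings:", (df["department"] == "").sum())

# 3. Counts per level of a categorical/label column
print(df["department"].value_counts(dropna=False))

# 4. Histogram of a numeric variable to eyeball the distribution (needs matplotlib)
df["salary"].plot.hist(bins=30)

# 5. Values outside boundary conditions (e.g. marks must lie in [0, 100])
print(df[(df["marks"] < 0) | (df["marks"] > 100)])

# 6. Duplicate records
print(df.duplicated().sum())

# 7. A column that should be unique (e.g. an ID) actually is
print(df["student_id"].is_unique)
```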

B) Can something be done to improve the quality of the variable where there are missing values? 

Missing values and missing variables are frequent problems a data analyst faces during the data cleaning/exploratory phase, and they can often be addressed with 'imputation' techniques.

Missing data is quite common, especially:

  • when analysing survey results (refusal to answer the survey as a whole or a specific question, i.e. non-response)
  • in most scientific research domains, such as biology, medicine and climate science, due to mishandled samples, a low signal-to-noise ratio, measurement error, non-response or deleted values

C) Before applying imputation, it is important to understand why data goes missing

Missing Completely at Random (MCAR)

  • The missingness is not related to any other variable in the dataset
  • The missingness is not related to the variable's own (unobserved) values
  • 'Missingness' is completely unsystematic
  • The only missing-data mechanism that can be verified
  • Example – data is lost when survey forms are lost in transit; the missingness is completely random
  • Can be tested by separating the missing and non-missing cases and comparing the characteristics of the two groups (a sketch follows this list)
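One informal way to do this in practice, sketched below with hypothetical file and column names: split the rows on whether `salary` is missing and compare another variable across the two groups. (A formal alternative is Little's MCAR test.)

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("survey.csv")          # hypothetical input
missing = df[df["salary"].isna()]
observed = df[df["salary"].notna()]

# Compare group characteristics on another variable; under MCAR the two
# groups should look alike (here via a two-sample t-test on age).
t, p = stats.ttest_ind(missing["age"].dropna(), observed["age"].dropna())
print(f"t = {t:.2f}, p = {p:.3f}")      # a small p-value argues against MCAR
```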

Missing at Random (MAR)

  • The missingness is related to other observed variables in the dataset
  • The missingness is not related to the variable's own (unobserved) values
  • Cannot be verified, because we cannot confirm whether the missingness is due purely to other measured variables
  • Example – missing salary values for younger people; the missingness is a function of age

Missing Not at Random (MNAR)

  • The missingness is related to the unobserved value itself (even after controlling for other variables)
  • Example – low-salaried respondents not answering the salary question
  • Impossible to verify this type of missingness without knowing the missing values themselves
  • In the first two cases it can be acceptable to drop observations with missing values, depending on how often they occur, but in the third case removing observations can introduce bias into the model you are building.

D) What are the different imputation techniques?

Below is a summary of the different kinds of imputation techniques:

Deletion

Deleting Rows

  • Removes every row that has one or more missing values (listwise deletion) – see the sketch below
  • Effective where the percentage of 'missingness' is low
  • Disadvantageous in most cases: it can silently introduce bias if the reason for the 'missingness' is not MCAR
  • It also reduces sample size and statistical power
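In pandas, listwise deletion is a one-liner; a tiny illustrative sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "salary": [50000, 60000, np.nan, 70000]})
complete = df.dropna()                             # drop every row containing any NaN
print(f"kept {len(complete)} of {len(df)} rows")   # kept 2 of 4 rows
```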

Pairwise Deletion

  • Incomplete cases are deleted on an analysis-by-analysis basis
  • It assumes MCAR
  • Sometimes better than deleting rows
  • An example follows (sketched in code below)
  • Observations 3 and 4 can be used to find the covariance between Age and Var1, but cases 2, 3 and 4 will be used to find the covariance between Var1 and Var2
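pandas behaves this way by default: `DataFrame.cov()` and `DataFrame.corr()` compute each pairwise statistic from the rows where both columns are observed. A small sketch mirroring the example above, with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age":  [np.nan, np.nan, 30.0, 35.0],   # observed for cases 3 and 4 only
    "Var1": [np.nan, 1.0, 2.0, 3.0],        # observed for cases 2, 3 and 4
    "Var2": [5.0, 6.0, 7.0, 8.0],           # fully observed
})

# Each cell of the covariance matrix uses a different subset of rows:
# cov(Age, Var1) uses cases 3-4; cov(Var1, Var2) uses cases 2-4.
print(df.cov())
```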

Deleting Columns

  • Drop a variable when data is missing for a large proportion of observations – see the sketch below
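A sketch using pandas' `dropna(axis=1, thresh=...)`; the 70% threshold here is an arbitrary assumption, not a rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"keep": [1, 2, 3, 4],
                   "mostly_missing": [np.nan, np.nan, np.nan, 4]})
min_obs = int(0.7 * len(df))                   # require roughly 70% observed values
trimmed = df.dropna(axis=1, thresh=min_obs)    # drops columns below the threshold
print(trimmed.columns.tolist())                # ['keep']
```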

Imputation

Time-series

Data without trend and seasonality

Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB): this is a common statistical approach to analysing longitudinal repeated-measures data where some follow-up observations may be missing. Longitudinal data track the same sample at different points in time. Both methods can introduce bias into an analysis and perform poorly when the data has a visible trend.
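Both are one-liners in pandas; a minimal sketch on a toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])
print(s.ffill())   # LOCF: 1, 1, 1, 4, 4, 6
print(s.bfill())   # NOCB: 1, 4, 4, 4, 6, 6
```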

Data with the trend but no seasonality

Linear Interpolation: this method works well for a time series with a trend, but it is not suitable for seasonal data.
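A minimal pandas sketch of linear interpolation on a toy series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0])
print(s.interpolate(method="linear"))   # fills the gaps as 12, 14 and 18
```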

Data with a trend and with seasonality

Seasonal Adjustment + Linear Interpolation - This method works well for data with both trend and seasonality 
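A sketch of one way to combine the two steps, using statsmodels' `seasonal_decompose` on a synthetic monthly series; the series itself, the period of 12, the additive model and the provisional fill are all assumptions for illustration:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series with a trend, annual seasonality and a few gaps
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
s = pd.Series(np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12), index=idx)
s.iloc[[5, 17, 30]] = np.nan

rough = s.interpolate(method="linear")   # provisional fill so the decomposition can run
seasonal = seasonal_decompose(rough, model="additive", period=12).seasonal

deseasonalised = s - seasonal            # seasonally adjust; the gaps stay NaN
filled = deseasonalised.interpolate(method="linear") + seasonal  # interpolate, re-add seasonality
```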

General Problems

Categorical

  • Mode imputation is one method, but it distorts the category distribution and will introduce bias
  • Missing values can be treated as a separate category in their own right: we simply create another level for the missing values. This is the simplest method.
  • Prediction models: here we build a predictive model to estimate values that will substitute for the missing data. We divide the dataset into two sets: one with no missing values for the variable (training) and one with missing values (test). Methods such as logistic regression and ANOVA can be used for prediction – a sketch follows this list.
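A sketch of the prediction-model approach using scikit-learn's logistic regression. The input file, the `department` target and the predictor columns are hypothetical, and the predictors are assumed to be fully observed:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("survey.csv")                 # hypothetical input
features = ["age", "salary"]                   # assumed fully-observed predictors

train = df[df["department"].notna()]           # rows where the category is known
test = df[df["department"].isna()]             # rows to impute

model = LogisticRegression(max_iter=1000).fit(train[features], train["department"])
df.loc[df["department"].isna(), "department"] = model.predict(test[features])

# Mode imputation, by contrast, is a one-liner:
# df["department"] = df["department"].fillna(df["department"].mode()[0])
```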

Continuous

  • Linear Regression: to begin, several predictors of the variable with missing values are identified using a correlation matrix. The best predictors are selected and used as independent variables in a regression equation, with the variable with missing data as the dependent variable (a sketch follows this list).
  • Mean, Median and Mode: computing the overall mean, median or mode is a very basic imputation method; it takes no advantage of time-series characteristics or of relationships between variables. It is very fast but has clear disadvantages – mean imputation, for example, shrinks the variance of the dataset.
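A sketch of regression imputation with scikit-learn. The input file and columns are hypothetical, and the predictors (chosen, say, from a correlation matrix) are assumed complete; median imputation is shown for contrast:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("survey.csv")                 # hypothetical input
predictors = ["age", "experience"]             # selected via a correlation matrix

# Fit on rows where salary is observed, then predict the missing ones
known = df[df["salary"].notna()]
model = LinearRegression().fit(known[predictors], known["salary"])
mask = df["salary"].isna()
df.loc[mask, "salary"] = model.predict(df.loc[mask, predictors])

# Median imputation, by contrast, is a one-liner (but it shrinks the variance):
# df["salary"] = df["salary"].fillna(df["salary"].median())
```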

This article is intended as a starting point for budding data analysts to think about building some rigorous cleaning mechanisms into their routine. I am sure I have missed many points here, so feel free to comment below with your thoughts.

Take care.
