How do you choose the appropriate imputation method for missing values in categorical data?

Powered by AI and the LinkedIn community

Top experts in this article

Selected by the community from 21 contributions.

Abhishek Das

Manager@PwC | Data Scientist | LinkedIn Top Voice | Mentor
Jae Lee

Data Scientist | SAS Programmer | Clinical Programmer - Healthcare Domain (SAS, Python, R)
Sachin Sirohi

Analytics Manager| Driving Business Impact with Advanced Analytics & Predictive Modeling

1 Type of categorical data

There are two main types of categorical data: nominal and ordinal. Nominal data have no inherent order or ranking, such as color, gender, or country. Ordinal data have a meaningful order or ranking, such as education level, satisfaction rating, or income group. The type of categorical data affects the choice of imputation method because some methods preserve the order of the categories, while others do not. For example, mean or median imputation is not suitable for nominal data, because it requires assigning numerical values to categories that have no numerical meaning. Mode imputation works for both types, but on ordinal data it ignores the ranking of the categories, so a method that respects the order, such as imputing the median category, is often a better fit.
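The distinction can be illustrated with a minimal sketch using only the Python standard library. The column values, the `levels` ordering, and the function names are hypothetical, and missing entries are assumed to be represented as `None`.

```python
from collections import Counter

def impute_mode(values):
    """Fill None with the most frequent observed category (order-free)."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_ordinal_median(values, order):
    """Fill None with the median observed category, given an explicit ordering."""
    rank = {cat: i for i, cat in enumerate(order)}
    ranks = sorted(rank[v] for v in values if v is not None)
    # Use the lower median so the result is always an existing category.
    median_cat = order[ranks[(len(ranks) - 1) // 2]]
    return [median_cat if v is None else v for v in values]

# Hypothetical nominal column: no ordering, so the mode is the natural choice.
colors = ["red", "blue", None, "red", "green"]

# Hypothetical ordinal column with its ordering stated explicitly.
levels = ["high school", "bachelor", "master", "phd"]
education = ["high school", "master", None, "bachelor", "bachelor", "phd"]

filled_colors = impute_mode(colors)
filled_education = impute_ordinal_median(education, levels)
```

The median variant needs the ordering passed in explicitly, which is a useful check in itself: if no sensible `order` list can be written down, the variable is probably nominal.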

Jae Lee

Data Scientist | SAS Programmer | Clinical Programmer - Healthcare Domain (SAS, Python, R)
Although the two main types of categorical data are nominal and ordinal, binary data should also be included in this piece. There are different techniques for binary data imputation.

Soumya Banerjee

Data Wrangler using Python, Dataiku, SAS - at Societe Generale Global Solution Centre
The choice of imputation method depends heavily on the business use case. Some use cases may call for the most common category as the replacement, while in others the missing values are likely to represent an entirely new category. One may also consider dropping the rows with missing categorical values, provided this doesn't lead to a significant loss of information. It all depends on the problem at hand.

Zarnab Asad

Digital Marketing Expert || Data Visualization with Power BI || Website Creation and Development | SEO Specialist || Actuarial Graduate || Data Analyst || Freelancer || 12x Top Voices
The two primary forms of categorical data are ordinal and nominal. **Nominal data** refers to classifications like color, gender, and nation that aren't ranked or in any particular order. On the other hand, **ordinal data** have a meaningful ranking or order, such as income categories, satisfaction ratings, or educational attainment. The choice of imputation techniques is heavily influenced by the kind of categorical data, since certain techniques maintain the order while others do not. For example, since mean or median imputation gives numerical values to non-numerical categories, it is improper for nominal data. Similarly, because mode imputation ignores the significance of category ranking, it might not be appropriate for ordinal data.

Bhanu Pratap Singh

Engineer at Airbus | ExBoeing | SAFe® Practitioner
The article effectively distinguishes nominal and ordinal data types but doesn't mention the importance of considering the data's context and domain-specific knowledge when choosing appropriate imputation methods, leaving room for misapplication.

Abhratanu Majumder

Data Analyst | MSc Business Analytics, University of Bath | Expert in Analytics & Visualization | PGT Student Ambassador | Ex-Byju's | Data Consultant, TopGlobe | Seed Investor, SynBiogenesis
When choosing how to fill missing categorical data, consider interaction effects between variables. These occur when the relationship between two variables depends on a third. For instance, missing income data can affect customer segmentation accuracy based on age and education levels. Methods like multiple imputation, which account for uncertainty by generating plausible values considering interactions, ensure more accurate analyses.

2 Proportion of missing values

The proportion of missing values affects the choice of imputation method, because some methods introduce more bias or variance than others depending on how much information is missing. For example, mode imputation is a simple and fast method that replaces missing values with the most frequent category in the data. But if the proportion of missing values is high, mode imputation inflates the share of that one category, creating a false impression of homogeneity and reducing the variability of the data. Random imputation, by contrast, replaces each missing value with a category drawn at random from the observed values. This roughly preserves the observed category frequencies, but it ignores relationships with other variables and adds random noise, which is unnecessary when only a few values are missing.
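A small sketch can make the effect of a high missing proportion concrete. It uses only the standard library; the toy column (60% "A" among observed values, 40 of 100 entries missing) and all function names are assumptions for illustration.

```python
import random
from collections import Counter

def impute_mode(values):
    """Fill None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def impute_random(values, rng):
    """Fill None with a category drawn at random from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

def share(values, category):
    """Fraction of entries equal to `category`."""
    return values.count(category) / len(values)

# Toy column: 36 "A" and 24 "B" observed (60% / 40%), plus 40 missing entries.
column = ["A"] * 36 + ["B"] * 24 + [None] * 40

rng = random.Random(0)  # fixed seed so the random draw is repeatable
mode_filled = impute_mode(column)
random_filled = impute_random(column, rng)

observed_share = 36 / 60                  # share of "A" among observed values
mode_share = share(mode_filled, "A")      # every gap becomes "A": 76 of 100
random_share = share(random_filled, "A")  # depends on the draw, near 0.60 on average
```

With 40% missing, mode imputation pushes the share of "A" from 0.60 to 0.76, while random draws keep it close to the observed share at the cost of run-to-run noise.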

Abhishek Das

Manager@PwC | Data Scientist | LinkedIn Top Voice | Mentor
Random imputation might not be a good imputation technique, as the sole objective is to impute with values that are as close to the real world as possible. Even when the proportion of missing values is high, random imputation won't make sense.

Anshum D.

SQL Database Administrator | Data Analytics | Transforming Business Processes through Database Management & Analytics
If the proportion of missing values is small (e.g., less than 5%), simple imputation methods like mode or median can work well. For larger proportions, consider more sophisticated methods like multiple imputation.

Sairam Adithya

Research Intern @Avignon Universite| M.Tech AI&ML @SYMBIOSIS| Biomedical engineer| Predictive maintenance | Medical Imaging| Research writer
If the proportion of missing values is very small and the dataset is large, dropping the rows with missing values will not matter much. If that is not the case, imputation is the second option. Since the data are categorical, imputation with the mode would be the better choice.

3 Distribution of the data

The distribution of the data affects the choice of imputation method, because some methods are more sensitive to outliers or skewness than others and can change the shape and spread of the data. For example, mean or median imputation replaces missing values with the average or middle value of the data, which for categorical data is meaningful only when ordinal categories are coded as numbers. If those codes are skewed or contain outliers, mean imputation in particular can shift the central tendency and misrepresent the data; the median is more robust, but either method piles imputed values onto a single point and shrinks the spread. Likewise, k-nearest neighbors (KNN) imputation replaces a missing value with the most common category among the most similar records, based on some distance metric. However, if the data are sparse or high-dimensional, distances become less informative, so KNN imputation can be slow and inaccurate and can fail to capture the true similarity between records.
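As a rough sketch of KNN imputation for a categorical column, the following uses Hamming distance (the number of mismatching features) and a majority vote among the `k` nearest complete rows. It is a standard-library illustration under stated assumptions, not a library implementation: the column layout, the data, and the function names are hypothetical, and only the target column is assumed to contain missing values.

```python
from collections import Counter

def hamming(a, b):
    """Number of feature positions where two rows disagree."""
    return sum(x != y for x, y in zip(a, b))

def knn_impute(rows, target, k=3):
    """Fill a missing rows[i][target] with the majority category among the
    k complete rows that are closest on the remaining features."""
    def features(row):
        return [v for j, v in enumerate(row) if j != target]

    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        new_row = list(row)
        if row[target] is None:
            # sorted() is stable, so ties keep the original row order.
            nearest = sorted(
                donors, key=lambda d: hamming(features(row), features(d))
            )[:k]
            new_row[target] = Counter(d[target] for d in nearest).most_common(1)[0][0]
        filled.append(new_row)
    return filled

# Hypothetical columns: (region, plan, device); device is missing in the last row.
rows = [
    ("north", "basic", "mobile"),
    ("north", "basic", "mobile"),
    ("north", "premium", "desktop"),
    ("south", "premium", "desktop"),
    ("south", "premium", "desktop"),
    ("north", "basic", None),
]
result = knn_impute(rows, target=2, k=3)
```

Here the three nearest donors to the incomplete row vote "mobile", "mobile", "desktop", so the gap is filled with "mobile". With many sparse features, most rows end up at similar Hamming distances, which is the failure mode described above.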

Sachin Sirohi

Analytics Manager| Driving Business Impact with Advanced Analytics & Predictive Modeling
Cleaning the data is an important part of the preprocessing step, especially when there is missing data. When there are gaps or missing numbers in a dataset, it is very important to clean the data. It means finding these missing data points and deciding how to deal with them based on good information. Different methods can be used depending on the type of dataset and the missing data. For example, imputation fills in missing values using statistical methods or the values of close data points. Removal gets rid of incomplete records. Data cleaning makes sure that the dataset is as full and correct as it can be. This makes it possible for more accurate and reliable data science and AI modeling and analysis.

Anshum D.

SQL Database Administrator | Data Analytics | Transforming Business Processes through Database Management & Analytics
Consider the distribution of the categorical variable. If it’s highly skewed, imputing with the mode may not be appropriate. In such cases, consider using a separate category for missing values.

4 Purpose of the analysis

The purpose of the analysis affects the choice of imputation method because some methods are better suited to certain types of analysis than others, and the choice can influence the validity and reliability of the results. For example, regression imputation replaces missing values with the values predicted by a regression model (for a categorical variable, typically a logistic or multinomial model) fitted on some explanatory variables. But if the purpose of the analysis is to explore the relationships between the variables, regression imputation can be a problem: the imputed values fall exactly on the fitted model, which overstates the association between the variables and understates uncertainty, and can therefore bias coefficient estimates and inference. Similarly, multiple imputation replaces each missing value with several plausible values drawn from a probabilistic model. However, if the purpose of the analysis is simple descriptive statistics or visualization, multiple imputation can be unnecessarily complex, because it requires additional steps and assumptions to pool the results.
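The extra pooling step can be sketched as follows. This is a deliberately simplified illustration of the idea, not a full multiple imputation procedure: it ignores other variables, it bootstraps the observed values as a crude stand-in for a probabilistic model, and it pools only the point estimate (by averaging) without computing the pooled variance. The data and function names are hypothetical.

```python
import random
from statistics import mean

def impute_once(values, rng):
    """One stochastic imputation: resample the observed values with
    replacement, then draw each missing entry from that resample."""
    observed = [v for v in values if v is not None]
    resample = [rng.choice(observed) for _ in observed]
    return [rng.choice(resample) if v is None else v for v in values]

def pooled_share(values, category, m=20, seed=0):
    """Estimate the share of `category` by averaging over m imputed copies."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(m):
        completed = impute_once(values, rng)
        estimates.append(completed.count(category) / len(completed))
    return mean(estimates), estimates

# Hypothetical survey answers: 30 "yes", 20 "no", 10 missing.
answers = ["yes"] * 30 + ["no"] * 20 + [None] * 10
pooled, per_copy = pooled_share(answers, "yes")
```

The spread of `per_copy` gives a rough sense of how much the estimate depends on the imputed values, which a single imputation hides.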

Sachin Sirohi

Analytics Manager| Driving Business Impact with Advanced Analytics & Predictive Modeling
Aligning the way you look at the data with why you're looking at it is very important for getting useful insights. Make the purpose of the analysis clear, understand the type of data you have, and then choose the right methods, such as descriptive statistics, regression, clustering, or others. Following principles and preprocessing data makes sure that it is useful. Pick metrics and visualizations that fit the purpose of the study, and understand the results in the context of the research. If necessary, repeat and improve the research, and use validation methods to make sure the results are correct. In the end, making results clear highlights how relevant and helpful they are, which helps people make smart decisions based on data insights.

Anshum D.

SQL Database Administrator | Data Analytics | Transforming Business Processes through Database Management & Analytics
The imputation method should align with the research question. For exploratory analysis, simple methods may suffice. However, for predictive modeling, consider more advanced techniques like regression-based imputation or machine learning models.

5 Here’s what else to consider

This is a space to share examples, stories, or insights that don’t fit into any of the previous sections. What else would you like to add?

Jae Lee

Data Scientist | SAS Programmer | Clinical Programmer - Healthcare Domain (SAS, Python, R)
If the dataset is large enough and the proportion of missing values is small, complete deletion of the rows with missing values should also be considered. Additionally, the categorical variable itself can be dropped if it is not a significant feature.

Beatrice Waihenya

Data Science | Data Analysis | Technical Writer
Several factors come into play when looking for the best imputation method for categorical data. I use this guide:
1. Understand the nature of the missing data. Is it MAR, MCAR, or MNAR?
2. Check the percentage of missing data. If it's <5%, simpler methods may suffice. If it's 5-20%, sophisticated techniques are needed. If it's >20%, consider whether imputation is appropriate at all.
3. Check variable importance: key variables may require advanced methods.

Common imputation methods for categorical data:
* Mode imputation: simple, but can introduce bias. Suitable for MCAR data with a low missing percentage
* Dummy variable approach
* Random sampling: maintains distribution
* Hot deck: preserves relationships
* KNN: works well with MAR
* Multiple imputation techniques
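The percentage rule of thumb in step 2 can be written down as a small helper. This is only a sketch of the thresholds stated above (under 5%, 5-20%, over 20%); the function names and the returned labels are hypothetical, and missing entries are assumed to be `None`.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def suggest_strategy(values):
    """Map the missing fraction to the rule-of-thumb tiers from the guide."""
    frac = missing_fraction(values)
    if frac < 0.05:
        return "simple"       # e.g. mode imputation
    if frac <= 0.20:
        return "advanced"     # e.g. KNN or multiple imputation
    return "reconsider"       # imputation may not be appropriate at all

# Hypothetical columns with 3%, 10%, and 30% missing values.
low = ["a"] * 97 + [None] * 3
medium = ["a"] * 90 + [None] * 10
high = ["a"] * 70 + [None] * 30

suggestions = [suggest_strategy(col) for col in (low, medium, high)]
```

The thresholds are a starting point only; the missingness mechanism (step 1) and the importance of the variable (step 3) can override them.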

Sairam Adithya

Research Intern @Avignon Universite| M.Tech AI&ML @SYMBIOSIS| Biomedical engineer| Predictive maintenance | Medical Imaging| Research writer
Predictive filling is a more complex, machine-learning-based approach to imputation. In this method, we train on the rows where the value is present, choosing a suitable regression algorithm (or, for a categorical target, a classification algorithm) based on the complexity of the data. Once trained, the model is used to predict the missing values in the remaining rows. In this way the imputation is performed and the null values are filled.

Gautam Dhall

Data Scientist | MS Data Science @Columbia University | ML Researcher | 2x Kaggle Master | Ex- SDE @BofA
When dealing with missing values in time series models, one can use either forward fill or backward fill imputation. With forward fill, a missing value is replaced by the previous non-missing value, whereas with backward fill it is replaced by the next non-missing value. Other imputation methods, such as MICE, KNN, or random forest imputation, can be used when dealing with other types of data.
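Forward and backward fill are simple enough to sketch in a few lines of plain Python. The `status` series is hypothetical, missing entries are assumed to be `None`, and the values are assumed to be in time order.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def backward_fill(values):
    """Replace each None with the next non-missing value after it."""
    return forward_fill(values[::-1])[::-1]

# Hypothetical time-ordered categorical series.
status = [None, "open", None, None, "closed", None]

ffilled = forward_fill(status)   # the leading gap has no earlier value and stays None
bfilled = backward_fill(status)  # the trailing gap has no later value and stays None
```

Gaps at the start (forward fill) or end (backward fill) remain missing, so the two are sometimes applied one after the other.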

Jayashree Dommara

Actively seeking Full-Time Data Roles | MSBA'24 at UIUC | Decision Scientist | Ex- Mu Sigma
Predictive modelling can be used to predict the missing categorical values based on other features. This can be done by making sure that the feature with the missing values can be predicted based on the values of other variables or features, and you have sufficient data for training a model.
