Data Preparation for Machine Learning: Why it's important
[Image source: www.talend.com]

Machine learning (ML) algorithms help us find patterns in data. We then use these patterns to make predictions about new data points. If we feed poor-quality data into an ML model, the model itself will inevitably be of poor quality, and its predictions will not be reliable.

Real-world data is often incomplete and inconsistent, contains errors, and may lack certain behaviours. A crucial ML phase is therefore data preparation: the process of constructing the dataset correctly and transforming raw data into an efficient, useful format. In this post, let us briefly cover the need for data preparation and the primary challenges we often encounter when working with real-world datasets.

Insufficient Data

If we have only a few records to feed into an ML model, the model will probably perform poorly in the prediction stage, showing signs of either overfitting or underfitting. Unfortunately, there is no great solution for insufficient data; we may simply need to wait until we collect more. Still, there are some techniques we can adopt, though they are not applicable to every use case. We can prefer simpler models with fewer parameters to tune, such as a Naive Bayes classifier or logistic regression, which can perform quite well with less data and are also less susceptible to overfitting.
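
To make this concrete, here is a minimal sketch using scikit-learn (the toy dataset and the 60-row subsample are illustrative only, not from this post): a plain logistic regression evaluated with cross-validation on a small sample.

```python
# A rough sketch, assuming scikit-learn; the dataset and 60-row subsample are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=60, replace=False)   # pretend we only collected 60 rows
X_small, y_small = X[idx], y[idx]

model = LogisticRegression(max_iter=1000)          # few parameters, less prone to overfitting
scores = cross_val_score(model, X_small, y_small, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
```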

Ensemble techniques - several learners combined to achieve better prediction performance than any of the learners individually - can also perform well even with insufficient data. For deep learning (DL) applications, we can benefit from pretrained models (transfer learning). Other methods include increasing the size of a dataset with data augmentation or synthetic data generation.
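
As a rough sketch of the ensemble idea, assuming scikit-learn and an illustrative dataset, a soft-voting classifier can combine a few simple learners and average their predicted probabilities:

```python
# A rough sketch, assuming scikit-learn; the dataset is illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
print(f"Mean CV accuracy: {cross_val_score(ensemble, X, y, cv=5).mean():.3f}")
```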

Excessive Data

It may sound strange, but having too much data and using too many irrelevant features can also degrade the performance of ML algorithms (the curse of dimensionality). The principal remedies are Feature Selection, Dimensionality Reduction, and Feature Engineering.

Feature Selection is the process of selecting a subset of relevant features for model building. Put differently, you only provide the features that really matter. Feature Selection techniques such as filter and wrapper methods can help models train faster, achieve higher accuracy, and reduce overfitting.
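
For illustration, here is a minimal sketch with scikit-learn (the dataset and the choice of keeping 10 features are arbitrary) showing a filter method and a wrapper method side by side:

```python
# A rough sketch, assuming scikit-learn; the dataset and k=10 are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Filter method: score each feature independently (ANOVA F-test) and keep the top 10.
X_filtered = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)
print("Filter method kept:", X_filtered.shape[1], "features")

# Wrapper method: repeatedly fit a model and eliminate the weakest features.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapped = rfe.fit_transform(StandardScaler().fit_transform(X), y)
print("Wrapper method kept:", X_wrapped.shape[1], "features")
```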

Despite some similarity to Feature Selection, Dimensionality Reduction methods such as Principal Component Analysis (PCA) transform the features into a lower-dimensional space, creating a new set of variables from the existing features. Removing redundant data can lead to a reduction in computational time and better generalization to new data points.
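
A minimal PCA sketch, assuming scikit-learn and an illustrative dataset, keeping just enough components to explain 95% of the variance:

```python
# A rough sketch, assuming scikit-learn; the dataset and 95% threshold are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Standardize first so large-scale features do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)   # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
```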

Feature Engineering refers to the process of aggregating very low-level data into useful features. We basically create new features from raw data to improve the predictive power of the model.
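
As a small illustration with pandas (the table and column names are hypothetical), low-level transaction rows can be aggregated into per-customer features:

```python
# A rough sketch, assuming pandas; the data and column names are hypothetical.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [20.0, 35.5, 12.0, 80.0, 5.0, 60.0],
})

# Aggregate raw transactions into one feature row per customer.
customer_features = transactions.groupby("customer_id")["amount"].agg(
    total_spend="sum",
    avg_spend="mean",
    n_purchases="count",
).reset_index()
print(customer_features)
```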

In addition, outdated historical data can also cause poor performance, because the relationship between features and labels may change over time, and a model trained on too many stale historical rows may fail to keep up.

Non-Representative Data

For a model to generalize well, it is vital that the training data be an accurate reflection of the true population. If we provide the model with irrelevant features, inaccurate variables, or biased (skewed) data, it is very likely to produce poor predictions. Therefore, we need to make sure that our derived dataset is a true representation of the population we are dealing with.

Missing Data

One of the most common situations in the data preparation process is dealing with rows that have missing values. Incomplete observations can adversely affect the performance of ML algorithms, so they need to be taken care of.

We can delete rows with missing data, but we may end up reducing the sample size significantly and losing valuable information (this can also introduce bias).
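
A tiny pandas sketch of this trade-off (the table is made up): dropping incomplete rows is simple, but it is worth checking how much of the sample is lost:

```python
# A rough sketch, assuming pandas; the data is made up.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "income": [50000, 62000, np.nan, 58000]})

complete = df.dropna()   # drop every row containing at least one missing value
print(f"Kept {len(complete)} of {len(df)} rows "
      f"({100 * (1 - len(complete) / len(df)):.0f}% of the data discarded)")
```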

As a best practice, data imputation, that is, filling in missing values based on the known data, is a common technique for dealing with incomplete sets. There are several imputation methods. We can fill in missing values with the column average (at the risk of weakening correlations between columns), the median, the mode, or nearby data points (Univariate Imputation). Furthermore, we can build regression models on the other columns to predict the missing values in each column (Multivariate Imputation). Using models to predict the missing data also tends to strengthen correlations between columns.
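
Here is a minimal sketch with scikit-learn (the small array is illustrative only) contrasting univariate imputation with a column statistic and multivariate imputation that predicts each gap from the other columns:

```python
# A rough sketch, assuming scikit-learn; the tiny array is illustrative only.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to use IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Univariate: fill each column with its own mean (median/mode strategies also exist).
print(SimpleImputer(strategy="mean").fit_transform(X))

# Multivariate: model each column as a function of the others and predict the gaps.
print(IterativeImputer(random_state=0).fit_transform(X))
```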

Another technique is Hot-Deck Imputation, which sorts records based on some criterion and fills incomplete rows with previously available data points. This method can be quite effective when we are dealing with time-series data.
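
For time-series data, a simple related strategy is to carry the last observed value forward. A minimal pandas sketch with made-up sensor readings:

```python
# A rough sketch, assuming pandas; the readings are made up.
import numpy as np
import pandas as pd

readings = pd.Series(
    [10.1, np.nan, 10.4, np.nan, np.nan, 11.0],
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
)
# Forward fill: reuse the most recent observed value for each gap.
print(readings.ffill())
```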

Duplicate Data

During data collection, we might end up with duplicate observations. These duplicate rows can bias our model by skewing the dataset. If we can detect duplicates, we can easily remove them before training a model. However, with real-time streaming applications, de-duplicating data can become quite difficult; in that case, we just have to account for it and live with it.
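
A minimal pandas sketch (with a made-up table) of detecting and dropping exact duplicate rows before training:

```python
# A rough sketch, assuming pandas; the data is made up.
import pandas as pd

df = pd.DataFrame({"feature": [1, 2, 2, 3], "label": [0, 1, 1, 0]})

print(f"{df.duplicated().sum()} duplicate row(s) found")
df_unique = df.drop_duplicates()   # keep the first occurrence of each row
print(df_unique)
```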

Outliers

Dealing with outliers is an important part of data cleaning and helps increase the accuracy of ML models. An outlier is a data point that lies significantly far from the other data points in the same dataset. It can be an entire row or just some features within a row. We can identify outliers by looking at the variance, the distance from the mean, or the distance from a fitted line.

As a general rule, points that lie more than 3 standard deviations from the mean are often considered outliers.

Once we identify outliers, we can drop those rows if they are incorrect observations, cap/floor them at +/- 3 standard deviations (we need to standardize the dataset first), or set the outlying value to the mean of that feature if only one attribute in the row is erroneous. We can even leave outliers as-is if they are legitimate observations that are valuable for the model.
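
A minimal sketch with pandas and NumPy (the data is synthetic) that flags points more than 3 standard deviations from the mean and then caps them instead of dropping the rows:

```python
# A rough sketch, assuming pandas and NumPy; the data is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(loc=12, scale=1, size=20), 95.0))  # 95 is an obvious outlier

z_scores = (values - values.mean()) / values.std()
print("Flagged as outliers:")
print(values[z_scores.abs() > 3])

# Cap/floor at mean +/- 3 standard deviations instead of dropping the rows.
lower = values.mean() - 3 * values.std()
upper = values.mean() + 3 * values.std()
print(values.clip(lower, upper).tail())
```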

Conclusion

This post briefly covered a few challenges with real-world datasets and why we need to deal with them. Data preparation is a crucial phase of the ML development cycle and has a tremendous impact on the performance of predictive models. Therefore, we first have to construct an accurate dataset before moving on to the training phase. We should also keep in mind that data preprocessing methods vary from dataset to dataset, and not every technique is applicable to every case.

In the next posts, we will see some data preprocessing implementations using Python and relevant ML libraries.

Happy Coding.

Baris Iskender






