Data Preparation for Machine Learning: Why it's important
[Image source: www.talend.com]

Machine learning (ML) algorithms help us find patterns in data. We then use these patterns to make predictions about new data points. If we feed poor-quality data into an ML model, the model itself will inevitably be of poor quality, and its predictions will not be reliable.

Real-world data is often incomplete and inconsistent, contains errors, and may lack certain behaviours. A crucial ML phase is therefore data preparation: the process of constructing the dataset correctly and transforming raw data into an efficient, useful format. In this post, let us briefly cover the need for data preparation and the primary challenges we often encounter when working with real-world datasets.

Insufficient Data

If we have only a few records to feed into an ML model, the model will probably perform poorly in the prediction stage, showing signs of either overfitting or underfitting. Unfortunately, there is no great solution for insufficient data; we may simply need to wait until we collect more. Still, there are some techniques we can adopt, though they are not applicable to every use case. We can prefer simpler models with fewer parameters to tune, such as a Naive Bayes classifier or logistic regression, which can perform quite well with less data and are also less susceptible to overfitting.
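
To make this concrete, here is a minimal sketch using scikit-learn (the toy dataset and the 60-row subsample are illustrative only, not from this post): a plain logistic regression evaluated with cross-validation on a small sample.

```python
# A rough sketch, assuming scikit-learn; the dataset and 60-row subsample are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=60, replace=False)   # pretend we only collected 60 rows
X_small, y_small = X[idx], y[idx]

model = LogisticRegression(max_iter=1000)          # few parameters, less prone to overfitting
scores = cross_val_score(model, X_small, y_small, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
```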

Ensemble techniques - several learners combined to achieve better prediction performance than any of the learners individually - can also perform well even with insufficient data. For deep learning (DL) applications, we can benefit from pretrained models (transfer learning). Other methods include increasing the size of a dataset with data augmentation or synthetic data generation.
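
As a rough sketch of the ensemble idea, assuming scikit-learn and an illustrative dataset, a soft-voting classifier can combine a few simple learners and average their predicted probabilities:

```python
# A rough sketch, assuming scikit-learn; the dataset is illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
print(f"Mean CV accuracy: {cross_val_score(ensemble, X, y, cv=5).mean():.3f}")
```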

Excessive Data

It may sound strange, but having too much data and using too many irrelevant features can also degrade the performance of ML algorithms (the curse of dimensionality). The principal remedies are Feature Selection, Dimensionality Reduction, and Feature Engineering.

Feature Selection is the process of selecting a subset of relevant features for model building. Put differently, you only provide the features that really matter. Feature Selection techniques such as filter and wrapper methods can help models train faster, achieve higher accuracy, and reduce overfitting.
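
For illustration, here is a minimal sketch with scikit-learn (the dataset and the choice of keeping 10 features are arbitrary) showing a filter method and a wrapper method side by side:

```python
# A rough sketch, assuming scikit-learn; the dataset and k=10 are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Filter method: score each feature independently (ANOVA F-test) and keep the top 10.
X_filtered = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)
print("Filter method kept:", X_filtered.shape[1], "features")

# Wrapper method: repeatedly fit a model and eliminate the weakest features.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapped = rfe.fit_transform(StandardScaler().fit_transform(X), y)
print("Wrapper method kept:", X_wrapped.shape[1], "features")
```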

Despite some similarity to Feature Selection, Dimensionality Reduction methods such as Principal Component Analysis (PCA) transform the features into a lower-dimensional space, creating a new set of variables from the existing features. Removing redundant data can lead to a reduction in computational time and better generalization to new data points.
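
A minimal PCA sketch, assuming scikit-learn and an illustrative dataset, keeping just enough components to explain 95% of the variance:

```python
# A rough sketch, assuming scikit-learn; the dataset and 95% threshold are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Standardize first so large-scale features do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)   # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
```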

Feature Engineering refers to the process of aggregating very low-level data into useful features. We basically create new features from raw data to improve the predictive power of the model.
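
As a small illustration with pandas (the table and column names are hypothetical), low-level transaction rows can be aggregated into per-customer features:

```python
# A rough sketch, assuming pandas; the data and column names are hypothetical.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [20.0, 35.5, 12.0, 80.0, 5.0, 60.0],
})

# Aggregate raw transactions into one feature row per customer.
customer_features = transactions.groupby("customer_id")["amount"].agg(
    total_spend="sum",
    avg_spend="mean",
    n_purchases="count",
).reset_index()
print(customer_features)
```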

In addition, outdated historical data can also cause poor performance, because the relationship between features and labels may change over time, and a model trained on too many stale historical rows may fail to keep up.

Non-Representative Data

For a model to generalize well, it is vital that the training data be an accurate reflection of the true population. If we provide the model with irrelevant features, inaccurate variables, or biased (skewed) data, it is very likely to produce poor predictions. Therefore, we need to make sure that our derived dataset is a true representation of the population we are dealing with.

Missing Data

One of the most common situations in the data preparation process is dealing with rows that have missing values. Incomplete observations can adversely affect the performance of ML algorithms, so they need to be taken care of.

We can delete rows with missing data, but we may end up reducing the sample size significantly and losing valuable information (this can also introduce bias).
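
A tiny pandas sketch of this trade-off (the table is made up): dropping incomplete rows is simple, but it is worth checking how much of the sample is lost:

```python
# A rough sketch, assuming pandas; the data is made up.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "income": [50000, 62000, np.nan, 58000]})

complete = df.dropna()   # drop every row containing at least one missing value
print(f"Kept {len(complete)} of {len(df)} rows "
      f"({100 * (1 - len(complete) / len(df)):.0f}% of the data discarded)")
```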

As a best practice, data imputation, that is, filling in missing values based on the known data, is a common technique for dealing with incomplete sets. There are several imputation methods. We can fill in missing values with the column average (at the risk of weakening correlations between columns), the median, the mode, or nearby data points (Univariate Imputation). Furthermore, we can build regression models on the other columns to predict the missing values in each column (Multivariate Imputation). Using models to predict the missing data also tends to strengthen correlations between columns.
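
Here is a minimal sketch with scikit-learn (the small array is illustrative only) contrasting univariate imputation with a column statistic and multivariate imputation that predicts each gap from the other columns:

```python
# A rough sketch, assuming scikit-learn; the tiny array is illustrative only.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to use IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Univariate: fill each column with its own mean (median/mode strategies also exist).
print(SimpleImputer(strategy="mean").fit_transform(X))

# Multivariate: model each column as a function of the others and predict the gaps.
print(IterativeImputer(random_state=0).fit_transform(X))
```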

Another technique is Hot-Deck Imputation, which sorts records based on some criterion and fills incomplete rows with previously available data points. This method can be quite effective when we are dealing with time-series data.
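
For time-series data, a simple related strategy is to carry the last observed value forward. A minimal pandas sketch with made-up sensor readings:

```python
# A rough sketch, assuming pandas; the readings are made up.
import numpy as np
import pandas as pd

readings = pd.Series(
    [10.1, np.nan, 10.4, np.nan, np.nan, 11.0],
    index=pd.date_range("2023-01-01", periods=6, freq="D"),
)
# Forward fill: reuse the most recent observed value for each gap.
print(readings.ffill())
```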

Duplicate Data

During data collection, we might end up with duplicate observations. These duplicate rows can bias our model by skewing the dataset. If we can detect duplicates, we can easily remove them before training a model. However, with real-time streaming applications, de-duplicating data can become quite difficult; in that case, we just have to account for it and live with it.
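
A minimal pandas sketch (with a made-up table) of detecting and dropping exact duplicate rows before training:

```python
# A rough sketch, assuming pandas; the data is made up.
import pandas as pd

df = pd.DataFrame({"feature": [1, 2, 2, 3], "label": [0, 1, 1, 0]})

print(f"{df.duplicated().sum()} duplicate row(s) found")
df_unique = df.drop_duplicates()   # keep the first occurrence of each row
print(df_unique)
```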

Outliers

Dealing with outliers is an important part of data cleaning and helps increase the accuracy of ML models. An outlier is a data point that lies significantly far from the other data points in the same dataset. It can be an entire row or just some features within a row. We can identify outliers by looking at the variance, the distance from the mean, or the distance from a fitted line.

As a general rule, points that lie more than 3 standard deviations from the mean are often considered outliers.

Once we identify outliers, we can drop those rows if they are incorrect observations, cap/floor them at +/- 3 standard deviations (we need to standardize the dataset first), or set the outlying value to the mean of that feature if only one attribute in the row is erroneous. We can even leave outliers as-is if they are legitimate observations that are valuable for the model.
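
A minimal sketch with pandas and NumPy (the data is synthetic) that flags points more than 3 standard deviations from the mean and then caps them instead of dropping the rows:

```python
# A rough sketch, assuming pandas and NumPy; the data is synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(loc=12, scale=1, size=20), 95.0))  # 95 is an obvious outlier

z_scores = (values - values.mean()) / values.std()
print("Flagged as outliers:")
print(values[z_scores.abs() > 3])

# Cap/floor at mean +/- 3 standard deviations instead of dropping the rows.
lower = values.mean() - 3 * values.std()
upper = values.mean() + 3 * values.std()
print(values.clip(lower, upper).tail())
```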

Conclusion

This post briefly covered a few challenges with real-world datasets and why we need to deal with them. Data preparation is a crucial phase of the ML development cycle and has a tremendous impact on the performance of predictive models. Therefore, we first have to construct an accurate dataset before moving on to the training phase. We should also keep in mind that data preprocessing methods vary from dataset to dataset, and not every technique is applicable to every case.

In the next posts, we will see some data preprocessing implementations using Python and relevant ML libraries.

Happy Coding.

Baris Iskender






