Found issues while preparing data; now what?

Data preparation is an essential step in the machine learning process. It typically comes right after exploring the problem, identifying data sources, and descriptive data exploration.

Once you have identified the data source, you gather data from it and explore it using techniques such as box plots, correlation matrices, and scatter plots to understand the data. While exploring, you are likely to find some data issues. Let us see how to tackle them. This article covers possible solutions at a high level; you can follow through and research more as needed.

Data can be incomplete

Having more data is usually better for data science. When you have only a few data elements, try to enrich the data by adding more attributes from other sources, including external data sources.
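As a rough illustration of enrichment, the sketch below joins a small data set to a second source on a shared key using pandas. The tables, column names, and values are all made up for this example.

```python
import pandas as pd

# Hypothetical core data set with only a few attributes.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 51, 27],
})

# Hypothetical external source keyed on the same identifier.
regions = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "region": ["north", "south", "east"],
})

# A left join keeps every original row and adds the new attribute
# wherever the external source has a matching key.
enriched = customers.merge(regions, on="customer_id", how="left")
print(enriched)
```

Rows without a match in the external source end up with a missing value in the new column, which leads to the next issue.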

Data can be missing

When you have many instances where values are null or simply not present, you can use imputation logic for such cases.

For example, you can fill in missing values with the mean, convert null values to NaN so they are handled consistently, or eliminate those instances altogether.
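A minimal sketch of two of these options with pandas, assuming a small made-up table (the column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing city.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Pune", None, "Austin", "Austin"],
})

# Option 1: fill the numeric column with its mean.
# Series.mean() skips missing values by default.
mean_imputed = df.assign(age=df["age"].fillna(df["age"].mean()))

# Option 2: eliminate every row that has any missing value.
dropped = df.dropna()

print(mean_imputed)
print(dropped)
```

Mean imputation keeps the row count but shrinks the variance of the column, so it is worth checking how much of the column is being filled in.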

Data can be untidy

When you have one column holding multiple variables, or variables stored in both rows and columns, you can use techniques like pivot/un-pivot; the most commonly used method is the melt and cast process.
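The melt and cast idea can be sketched with pandas, where `melt` un-pivots and `pivot` plays the role of cast. The sales table below is invented for illustration.

```python
import pandas as pd

# Hypothetical untidy table: the "year" variable is spread across column headers.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "2022": [100, 80],
    "2023": [120, 90],
})

# Melt: one row per (store, year) observation.
long = wide.melt(id_vars="store", var_name="year", value_name="sales")

# Cast: pivot the long table back into one column per year.
back = long.pivot(index="store", columns="year", values="sales").reset_index()

print(long)
print(back)
```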

Data can be sparse

When you have sparse data, try to change the data representation using techniques like the COO (coordinate format) matrix. If there are many zeros, then you can normalize the data.
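To show what the COO representation looks like, here is a small sketch using SciPy's `coo_matrix`. The dense array is a made-up example; only the non-zero entries are stored, as (row, column, value) triples.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical mostly-zero matrix.
dense = np.array([
    [0, 0, 3],
    [0, 0, 0],
    [4, 0, 0],
])

# COO keeps three parallel arrays: row indices, column indices, and values.
sparse = coo_matrix(dense)

print(sparse.row, sparse.col, sparse.data)
print(sparse.nnz, "stored values instead of", dense.size)
```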

Data may have high cardinality

When you have a cardinality issue, you can use binning to group the values into a smaller number of categories, or avoid using the column altogether when it carries no signal, e.g., the record's primary key.
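A minimal sketch of both ideas with pandas: dropping an identifier column and binning a numeric column with `pd.cut`. The column names, bin edges, and labels are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical records: "record_id" is unique per row, "age" has many distinct values.
df = pd.DataFrame({
    "record_id": [101, 102, 103, 104, 105],
    "age": [19, 23, 37, 45, 68],
})

# The primary key identifies a row but does not describe it, so leave it out.
features = df.drop(columns=["record_id"])

# Bin ages into three bands; the intervals are (0, 25], (25, 50], (50, 120].
features["age_band"] = pd.cut(
    features["age"],
    bins=[0, 25, 50, 120],
    labels=["young", "middle", "senior"],
)

print(features)
```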

Data with varying scales

When attributes are measured on very different scales, you can rescale them to a common range.
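One common way to rescale is min-max scaling, which maps each column to the range 0 to 1. The sketch below uses pandas on made-up data; min-max is only one of several possible rescaling choices.

```python
import pandas as pd

# Hypothetical attributes on very different scales.
df = pd.DataFrame({
    "income": [30000, 60000, 90000],
    "age": [20, 40, 60],
})

# Min-max rescaling, applied column by column.
# Note: a constant column would produce a division by zero here.
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled)
```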

Data have outliers

When you have outliers, you can use discretization (binning) or winsorizing, which reduce the influence of extreme values. Typical cases include out-of-range values and unknown categorical values.
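Winsorizing can be sketched with SciPy's `scipy.stats.mstats.winsorize`, which replaces the most extreme values at each end with the nearest remaining value. The sample values and the 10% limits are assumptions for illustration.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical measurements with one obvious outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)

# Replace the lowest 10% and highest 10% of values with the nearest kept value.
# winsorize returns a masked array, so convert it back to a plain array.
winsorized = np.asarray(winsorize(values, limits=[0.1, 0.1]))

print(winsorized)
```

The row count is unchanged; only the extreme values are pulled in toward the rest of the data.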

Data have lots of features

When you have many columns, you can reduce the data set by eliminating unwanted features, for example by using univariate selection to keep the features that have a strong relationship with the target variable.
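As one possible way to do univariate selection, the sketch below uses scikit-learn's `SelectKBest` with the `f_classif` score. The data is synthetic: one column is built to follow the target and the other two are pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic binary target.
y = np.array([0] * 50 + [1] * 50)

# One feature that tracks the target closely, plus two unrelated noise features.
informative = y + rng.normal(scale=0.1, size=100)
noise_a = rng.normal(size=100)
noise_b = rng.normal(size=100)
X = np.column_stack([noise_a, informative, noise_b])

# Score each feature against the target on its own and keep the best one.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
selected = selector.get_support(indices=True)

print("scores:", selector.scores_)
print("selected column indices:", selected)
```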

Data have many dimensions

When you have many dimensions, you can use dimension reduction techniques like PCA (Principal Component Analysis) to reduce the number of dimensions while preserving the patterns in the data.
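A minimal PCA sketch with scikit-learn, assuming synthetic data in which five observed features are driven by two hidden factors plus a little noise. In that setting two components should retain almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two hidden factors for 200 synthetic samples.
base = rng.normal(size=(200, 2))

# Five observed features: linear mixes of the two factors plus small noise.
mixing = rng.normal(size=(2, 5))
X = base @ mixing + rng.normal(scale=0.01, size=(200, 5))

# Project the five features down to two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print(reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```

On real data the retained variance is usually lower, so it helps to inspect `explained_variance_ratio_` before deciding how many components to keep.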

Christopher Bergh

CEO & Head Chef, DataKitchen: observe & automate every Data Journey so that data teams find problems fast and fix them forever! Author: DataOps Cookbook, DataOps Manifesto. Open Source Data Quality & Observability!

2 years ago

Great article. There are always issues in data -- the question is have you found them before your customer?
