Found issues while preparing data; now what?
Milind Zodge
Data & Cloud Executive | 25+ Years Driving Data Strategy, Architecture & Innovation | AI, DataOps, Data Reliability, Governance, Cloud & Modernization Leader
Data preparation is an essential step in the machine learning process. It typically comes right after exploring the problem, identifying data sources, and descriptive data exploration.
Once you have identified the data source, you gather data from it and explore the data using techniques like box plots, correlation matrices, and scatter plots. While exploring, you will often see some data issues. Let us see how to tackle them. This article covers possible solutions at a high level; you can follow through and research more as needed.
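As a rough illustration of the exploration step, the sketch below computes a correlation matrix and counts values outside the 1.5 x IQR whiskers that a box plot would draw. The column names and numbers are made-up toy data, not from any real source.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 58.0, 66.0, 74.0, 82.0],
    "shoe_size": [38.0, 44.0, 40.0, 42.0, 39.0],
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

# Box-plot style check: count values beyond 1.5 * IQR from the quartiles.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outside = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()

print(corr.round(2))
print(outside)
```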
Data can be incomplete
Having more data is usually better for data science. When you have only a few data elements, try to enrich the data by getting more attributes from different sources, including external data sources.
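One possible way to enrich a data set is to join it with an external table on a shared key. The sketch below assumes pandas; the tables, the `zip_code` key, and the income figures are all hypothetical.

```python
import pandas as pd

# Hypothetical internal data with only a few attributes.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "zip_code": ["10001", "60601", "94105"],
})

# Hypothetical external source keyed by zip code (made-up values).
external = pd.DataFrame({
    "zip_code": ["10001", "94105"],
    "median_income": [85000, 120000],
})

# A left join keeps every customer, even when no external match exists.
enriched = customers.merge(external, on="zip_code", how="left")
print(enriched)
```

Customers without a match get a missing `median_income`, which leads straight into the next issue.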
Data can be missing
When you have many instances where values are null or simply not present, you can use imputation logic for such cases, e.g., using mean values to fill in for missing data or converting null values to NaN so they are handled consistently. Alternatively, you may want to eliminate those instances.
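A minimal sketch of both options, assuming pandas and a made-up toy table: mean imputation for a numeric column, and dropping incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Pune", "Boston", None, "Boston"],
})

# Option 1: fill missing numeric values with the column mean.
mean_age = df["age"].mean()  # mean() skips NaN by default
imputed = df.assign(age=df["age"].fillna(mean_age))

# Option 2: eliminate every row that has any missing value.
dropped = df.dropna()

print(imputed)
print(dropped)
```

Mean imputation keeps all rows but shrinks the variance of the column; dropping rows keeps the values honest but loses data. Which one fits depends on how much is missing and why.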
Data can be untidy
When you have one column holding multiple variables, or variables stored in both rows and columns, you can use techniques like pivot/unpivot; the most commonly used method is the melt and cast process.
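In pandas, melt and cast correspond roughly to `melt` and `pivot`. The sketch below uses a hypothetical wide table where month names are stored as column headers.

```python
import pandas as pd

# Hypothetical untidy table: the "month" variable is spread across columns.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [10, 20],
    "feb": [15, 25],
})

# Melt: one row per (store, month) observation.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Cast: pivot the long form back to one column per month.
back = long.pivot(index="store", columns="month", values="sales")

print(long)
print(back)
```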
Data can be sparse
When you have sparse data, try to change the data representation using techniques like the COO (coordinate format) matrix, which stores only the non-zero entries. If there are many zeros, then you can normalize the data.
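As a small sketch, assuming SciPy is available, a mostly-zero array can be converted to COO form, which keeps only the row index, column index, and value of each non-zero entry. The array is a toy example.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy dense matrix that is mostly zeros.
dense = np.array([
    [0, 0, 3],
    [0, 0, 0],
    [4, 0, 0],
])

# COO stores three parallel arrays: row indices, column indices, values.
sparse = coo_matrix(dense)

print(sparse.row, sparse.col, sparse.data)
print(sparse.nnz, "non-zero entries out of", dense.size)
```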
Data may have high cardinality
When you have a cardinality issue, you can use binning to group many distinct values into a few buckets, or avoid using the column altogether, e.g., the record's primary key.
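One way to bin a numeric column is `pandas.cut`. The bin edges and labels below are arbitrary choices for illustration.

```python
import pandas as pd

# Toy column with many distinct values.
ages = pd.Series([3, 17, 25, 42, 68, 90])

# Hypothetical bin edges and labels; intervals are right-inclusive by default.
bins = [0, 18, 40, 65, 120]
labels = ["child", "young_adult", "adult", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels)

print(age_group.tolist())
print(ages.nunique(), "distinct values reduced to", age_group.nunique())
```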
Data with varying scales
When attributes are on very different scales, you can rescale them so they are comparable.
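Two common rescaling options are min-max scaling to the range [0, 1] and standardization to zero mean and unit variance. The sketch below does both by hand with pandas on made-up data.

```python
import pandas as pd

# Toy data with attributes on very different scales.
df = pd.DataFrame({
    "income": [30000.0, 60000.0, 90000.0],
    "age": [20.0, 30.0, 40.0],
})

# Min-max scaling: each column is mapped onto [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: zero mean, unit (population) standard deviation.
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```

When a separate test set is involved, the minimum, maximum, mean, and standard deviation would normally be computed on the training data only and then reused.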
Data have outliers
When you have outliers, you can use discretization or winsorizing, which limit the influence of extreme values, e.g., capping out-of-range values at a boundary value, replacing them with the mean, grouping unknown categorical values, or binning.
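A minimal winsorizing sketch with NumPy: the `k` smallest and `k` largest values are replaced by their nearest retained neighbours. The data and the choice of `k = 1` are assumptions for illustration.

```python
import numpy as np

# Toy data with one extreme value.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

k = 1  # assumed number of values to cap at each end
sorted_vals = np.sort(values)
low, high = sorted_vals[k], sorted_vals[-k - 1]

# Values below `low` become `low`; values above `high` become `high`.
winsorized = np.clip(values, low, high)

print(winsorized)
```

Unlike trimming, this keeps the number of observations unchanged.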
Data have lots of features
When you have many columns, you can reduce the data set by eliminating unwanted features, using the univariate selection technique, and selecting features with a strong relationship with the target variable.
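As one possible illustration of univariate selection, the sketch below uses scikit-learn's `SelectKBest` with the ANOVA F-test (`f_classif`). The data is synthetic: one column is built to track the target and the other three are pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic binary target and features (assumed setup for illustration).
y = np.array([0] * 50 + [1] * 50)
informative = y + rng.normal(0, 0.1, size=100)  # strongly tied to y
noise = rng.normal(0, 1, size=(100, 3))         # unrelated to y
X = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# Score each feature on its own against y and keep the best one.
selector = SelectKBest(score_func=f_classif, k=1)
X_selected = selector.fit_transform(X, y)
selected = selector.get_support(indices=True)

print("selected column indices:", selected)
```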
Data have many dimensions
When you have many dimension features, you can use dimension reduction techniques like PCA (Principal Component Analysis) to reduce dimensions but preserve data patterns.
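A short PCA sketch, assuming scikit-learn. The three synthetic columns below are all driven by one hidden variable, so a single component should retain almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: three correlated columns built from one latent variable.
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.05, size=200),
    -t + rng.normal(scale=0.05, size=200),
])

# Project the three columns onto a single principal component.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)
```

On real data, the explained variance ratio is a reasonable guide to how many components to keep.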