Found issues while preparing data; now what?

Milind Zodge

Data & Cloud Executive | 25+ Years Driving Data Strategy, Architecture & Innovation | AI, DataOps, Data Reliability, Governance, Cloud & Modernization Leader

Published: June 27, 2022

Data preparation is an essential step in the machine learning process. It typically comes right after exploring the problem, identifying data sources, and running descriptive data exploration.

Once you have identified the data source, you gather data from it and explore that data using techniques such as box plots, correlation matrices, and scatter plots. While exploring, you will likely spot some data issues. Let us see how to tackle them. This article covers possible solutions at a high level; you can follow through and research each one further as needed.

Data can be incomplete

Having more data is usually better for data science. When you have only a few data elements, try to enrich the data by bringing in more attributes from other sources, including external ones.
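As a rough illustration of enrichment, the sketch below left-joins attributes from a second source onto the core records. The field names (`customer_id`, `age`, `income`) and the `enrich` helper are hypothetical and exist only for this example.

```python
# Hypothetical core data set and an external source keyed by the same id.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
    {"customer_id": 3, "age": 27},
]
external_income = {
    1: {"income": 52000},
    2: {"income": 61000},
    # customer 3 is absent from the external source
}


def enrich(rows, lookup, key, new_fields):
    """Left-join extra attributes onto each row; unmatched rows get None."""
    enriched = []
    for row in rows:
        extra = lookup.get(row[key], {})
        merged = dict(row)  # copy so the original row is not modified
        for field in new_fields:
            merged[field] = extra.get(field)
        enriched.append(merged)
    return enriched


enriched_customers = enrich(customers, external_income, "customer_id", ["income"])
```

Rows without a match keep a `None` in the new column, which then becomes a missing-data problem to handle in the next step.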

Data can be missing

When many values are null or simply not present, you can apply imputation logic.

For example, you can fill in missing values with the column mean, convert null values to NaN so they are handled consistently, or eliminate the affected instances altogether.
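A minimal sketch of two of these options, mean imputation and dropping incomplete rows, using only the standard library. Missing values are assumed to be represented as `None`; the helper names are made up for this example.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def drop_missing(rows):
    """Alternative: keep only rows in which no field is None."""
    return [row for row in rows if all(v is not None for v in row.values())]


ages = [30, None, 40, 50, None]
imputed_ages = impute_mean(ages)  # the observed mean of 30, 40, 50 fills the gaps
```

Mean imputation keeps every row but shrinks the variance of the column, while dropping rows keeps the distribution intact at the cost of data volume.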

Data can be untidy

When one column holds multiple variables, or variables are spread across both rows and columns, you can reshape the data with techniques such as pivot/unpivot; the most commonly used method is the melt and cast process.
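To make melt and cast concrete, here is a small standard-library sketch. It assumes rows are stored as dictionaries; `melt` and `cast` below are illustrative helpers written for this example, not a particular library's API.

```python
def melt(rows, id_field, value_fields, var_name="variable", value_name="value"):
    """Wide to long: emit one row per (id, variable) pair."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            long_rows.append(
                {id_field: row[id_field], var_name: field, value_name: row[field]}
            )
    return long_rows


def cast(long_rows, id_field, var_name="variable", value_name="value"):
    """Long to wide: emit one row per id with one column per variable."""
    wide = {}
    for row in long_rows:
        target = wide.setdefault(row[id_field], {id_field: row[id_field]})
        target[row[var_name]] = row[value_name]
    return list(wide.values())


# Untidy input: the year is a variable, but it is stored in the column headers.
wide_sales = [
    {"store": "A", "2021": 10, "2022": 12},
    {"store": "B", "2021": 7, "2022": 9},
]
long_sales = melt(wide_sales, "store", ["2021", "2022"], var_name="year", value_name="sales")
round_trip = cast(long_sales, "store", var_name="year", value_name="sales")
```

After melting, each row holds exactly one observation (store, year, sales), which is the shape most modeling tools expect.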

Data can be sparse

When you have sparse data, try changing the data representation, for example to a COO (coordinate format) matrix, which stores only the nonzero entries. If there are many zeros, you can also normalize the data.
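The idea behind the COO representation can be sketched in plain Python: keep three parallel lists (row index, column index, value) for the nonzero entries only. The helpers below are illustrative; in practice a library type such as `scipy.sparse.coo_matrix` would normally be used instead.

```python
def to_coo(dense):
    """Return (rows, cols, values, shape) holding only the nonzero entries."""
    rows, cols, values = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                values.append(v)
    shape = (len(dense), len(dense[0]) if dense else 0)
    return rows, cols, values, shape


def to_dense(rows, cols, values, shape):
    """Rebuild the full matrix from the coordinate lists."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for i, j, v in zip(rows, cols, values):
        dense[i][j] = v
    return dense


dense_matrix = [
    [0, 0, 3],
    [0, 0, 0],
    [4, 0, 0],
]
coo_rows, coo_cols, coo_values, coo_shape = to_coo(dense_matrix)
```

Here nine cells shrink to two stored entries; the saving grows with the share of zeros in the data.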


Data may have high cardinality

When a column has too many distinct values, you can use binning to group them into fewer categories, or avoid using the column at all, e.g., the record's primary key.
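A short sketch of both options, assuming a numeric column for the binning case. The cut points, labels, and field names are invented for illustration.

```python
from bisect import bisect_right


def bin_value(value, edges, labels):
    """Map a number to the label of its interval.

    `edges` are sorted cut points; a value equal to an edge falls into the
    interval that starts at that edge. len(labels) must be len(edges) + 1.
    """
    return labels[bisect_right(edges, value)]


def drop_columns(rows, columns):
    """Remove identifier-like columns that carry no predictive signal."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


edges = [18, 40, 65]
labels = ["under 18", "18-39", "40-64", "65+"]
ages = [5, 18, 39, 40, 64, 65, 90]
age_bins = [bin_value(a, edges, labels) for a in ages]

records = [{"record_id": 1001, "age": 39}, {"record_id": 1002, "age": 64}]
without_key = drop_columns(records, {"record_id"})
```

Seven distinct ages collapse into four categories, and the primary key, which is unique per row by definition, is simply left out of the feature set.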

Data with varying scales

When attributes are on very different scales, you can rescale them to a common range.
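Two common ways to rescale are min-max scaling to the range 0 to 1 and standardization to zero mean and unit variance. The sketch below shows both with the standard library; the article does not say which one is intended, so treat the choice as an assumption.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Linearly map values onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


incomes = [20000, 40000, 60000, 100000]
scaled_incomes = min_max_scale(incomes)
standardized_incomes = standardize(incomes)
```

Min-max scaling is sensitive to extreme values because they define the range, so it is usually applied after outliers have been dealt with.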

Data have outliers

When you have outliers, you can use discretization (binning) or winsorizing, which give less weight to extreme values, e.g., out-of-range values or unknown categorical values.
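Winsorizing can be sketched as clipping the most extreme values to the nearest value that is kept. The version below trims a fixed fraction from each tail; the 10% default and the sample readings are assumptions made for this example.

```python
def winsorize(values, fraction=0.1):
    """Clip the lowest and highest `fraction` of values to the nearest kept value."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    if not values:
        return []
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values to clip in each tail
    lower = ordered[k]
    upper = ordered[len(ordered) - 1 - k]
    return [min(max(v, lower), upper) for v in values]


readings = [12, 14, 15, 15, 16, 17, 18, 19, 21, 250]
clipped_readings = winsorize(readings, fraction=0.1)
```

Unlike deleting the outlier, this keeps the row and its other attributes while limiting how far the extreme value can pull the statistics.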

Data have lots of features

When you have many columns, you can reduce the data set by eliminating unwanted features, for example with univariate selection, which keeps the features that have the strongest relationship with the target variable.
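One simple form of univariate selection scores each feature on its own against the target and keeps the top k. The sketch below uses the absolute Pearson correlation as the score, which is one choice among several; the housing data is made up. Libraries such as scikit-learn offer comparable tools (for example `SelectKBest`).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / sqrt(vx * vy)


def select_k_best(features, target, k):
    """Rank features by |correlation with target| and keep the k strongest."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


features = {
    "square_feet": [800, 1000, 1200, 1500, 2000],
    "bedrooms": [2, 2, 3, 3, 4],
    "house_number": [17, 4, 52, 9, 33],
}
price = [160, 200, 240, 300, 400]  # constructed as 0.2 * square_feet
selected, feature_scores = select_k_best(features, price, k=2)
```

Because each feature is scored in isolation, this approach is cheap but can miss features that only matter in combination with others.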

Data have many dimensions

When you have many dimensions, you can use dimensionality reduction techniques such as PCA (Principal Component Analysis) to reduce the number of dimensions while preserving the patterns in the data.
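To show what PCA does, the sketch below finds only the first principal component with power iteration on the covariance matrix, then projects the data onto it. This is a teaching sketch under simplifying assumptions (small dense data, one component, made-up points); real projects would normally rely on a library implementation.

```python
from math import sqrt


def center(rows):
    """Subtract the column means so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def project(centered, direction):
    """Coordinate of each row along the given unit direction."""
    return [sum(a * b for a, b in zip(row, direction)) for row in centered]


# Two features that move together, roughly y = 2x.
points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
centered_points = center(points)
pc1 = first_component(covariance(centered_points))
scores_1d = project(centered_points, pc1)
```

The two correlated columns are replaced by a single coordinate per row along the direction of greatest variance, which is how the pattern survives the reduction.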

