#DataScience Insight: The Significance of Data Cleaning
Indrajit S.
Senior Data Scientist @ Citi | GenAI | Kaggle Competition Expert | PhD research scholar in Data Science
In Data Science, it's often said that 80% of a data scientist's time is spent finding, cleaning, and organizing data, a process commonly referred to as data cleaning or data cleansing.
Data cleaning is a critical step in the process of transforming raw data into meaningful insights. The quality and accuracy of data directly influence the outcomes of data analysis and interpretations. Errors, inconsistencies, and inaccuracies in data can significantly distort the results and lead to misguided strategies and erroneous decisions.
Step 1: Import Necessary Libraries
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, isnan, col
Step 2: Loading the Data
We begin by loading our dataset. For demonstration purposes, we'll use a simple CSV file:
# Python
data = pd.read_csv('your_file.csv')
# PySpark
spark = SparkSession.builder.getOrCreate()
data = spark.read.csv('your_file.csv', inferSchema=True, header=True)
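A quick inspection right after loading helps catch schema or parsing issues early (a suggested sanity check, not a separate step in the original workflow):
# Python
data.head()          # preview the first few rows
data.info()          # column types and non-null counts
# PySpark
data.printSchema()   # inferred column types
data.show(5)         # preview the first five rows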
Step 3: Handling Missing Values
Data often contains missing values, and these can affect our analysis. They can be handled in several ways: deleting the affected rows, imputing with the mean/median/mode, or using more complex methods such as regression-based imputation.
# Python
data.dropna()        # drops rows containing any NaN/null values (returns a new DataFrame; assign the result to keep it)
data.fillna(value)   # replaces NaN/null values with a specified value
# PySpark
data.na.drop()       # drops rows containing any NaN/null values
data.na.fill(value)  # replaces NaN/null values with a specified value
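Since imputation with the mean or median is mentioned above, here is a minimal sketch of mean imputation for a single numeric column (the column name is illustrative):
# Python
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
# PySpark
from pyspark.sql.functions import mean
mean_value = data.select(mean(col('column_name'))).first()[0]
data = data.na.fill({'column_name': mean_value})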
Step 4: Removing Duplicates
Duplicate records can skew analysis and machine learning model results, so it's important to remove them:
# Python
data.drop_duplicates()
# PySpark
data.dropDuplicates()
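If duplicates should be judged on specific key columns rather than entire rows, both APIs accept a subset (the column name below is illustrative):
# Python
data = data.drop_duplicates(subset=['id_column'], keep='first')
# PySpark
data = data.dropDuplicates(['id_column'])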
Step 5: Dealing with Outliers
Outliers can significantly affect statistical analysis and modeling results. While more sophisticated methods exist, a simple approach is the interquartile range (IQR) method:
# Python (assumes all columns are numeric)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
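The snippet above is pandas only; a rough PySpark equivalent for a single hypothetical numeric column can be sketched with approxQuantile:
# PySpark
q1, q3 = data.approxQuantile('column_name', [0.25, 0.75], 0.0)
iqr = q3 - q1
data = data.filter((col('column_name') >= q1 - 1.5 * iqr) &
                   (col('column_name') <= q3 + 1.5 * iqr))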
Step 6: Data Type Conversion
Data type conversion is a common task in data preprocessing. We often need to convert categorical data to numerical data, and vice versa:
# Python
data['column_name'] = data['column_name'].astype('category')
# PySpark
data = data.withColumn("column_name", data["column_name"].cast("double"))
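To go one step further and encode categories as numbers (the "categorical to numerical" direction mentioned above), a minimal sketch could use category codes in pandas and StringIndexer in PySpark; the column names are illustrative:
# Python
data['column_name'] = data['column_name'].astype('category').cat.codes
# PySpark
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='column_name', outputCol='column_name_indexed')
data = indexer.fit(data).transform(data)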
Step 7: Normalization or Standardization
These techniques scale the values of different features to a similar range:
# Python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform expects a 2-D input, so pass a single-column DataFrame
data['column_name'] = scaler.fit_transform(data[['column_name']])
# PySpark
from pyspark.ml.feature import StandardScaler, VectorAssembler
# Spark's StandardScaler operates on a vector column, so assemble the feature first
assembler = VectorAssembler(inputCols=["column_name"], outputCol="column_vec")
data = assembler.transform(data)
scaler = StandardScaler(inputCol="column_vec", outputCol="scaled_column_name")
data = scaler.fit(data).transform(data)
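The example above covers standardization; for normalization (min-max scaling to the [0, 1] range) a similar sketch could look like this, reusing the assembled vector column from the PySpark snippet above:
# Python
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()
data['column_name'] = min_max.fit_transform(data[['column_name']])
# PySpark
from pyspark.ml.feature import MinMaxScaler
min_max = MinMaxScaler(inputCol='column_vec', outputCol='normalized_column_name')
data = min_max.fit(data).transform(data)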