#DataScience Insight: The Significance of Data Cleaning
Indrajit S.
Senior Data Scientist @ Citi | GenAI | Kaggle Competition Expert | PhD research scholar in Data Science
In Data Science, it's often said that 80% of a data scientist's time is spent finding, cleaning, and organizing data, a process commonly referred to as data cleaning or data cleansing.
Data cleaning is a critical step in the process of transforming raw data into meaningful insights. The quality and accuracy of data directly influence the outcomes of data analysis and interpretations. Errors, inconsistencies, and inaccuracies in data can significantly distort the results and lead to misguided strategies and erroneous decisions.
Step 1: Import Necessary Libraries
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, isnan, col
Step 2: Loading the Data
We begin by loading our dataset. For demonstration purposes, we'll use a simple CSV file:
# Python
data = pd.read_csv('your_file.csv')
# PySpark
spark = SparkSession.builder.getOrCreate()
data = spark.read.csv('your_file.csv', inferSchema=True, header=True)
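A quick inspection right after loading helps catch schema or parsing issues early (a suggested sanity check, not a separate step in the original workflow):
# Python
data.head()          # preview the first few rows
data.info()          # column types and non-null counts
# PySpark
data.printSchema()   # inferred column types
data.show(5)         # preview the first five rows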
Step 3: Handling Missing Values
Data often contains missing values, and these can affect our analysis. They can be handled in several ways: deleting the affected rows, imputing with the mean/median/mode, or using more complex methods such as regression-based imputation.
# Python
data.dropna()        # drops rows containing any NaN/null values (returns a new DataFrame; assign the result to keep it)
data.fillna(value)   # replaces NaN/null values with a specified value
# PySpark
data.na.drop()       # drops rows containing any NaN/null values
data.na.fill(value)  # replaces NaN/null values with a specified value
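Since imputation with the mean or median is mentioned above, here is a minimal sketch of mean imputation for a single numeric column (the column name is illustrative):
# Python
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
# PySpark
from pyspark.sql.functions import mean
mean_value = data.select(mean(col('column_name'))).first()[0]
data = data.na.fill({'column_name': mean_value})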
Step 4: Removing Duplicates
Duplicate records can skew analysis and machine learning model results, so it's important to remove them:
# Python
data.drop_duplicates()
# PySpark
data.dropDuplicates()
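If duplicates should be judged on specific key columns rather than entire rows, both APIs accept a subset (the column name below is illustrative):
# Python
data = data.drop_duplicates(subset=['id_column'], keep='first')
# PySpark
data = data.dropDuplicates(['id_column'])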
Step 5: Dealing with Outliers
Outliers can significantly affect statistical analysis and modeling results. While more sophisticated methods exist, a simple approach is the interquartile range (IQR) method:
# Python (assumes all columns are numeric)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
data = data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
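The snippet above is pandas only; a rough PySpark equivalent for a single hypothetical numeric column can be sketched with approxQuantile:
# PySpark
q1, q3 = data.approxQuantile('column_name', [0.25, 0.75], 0.0)
iqr = q3 - q1
data = data.filter((col('column_name') >= q1 - 1.5 * iqr) &
                   (col('column_name') <= q3 + 1.5 * iqr))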
Step 6: Data Type Conversion
Data type conversion is a common task in data preprocessing. We often need to convert categorical data to numerical data, and vice versa:
# Python
data['column_name'] = data['column_name'].astype('category')
# PySpark
data = data.withColumn("column_name", data["column_name"].cast("double"))
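To go one step further and encode categories as numbers (the "categorical to numerical" direction mentioned above), a minimal sketch could use category codes in pandas and StringIndexer in PySpark; the column names are illustrative:
# Python
data['column_name'] = data['column_name'].astype('category').cat.codes
# PySpark
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='column_name', outputCol='column_name_indexed')
data = indexer.fit(data).transform(data)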
Step 7: Normalization or Standardization
These techniques scale the values of different features to a similar range:
# Python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform expects a 2-D input, so pass a single-column DataFrame
data['column_name'] = scaler.fit_transform(data[['column_name']])
# PySpark
from pyspark.ml.feature import StandardScaler, VectorAssembler
# Spark's StandardScaler operates on a vector column, so assemble the feature first
assembler = VectorAssembler(inputCols=["column_name"], outputCol="column_vec")
data = assembler.transform(data)
scaler = StandardScaler(inputCol="column_vec", outputCol="scaled_column_name")
data = scaler.fit(data).transform(data)
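The example above covers standardization; for normalization (min-max scaling to the [0, 1] range) a similar sketch could look like this, reusing the assembled vector column from the PySpark snippet above:
# Python
from sklearn.preprocessing import MinMaxScaler
min_max = MinMaxScaler()
data['column_name'] = min_max.fit_transform(data[['column_name']])
# PySpark
from pyspark.ml.feature import MinMaxScaler
min_max = MinMaxScaler(inputCol='column_vec', outputCol='normalized_column_name')
data = min_max.fit(data).transform(data)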