Data Transformations in Machine Learning: A Deep Dive with the California Housing Dataset
Aashish Singh
Lead ML Engineer @ Orange Business | MBA | Generative AI Innovator | Tech Blogger | Helping Community Grow | Certified ML Developer | AI Solutions Pioneer
Data transformation is a critical step in the machine learning pipeline, ensuring that data is in the optimal format for modeling. In this article, we will explore the importance of data transformations, different types of transformations, and how they can be applied to the California Housing dataset using Python. We will also include visualizations to illustrate the effects of these transformations.
Why Data Transformation is Important in Machine Learning
Data transformation is vital in machine learning for several reasons:
1. Consistent scales: Many algorithms (e.g., k-nearest neighbors, SVMs, neural networks) are sensitive to feature magnitudes, so unscaled features can dominate the result.
2. Faster convergence: Gradient-based optimizers train more reliably on centered, comparably scaled inputs.
3. Distributional assumptions: Skewed features and outliers can violate model assumptions; power and quantile transforms mitigate this.
4. Algorithm compatibility: Categorical variables must be encoded numerically before most algorithms can use them.
Loading and Understanding the California Housing Dataset
The California Housing dataset includes features such as median income, median house value, and other socio-economic indicators for various districts in California. This dataset is widely used for regression tasks.
from sklearn.datasets import fetch_california_housing
import pandas as pd
# Load dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['target'] = housing.target
# Display the first few rows
print(df.head())
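Before transforming anything, it helps to look at the raw feature ranges; a quick exploratory check like the one below shows why scaling matters for these columns.
# MedInc and AveOccup sit on very different scales and are right-skewed,
# which is what the transforms below are designed to address.
print(df[['MedInc', 'AveOccup']].describe())
print(df[['MedInc', 'AveOccup']].skew())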
1. Numeric Data Type Transformations
Numeric data types include integers and floating-point values, which often require scaling or normalization to enhance model performance.
1.1 Normalization Transform
Purpose: Rescale each feature to a fixed range, typically [0, 1], using x' = (x − min) / (max − min).
When to Use: When features are on different scales and the algorithm is sensitive to magnitudes (e.g., k-nearest neighbors, neural networks).
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
# Normalize data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(df[['MedInc', 'AveOccup']])
# Visualizing the original and normalized data for 'MedInc'
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(df['MedInc'], bins=30, color='blue', alpha=0.7)
plt.title('Original Median Income')
plt.subplot(1, 2, 2)
plt.hist(normalized_data[:, 0], bins=30, color='green', alpha=0.7)
plt.title('Normalized Median Income')
plt.show()
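As a sanity check, MinMaxScaler applies x' = (x − min) / (max − min) per column; the short sketch below reproduces its output for 'MedInc' by hand, using the df and normalized_data objects from above.
import numpy as np
# Reproduce the scaler's output for 'MedInc' manually and compare.
medinc = df['MedInc'].to_numpy()
manual = (medinc - medinc.min()) / (medinc.max() - medinc.min())
print(np.allclose(manual, normalized_data[:, 0]))  # expected: True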
1.2 Standardization Transform
Purpose: Center each feature at zero mean and unit variance using z = (x − μ) / σ.
When to Use: When the algorithm assumes comparably scaled, roughly Gaussian inputs (e.g., linear models, SVMs, PCA).
from sklearn.preprocessing import StandardScaler
# Standardize data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(df[['MedInc', 'AveOccup']])
# Visualizing the original and standardized data for 'MedInc'
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(df['MedInc'], bins=30, color='blue', alpha=0.7)
plt.title('Original Median Income')
plt.subplot(1, 2, 2)
plt.hist(standardized_data[:, 0], bins=30, color='red', alpha=0.7)
plt.title('Standardized Median Income')
plt.show()
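A quick check on the output (using the standardized_data array from above) confirms the zero-mean, unit-variance property, up to floating-point error.
# Means should be ~0 and standard deviations ~1 for both columns.
print(standardized_data.mean(axis=0).round(6))
print(standardized_data.std(axis=0).round(6))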
1.3 Power Transform
Purpose: Apply a monotonic power function (Yeo-Johnson by default, or Box-Cox for strictly positive data) to make skewed distributions more Gaussian-like.
When to Use: When features are heavily skewed and the model benefits from more symmetric distributions.
from sklearn.preprocessing import PowerTransformer
# Power transform
pt = PowerTransformer()
power_data = pt.fit_transform(df[['MedInc', 'AveOccup']])
# Visualizing the original and power transformed data for 'MedInc'
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(df['MedInc'], bins=30, color='blue', alpha=0.7)
plt.title('Original Median Income')
plt.subplot(1, 2, 2)
plt.hist(power_data[:, 0], bins=30, color='purple', alpha=0.7)
plt.title('Power Transformed Median Income')
plt.show()
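PowerTransformer uses the Yeo-Johnson method by default and exposes the fitted λ per column; Box-Cox is an alternative that requires strictly positive inputs, which both of these columns satisfy. A short sketch using the pt object fitted above:
# Fitted Yeo-Johnson lambdas, one per column ('MedInc', 'AveOccup')
print(pt.lambdas_)
# Box-Cox variant: applicable here because both columns are strictly positive
pt_bc = PowerTransformer(method='box-cox')
boxcox_data = pt_bc.fit_transform(df[['MedInc', 'AveOccup']])
print(pt_bc.lambdas_)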
1.4 Polynomial Transformation
Purpose: Polynomial Transformation is used to create new features by taking existing features and raising them to a specified power (e.g., squared, cubed). This process allows machine learning models to capture more complex, non-linear relationships between features that might not be apparent when using the original features alone.
By expanding the feature set to include polynomial terms (e.g., x², xy), polynomial transformations can enhance the model’s ability to fit the data more closely, particularly when the relationship between the features and the target variable is non-linear. This is especially useful in linear models where the original features do not fully capture the underlying patterns in the data.
When to Use: When a linear model underfits because the feature-target relationship is non-linear, or when interactions between features (e.g., MedInc × AveRooms) are likely to matter.
from sklearn.preprocessing import PolynomialFeatures
# Polynomial transformation (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_data = poly.fit_transform(df[['MedInc', 'AveRooms']])
# Visualizing the original features next to two of the generated polynomial terms.
# With degree=2 and include_bias=False, the output columns are:
# [MedInc, AveRooms, MedInc^2, MedInc*AveRooms, AveRooms^2]
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(df['MedInc'], df['AveRooms'], c=df['target'], cmap='viridis', alpha=0.5)
plt.title('Original: Median Income vs Average Rooms')
plt.xlabel('Median Income')
plt.ylabel('Average Rooms')
plt.subplot(1, 2, 2)
plt.scatter(poly_data[:, 2], poly_data[:, 3], c=df['target'], cmap='plasma', alpha=0.5)
plt.title('Transformed: MedInc^2 vs MedInc*AveRooms')
plt.xlabel('MedInc^2')
plt.ylabel('MedInc*AveRooms')
plt.show()
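To see exactly which terms were generated, the fitted transformer can list its output feature names (get_feature_names_out is available in scikit-learn 1.0 and later).
# List the generated terms: originals, squares, and the interaction.
print(poly.get_feature_names_out(['MedInc', 'AveRooms']))
# ['MedInc' 'AveRooms' 'MedInc^2' 'MedInc AveRooms' 'AveRooms^2']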
1.5 Quantile Transform
Purpose: Map each feature onto a target distribution (uniform or normal) through its empirical quantiles.
When to Use: When you need a robust, outlier-insensitive way to force features into a known distribution.
from sklearn.preprocessing import QuantileTransformer
# Quantile transform
qt = QuantileTransformer(output_distribution='normal')
quantile_data = qt.fit_transform(df[['MedInc', 'AveOccup']])
# Visualizing the original and quantile transformed data for 'MedInc'
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(df['MedInc'], bins=30, color='blue', alpha=0.7)
plt.title('Original Median Income')
plt.subplot(1, 2, 2)
plt.hist(quantile_data[:, 0], bins=30, color='orange', alpha=0.7)
plt.title('Quantile Transformed Median Income')
plt.show()
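Since output_distribution='normal' maps each feature's empirical CDF onto a Gaussian, the result should have near-zero skew; below is a rough check (using scipy, which ships alongside scikit-learn), plus the uniform variant for comparison.
from scipy.stats import skew
# Near-zero skew suggests the mapped 'MedInc' column is close to Gaussian
print(skew(quantile_data[:, 0]))
# A uniform mapping is also available when a bounded [0, 1] output is preferred
qt_uniform = QuantileTransformer(output_distribution='uniform')
uniform_data = qt_uniform.fit_transform(df[['MedInc', 'AveOccup']])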
2. Categorical Data Type Transformations
Categorical data often needs to be encoded in a way that machine learning algorithms can process, such as converting categories into numbers.
2.1 Ordinal Transform
Purpose: Map categories that have a natural order (e.g., Low < Medium < High) to integers that preserve that order.
When to Use: When the categorical variable is ordinal, i.e., its categories can be meaningfully ranked.
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import OrdinalEncoder
import matplotlib.pyplot as plt
# Load the dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
# Create a categorical variable based on 'MedInc'
df['Income_Category'] = pd.cut(df['MedInc'], bins=[0, 2, 5, np.inf], labels=['Low', 'Medium', 'High'])
# Ordinal Encoding
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df['Income_Category_Ordinal'] = encoder.fit_transform(df[['Income_Category']])
# Visualization: show the category -> ordinal mapping (one bar per category)
mapping = (df[['Income_Category', 'Income_Category_Ordinal']]
           .drop_duplicates()
           .sort_values('Income_Category_Ordinal'))
plt.bar(mapping['Income_Category'].astype(str), mapping['Income_Category_Ordinal'])
plt.title('Ordinal Transform of Income Category')
plt.xlabel('Income Category')
plt.ylabel('Ordinal Value')
plt.show()
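The learned category order can be verified directly on the fitted encoder:
# The encoder maps categories to 0, 1, 2 in the order given at construction
print(encoder.categories_)
print(encoder.inverse_transform([[0], [1], [2]]))  # recovers Low, Medium, High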
2.2 One Hot Transform
Purpose: Convert categorical variables into a series of binary variables.
When to Use: When there is no intrinsic order in the categorical variables.
from sklearn.preprocessing import OneHotEncoder
# One Hot Encoding (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_data = onehot_encoder.fit_transform(df[['Income_Category']])
# Visualization (first 100 rows, so the individual cells remain visible)
plt.imshow(onehot_data[:100], cmap='viridis', aspect='auto')
plt.title('One Hot Transform of Income Category')
plt.xlabel('Encoded Category Column')
plt.ylabel('Sample Index')
plt.colorbar(label='Binary Encoding')
plt.show()
Visualization Explanation: Each row is one sample and each column one of the three income categories; exactly one cell per row equals 1 (the sample’s category) and the others are 0.
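The category behind each encoded column can be read off the fitted encoder (scikit-learn 1.0 and later):
# Which category each encoded column represents
print(onehot_encoder.get_feature_names_out(['Income_Category']))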
2.3 Discretization Transform
Purpose: Convert continuous numeric data into discrete bins, effectively converting it into ordinal data.
When to Use: When you want to segment continuous data into categories.
from sklearn.preprocessing import KBinsDiscretizer
# Discretization
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
discretized_data = discretizer.fit_transform(df[['MedInc']])
# Visualization: plot income against its assigned bin to show the step structure
plt.scatter(df['MedInc'], discretized_data[:, 0], s=5, alpha=0.5)
plt.title('Discretization Transform of Median Income')
plt.xlabel('Median Income')
plt.ylabel('Bin Index (0, 1, 2)')
plt.show()
Visualization Explanation: The step pattern shows the continuous income range being cut into three equal-width bins (strategy='uniform'); every income value inside a given interval maps to the same bin index.
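The exact cut points chosen by the discretizer can be inspected on the fitted object:
# Equal-width bin edges learned for 'MedInc' (three bins => four edges)
print(discretizer.bin_edges_)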
Conclusion
Data transformations, for both numeric and categorical data, are essential steps in preparing data for machine learning models. Properly scaling numeric features and encoding categorical variables into ordinal, one-hot, or discretized formats can significantly influence the performance and interpretability of the models. The examples above demonstrate practical implementations using the California Housing dataset, providing a clear understanding of when and how to apply these transformations effectively.
These visualizations illustrate the impact of the different transformations, making it easier to decide which one to use based on the data’s characteristics and the model’s requirements.
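In practice, these steps are usually bundled into a single object that is fit on the training data and reused at prediction time. Below is a minimal sketch of one reasonable arrangement, assuming the df, housing, and Income_Category objects created earlier and scikit-learn 1.2 or later:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
# Scale the numeric columns and one-hot encode the derived category in one step
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['MedInc', 'AveOccup', 'AveRooms']),
    ('cat', OneHotEncoder(sparse_output=False), ['Income_Category']),
])
# Chain the preprocessing with a simple regressor
model = Pipeline([('prep', preprocess), ('reg', LinearRegression())])
model.fit(df[['MedInc', 'AveOccup', 'AveRooms', 'Income_Category']], housing.target)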
Stay Tuned!!
Thank you for reading!
If you’d like, add me on LinkedIn!