Implementing End-to-End Machine Learning Pipelines Using Scikit-Learn and Python
Nasir Uddin Ahmed
Lecturer | Data Scientist | Artificial Intelligence | Data & Machine Learning Modeling Expert | Data Mining | Python | Power BI | SQL | ETL Processes | Dean’s List Award Recipient, Universiti Malaya.
Pipelines in Scikit-learn streamline the process of machine learning model development by chaining multiple steps, from preprocessing to model training, into one cohesive workflow. This modular approach not only enhances code readability and maintainability but also ensures that all transformations are correctly applied to both the training and testing datasets. By mastering pipelines, we can efficiently build and deploy robust machine-learning solutions.
In this article, we will delve into building machine learning pipelines with Scikit-learn and Python.
Pipelines in Scikit-learn allow us to sequentially apply a list of transforms and a final estimator. This means we can chain multiple processes, from data preprocessing to the final model application, into one streamlined workflow.
Example 1: Basic Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Chain a scaler and a classifier into a single estimator
steps = [('scaler', StandardScaler()), ('classifier', LogisticRegression())]
pipe = Pipeline(steps)

# Synthetic binary classification data, split into train and test sets
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fitting the pipeline fits the scaler on X_train, then the classifier on the scaled data
pipe.fit(X_train, y_train)

# predict() scales X_test with the already-fitted scaler before classifying
y_pred = pipe.predict(X_test)
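Because the test data passes through the same fitted scaler before reaching the classifier, we can evaluate the predictions directly. A minimal sketch of scoring the pipeline on the held-out split:

from sklearn.metrics import accuracy_score

# Compare the pipeline's predictions against the true test labels
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")

# Equivalently, score() runs the raw test data through the whole pipeline
print(f"Pipeline score: {pipe.score(X_test, y_test):.3f}")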
Visualizing Pipelines
Setting Scikit-learn's display option to 'diagram' renders any pipeline as an interactive HTML diagram when it is displayed in a notebook, which makes the chain of steps easy to inspect.
from sklearn import set_config

# Render estimators as HTML diagrams instead of plain text
set_config(display='diagram')
pipe  # in a notebook, this now displays the pipeline as a diagram
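If we want to keep the diagram outside of a notebook, Scikit-learn also exposes estimator_html_repr, which returns the same visualization as an HTML string. A minimal sketch (the output filename pipeline.html is just an example):

from sklearn.utils import estimator_html_repr

# Write the pipeline diagram to a standalone HTML file
with open('pipeline.html', 'w', encoding='utf-8') as f:
    f.write(estimator_html_repr(pipe))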
Example 2: Complex Pipeline with Dimensionality Reduction
Here the pipeline standardizes the features, projects them onto three principal components with PCA, and then fits a support vector classifier on the reduced data.
from sklearn.decomposition import PCA
from sklearn.svm import SVC

steps = [
    ('scaler', StandardScaler()),   # standardize the features
    ('pca', PCA(n_components=3)),   # reduce to 3 principal components
    ('classifier', SVC())           # fit an SVM on the reduced data
]
pipe2 = Pipeline(steps)

pipe2.fit(X_train, y_train)
y_pred2 = pipe2.predict(X_test)
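Each fitted step remains accessible through named_steps, which is useful for inspecting intermediate results. A minimal sketch of checking how much variance the three PCA components retain and scoring the full chain:

# Access the fitted PCA step by the name it was given in the steps list
pca_step = pipe2.named_steps['pca']
print("Explained variance ratio:", pca_step.explained_variance_ratio_)

# Evaluate the scaler -> PCA -> SVC chain on the test split
print("Test accuracy:", pipe2.score(X_test, y_test))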
Example 3: Column Transformer
A ColumnTransformer lets us apply different preprocessing to numeric and categorical columns and combine the results before the final estimator. Because the columns are selected by name, the input must be a pandas DataFrame, so this example fits on a small illustrative one rather than the NumPy arrays used above.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

numerical_features = ['num_feature1', 'num_feature2']
categorical_features = ['cat_feature1', 'cat_feature2']

# Numeric columns: impute missing values with the mean, then standardize
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical columns: impute with a constant, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Route each column group to its own preprocessing sub-pipeline
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# Column-name selectors require a pandas DataFrame, so we fit on a tiny
# illustrative DataFrame with made-up values and missing entries
df = pd.DataFrame({
    'num_feature1': [1.0, 2.5, None, 4.2],
    'num_feature2': [0.3, None, 1.1, 2.2],
    'cat_feature1': ['a', 'b', 'a', None],
    'cat_feature2': ['x', 'y', None, 'x']
})
target = [0, 1, 0, 1]

final_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression())])
final_pipeline.fit(df, target)
final_predictions = final_pipeline.predict(df)
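After fitting, we can ask the preprocessor which columns it actually produced, including the expanded one-hot encoded categories. A minimal sketch, assuming a recent Scikit-learn release (roughly 1.1 or newer) where all of these transformers implement get_feature_names_out:

# Names of the transformed columns, prefixed by the transformer names
# (e.g. num__num_feature1, cat__cat_feature1_a, ...)
feature_names = final_pipeline.named_steps['preprocessor'].get_feature_names_out()
print(feature_names)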