Implementing End-to-End Machine Learning Pipelines Using Scikit-Learn and Python
Nasir Uddin Ahmed
Lecturer | Data Scientist | Artificial Intelligence | Data & Machine Learning Modeling Expert | Data Mining | Python | Power BI | SQL | ETL Processes | Dean’s List Award Recipient, Universiti Malaya.
Pipelines in Scikit-learn streamline the process of machine learning model development by chaining multiple steps, from preprocessing to model training, into one cohesive workflow. This modular approach not only enhances code readability and maintainability but also ensures that all transformations are correctly applied to both the training and testing datasets. By mastering pipelines, we can efficiently build and deploy robust machine-learning solutions.
In this article, we will delve into building machine learning pipelines with Scikit-learn and Python.
Pipelines in Scikit-learn allow us to sequentially apply a list of transforms and a final estimator. This means we can chain multiple processes, from data preprocessing to the final model application, into one streamlined workflow.
Example 1: Basic Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Chain a scaler and a classifier into a single estimator
steps = [('scaler', StandardScaler()), ('classifier', LogisticRegression())]
pipe = Pipeline(steps)

# Synthetic binary classification data, split into train and test sets
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fitting the pipeline fits the scaler on X_train, then the classifier on the scaled data
pipe.fit(X_train, y_train)

# predict() scales X_test with the already-fitted scaler before classifying
y_pred = pipe.predict(X_test)
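Because the test data passes through the same fitted scaler before reaching the classifier, we can evaluate the predictions directly. A minimal sketch of scoring the pipeline on the held-out split:

from sklearn.metrics import accuracy_score

# Compare the pipeline's predictions against the true test labels
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.3f}")

# Equivalently, score() runs the raw test data through the whole pipeline
print(f"Pipeline score: {pipe.score(X_test, y_test):.3f}")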
Visualizing Pipelines
Setting Scikit-learn's display option to 'diagram' renders any pipeline as an interactive HTML diagram when it is displayed in a notebook, which makes the chain of steps easy to inspect.
from sklearn import set_config

# Render estimators as HTML diagrams instead of plain text
set_config(display='diagram')
pipe  # in a notebook, this now displays the pipeline as a diagram
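If we want to keep the diagram outside of a notebook, Scikit-learn also exposes estimator_html_repr, which returns the same visualization as an HTML string. A minimal sketch (the output filename pipeline.html is just an example):

from sklearn.utils import estimator_html_repr

# Write the pipeline diagram to a standalone HTML file
with open('pipeline.html', 'w', encoding='utf-8') as f:
    f.write(estimator_html_repr(pipe))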
Example 2: Complex Pipeline with Dimensionality Reduction
Here the pipeline standardizes the features, projects them onto three principal components with PCA, and then fits a support vector classifier on the reduced data.
from sklearn.decomposition import PCA
from sklearn.svm import SVC

steps = [
    ('scaler', StandardScaler()),   # standardize the features
    ('pca', PCA(n_components=3)),   # reduce to 3 principal components
    ('classifier', SVC())           # fit an SVM on the reduced data
]
pipe2 = Pipeline(steps)

pipe2.fit(X_train, y_train)
y_pred2 = pipe2.predict(X_test)
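Each fitted step remains accessible through named_steps, which is useful for inspecting intermediate results. A minimal sketch of checking how much variance the three PCA components retain and scoring the full chain:

# Access the fitted PCA step by the name it was given in the steps list
pca_step = pipe2.named_steps['pca']
print("Explained variance ratio:", pca_step.explained_variance_ratio_)

# Evaluate the scaler -> PCA -> SVC chain on the test split
print("Test accuracy:", pipe2.score(X_test, y_test))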
Example 3: Column Transformer
A ColumnTransformer lets us apply different preprocessing to numeric and categorical columns and combine the results before the final estimator. Because the columns are selected by name, the input must be a pandas DataFrame, so this example fits on a small illustrative one rather than the NumPy arrays used above.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

numerical_features = ['num_feature1', 'num_feature2']
categorical_features = ['cat_feature1', 'cat_feature2']

# Numeric columns: impute missing values with the mean, then standardize
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Categorical columns: impute with a constant, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Route each column group to its own preprocessing sub-pipeline
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# Column-name selectors require a pandas DataFrame, so we fit on a tiny
# illustrative DataFrame with made-up values and missing entries
df = pd.DataFrame({
    'num_feature1': [1.0, 2.5, None, 4.2],
    'num_feature2': [0.3, None, 1.1, 2.2],
    'cat_feature1': ['a', 'b', 'a', None],
    'cat_feature2': ['x', 'y', None, 'x']
})
target = [0, 1, 0, 1]

final_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression())])
final_pipeline.fit(df, target)
final_predictions = final_pipeline.predict(df)
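After fitting, we can ask the preprocessor which columns it actually produced, including the expanded one-hot encoded categories. A minimal sketch, assuming a recent Scikit-learn release (roughly 1.1 or newer) where all of these transformers implement get_feature_names_out:

# Names of the transformed columns, prefixed by the transformer names
# (e.g. num__num_feature1, cat__cat_feature1_a, ...)
feature_names = final_pipeline.named_steps['preprocessor'].get_feature_names_out()
print(feature_names)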