What Are Data, Machine Learning, and MLOps Pipelines (ML4Devs Newsletter, Issue 14)
Image by Юрий Коврижных from Pixabay: https://pixabay.com/illustrations/conveyor-valve-pressure-gauge-5438440/

“Pipeline” is an overloaded term in data science and machine learning. People mean different things when they talk about data pipelines or ML pipelines. This issue covers the three most common kinds: data pipelines, machine learning pipelines, and MLOps pipelines.

Data Pipelines

Data pipelines are the oldest of the three and have been around for over 25 years. They come in two variants: Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT).

ETL pipelines transform the data before storing it. ELT pipelines clean the data, store it as close to raw form as feasible, and then run transformations (and store their results too). As the transformation step evolves, ELT pipelines can simply rerun it on the raw data.
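
To make the difference concrete, here is a minimal in-memory sketch of the two orderings. The record fields and the cleaning logic are made-up placeholders, not any particular warehouse or lake API:

```python
# A toy contrast between ETL and ELT orderings (all data is made up).
raw_source = [{"city": "Pune", "temp_f": "86"}, {"city": "Oslo", "temp_f": ""}]

def transform(records):
    # Clean and convert: drop rows missing a temperature, convert to Celsius.
    return [
        {"city": r["city"], "temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1)}
        for r in records
        if r["temp_f"] != ""
    ]

# ETL: transform first, persist only the curated result.
warehouse_etl = transform(raw_source)

# ELT: persist the (near-)raw data first, transform afterwards; when the
# transformation logic changes, just rerun it on the stored raw copy.
data_lake = list(raw_source)
warehouse_elt = transform(data_lake)

print(warehouse_etl)  # [{'city': 'Pune', 'temp_c': 30.0}]
```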

Data Pipelines automate the process of collecting, cleaning, and persisting data in a Data Warehouse or Data Lake. They have five stages:

  • Collect data from internal and external sources
  • Ingest data through batch jobs or streams
  • Store data (after cleaning) in a data warehouse or data lake
  • Compute analytics aggregations and ML features
  • Use it in BI dashboards, data science, ML, etc.

Batch Data Pipelines process data in batches at a configured frequency. Streaming Data Pipelines process data events in real time as they arrive. You can build data pipelines on-premises or in the cloud.
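
Here is a minimal batch sketch of the five stages wired together in plain Python. Every function name and record field below is a hypothetical stand-in for whatever tooling (Airflow jobs, Spark, warehouse SQL, etc.) actually runs these stages:

```python
def collect():
    # Stage 1: gather raw records from internal/external sources (toy data).
    return [{"user": "a", "amount": 12.0}, {"user": "b", "amount": 7.5},
            {"user": "a", "amount": 3.0}]

def ingest(records):
    # Stage 2: a batch job reads files/APIs on a schedule; a streaming job
    # would consume events as they arrive. Here we just pass the batch through.
    return records

def store(records):
    # Stage 3: persist cleaned records in a warehouse or lake (a list here).
    return [r for r in records if r["amount"] > 0]

def compute(records):
    # Stage 4: aggregate analytics / ML features, e.g. spend per user.
    features = {}
    for r in records:
        features[r["user"]] = features.get(r["user"], 0.0) + r["amount"]
    return features

def use(features):
    # Stage 5: feed dashboards, data science, or ML training.
    print(features)  # {'a': 15.0, 'b': 7.5}

use(compute(store(ingest(collect()))))
```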

Figure: Stages in a big data pipeline

Machine Learning Pipelines

Machine Learning has two phases:

  • Training: Run experiments to train several machine learning models, tune hyperparameters, and select the best model. It has load-transform-fit steps.
  • Inference: Predict the outcome using the model on new input data. It has load-transform-predict steps.

Each of these steps can be coded separately, and data scientists often do that, especially when working in Jupyter notebooks. That causes two problems:

  • Code Comprehension: It is difficult to understand the code when the transformation logic is scattered all over the place.
  • Train-Infer Bugs: When the data transformations for training and inference are scattered and duplicated, it is quite possible that one is modified while the other is not. That can cause silent bugs, as the sketch below shows.
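
As an illustration, here is a toy sketch (scikit-learn, made-up data) of how such a bug creeps in when the training and inference code each carry their own copy of the transform:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0, 0, 1, 1])

# Training script: standardize with training-set statistics, then fit.
mu, sigma = X_train.mean(), X_train.std()
model = LogisticRegression().fit((X_train - mu) / sigma, y_train)

# Inference script (written separately): the author re-implements the
# transform but scales with the *new batch's* statistics instead of the
# saved training ones -- predictions are silently wrong.
X_new = np.array([[10.0], [12.0]])
X_bad = (X_new - X_new.mean()) / X_new.std()   # the duplicated, drifted copy
X_good = (X_new - mu) / sigma                  # what it should have been

print(model.predict(X_bad), model.predict(X_good))
```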

Almost all major Machine Learning frameworks offer a way to define a chain of data transformations to perform feature extraction and selection, which is then fed to the fit or predict step:
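
One concrete example, shown here as a sketch with placeholder data, is scikit-learn's Pipeline, which chains transformers with a final estimator so that the exact same transforms run inside both fit and predict:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("scale", StandardScaler()),          # load-transform...
    ("model", LogisticRegression()),      # ...fit / predict
])
pipe.fit(X_train, y_train)                # transforms are fit on training data

X_new = np.array([[10.0], [12.0]])
print(pipe.predict(X_new))                # same transforms reused at inference
```

Because the transforms and the model live in one object, the pipeline fitted during training can be serialized and reused as-is at inference time, removing the duplication shown in the previous sketch.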

These ML pipelines are sometimes called Data Input pipelines.

MLOps Pipelines

MLOps pipelines orchestrate machine learning workflows in the MLOps Lifecycle for continuous integration, deployment, and training (CI/CD/CT) of ML models.

Google has suggested a reference MLOps pipeline for this CI/CD/CT workflow.

All major model deployment infrastructures and frameworks offer MLOps pipelines.
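
As a rough, framework-agnostic illustration (not any specific product's API), a continuous-training pipeline boils down to a gated chain of steps. All step names, the toy "model", and the quality threshold below are assumptions made purely for this sketch:

```python
# Each step is a plain function here; a real orchestrator (Kubeflow,
# Airflow, a cloud ML pipeline service, etc.) would run them as a DAG
# with retries, scheduling, and artifact tracking.

def validate_data(dataset):
    # Fail fast if the incoming data is malformed.
    return all("label" in row for row in dataset)

def train(dataset):
    # Placeholder "model": predict the majority label.
    labels = [row["label"] for row in dataset]
    return max(set(labels), key=labels.count)

def evaluate(model, dataset):
    hits = sum(1 for row in dataset if row["label"] == model)
    return hits / len(dataset)

def deploy(model):
    print(f"deploying model that always predicts: {model!r}")

def ct_pipeline(dataset, min_accuracy=0.6):
    # Continuous training: retrain on fresh data, gate deployment on quality.
    if not validate_data(dataset):
        raise ValueError("data validation failed")
    model = train(dataset)
    if evaluate(model, dataset) >= min_accuracy:
        deploy(model)

ct_pipeline([{"label": 1}, {"label": 1}, {"label": 0}])
```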

Fuzzy Boundaries

The boundaries between the three aren’t cast in stone. The parts of the ML pipeline that do feature extraction are often moved into the Data Pipeline. Data standardization, normalization, encoding, etc., are usually done close to model training, and so remain in the ML pipeline. Hyperparameter selection (e.g., grid search), however, is better suited to the MLOps pipeline.
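
For instance, with scikit-learn a grid search wraps around the ML pipeline from the outside, which is the layer an MLOps pipeline would typically drive; the toy data and parameter grid below are placeholders:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# The ML pipeline: transforms plus the model.
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])

# Hyperparameter search sits outside the pipeline and drives it.
search = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```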

The goal of this issue is to give an overview of the landscape and options. You choose what is most suitable for your use case.

ML4Devs is a biweekly newsletter for software developers. The aim is to curate resources for practitioners to design, develop, deploy, and maintain ML applications at scale to drive measurable positive business impact. Each issue discusses a topic from a developer’s viewpoint.

Enjoyed this? Originally published at ML4Devs.com. Don't miss the next issue. Join 1.3K+ subscribers and get it in your email.
