Machine Learning Pipelines
Machine learning pipelines consist of multiple sequential steps, from data extraction and preprocessing through model training and deployment. A pipeline encapsulates the learned best practices of producing a machine learning model for the organization's use case and allows the team to execute at scale.
Overview of Ploomber
Ploomber is a framework for faster data pipelines; it integrates with Jupyter but you can use it with any other editor.
With Ploomber, you can develop maintainable, collaborative, and production-ready pipelines from day one.
Ploomber helps you build modular pipelines. A pipeline (or DAG) is a group of tasks with a particular execution order, where subsequent (or downstream) tasks use the outputs of previous (or upstream) tasks as inputs.
A short video (~6 min) will help you understand Ploomber.
Ploomber supports three types of tasks:
- Python functions (also known as callables)
- Python scripts/notebooks (and their R equivalents)
- SQL scripts
Ploomber allows you to quickly turn a collection of scripts, notebooks, or functions into a data pipeline by following three conventions:
- Each task is a function, script, or notebook.
- Tasks declare their dependencies using an `upstream` variable.
- Tasks declare their outputs using a `product` variable.
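Under these conventions, a minimal root task might look like the sketch below. This is illustrative only: the column names and toy data are made up, and the `product` default shown here is a placeholder that Ploomber overwrites with the values declared in `pipeline.yaml` when the pipeline runs.

```python
# 1-get.py -- hypothetical sketch of a root task
import csv
from pathlib import Path

# %% tags=["parameters"]
upstream = None  # root task: no dependencies
# placeholder; at runtime Ploomber injects the product declared in pipeline.yaml
product = {"nb": "output/1-get.ipynb", "data": "output/data.csv"}

# %%
# "extract" some toy data and save it as this task's product
Path(product["data"]).parent.mkdir(parents=True, exist_ok=True)
with open(product["data"], "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    writer.writerows([[1, 10], [2, 20]])
```

The `# %%` markers delimit cells, which lets Ploomber open and execute the script as a notebook.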
Configuring the Pipeline
This example pipeline contains three tasks, `1-get.py`, `2-clean.py`, and `3-plot.py`, which we declare in a `pipeline.yaml` file:
```yaml
# Content of pipeline.yaml
tasks:
  # source is the code you want to execute (.ipynb also supported)
  - source: 1-get.py
    # products are the task's outputs
    product:
      # scripts generate executed notebooks as outputs
      nb: output/1-get.ipynb
      # you can define as many outputs as you want
      data: output/data.csv
  - source: 2-clean.py
    product:
      nb: output/2-clean.ipynb
      data: output/clean.csv
  - source: 3-plot.py
    product:
      nb: output/3-plot.ipynb
```
Refer to the API Reference for the full `pipeline.yaml` specification.
Ploomber supports Python scripts, Python functions, Jupyter notebooks, R scripts, and SQL scripts.
Ploomber understands the pipeline structure from your code; the numeric prefixes in the file names are only for readability and play no role in the execution order. Instead, each task declares which tasks must run before it via an `upstream` variable.
For example, to clean the data, we must get it first; hence, we declare the following in `2-clean.py`:
```python
# 2-clean.py
# this tells Ploomber to execute the '1-get' task before '2-clean'
upstream = ['1-get']
```
These upstream and downstream relationships are what link individual tasks into the chain of steps that makes up the data pipeline.
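Putting the conventions together, `2-clean.py` could look like the sketch below. The cleaning logic and column names are invented for illustration, and the dictionaries assigned to `upstream` and `product` mimic the values Ploomber injects at runtime (under Ploomber you would only write `upstream = ['1-get']` and let the framework supply the rest):

```python
# 2-clean.py -- hypothetical sketch of the cleaning step
import csv
from pathlib import Path

# simulated injected values (normally you declare upstream = ['1-get'] and
# Ploomber fills in these dicts from pipeline.yaml)
upstream = {"1-get": {"data": "output/data.csv"}}
product = {"data": "output/clean.csv"}

# stand-in for 1-get.py's output, so this sketch runs standalone
Path("output").mkdir(exist_ok=True)
Path(upstream["1-get"]["data"]).write_text("id,value\n1,42\n2,\n")

# cleaning step: drop rows with missing values
with open(upstream["1-get"]["data"], newline="") as f:
    rows = [r for r in csv.DictReader(f) if all(r.values())]

with open(product["data"], "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "value"])
    writer.writeheader()
    writer.writerows(rows)
```

Note how the task never hard-codes its input path: it reads it from `upstream`, so the pipeline layout stays defined in one place (`pipeline.yaml`).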
To run the pipeline, execute `ploomber build`; on subsequent runs, Ploomber only re-executes the tasks whose source code has changed. To run a single task in debug mode, use `ploomber task <task-name> --debug`.
When executing your pipeline, Ploomber injects a new cell into each script/notebook, with new `product` and `upstream` variables that replace the original ones by extracting information from the `pipeline.yaml`.
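For instance, given the example `pipeline.yaml`, the cell injected into `2-clean.py` would look roughly like this (the exact cell tag and formatting vary by Ploomber version):

```python
# %% tags=["injected-parameters"]
# values extracted from pipeline.yaml by Ploomber
upstream = {"1-get": {"nb": "output/1-get.ipynb", "data": "output/data.csv"}}
product = {"nb": "output/2-clean.ipynb", "data": "output/clean.csv"}
```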
To visualize the pipeline as a diagram, run `ploomber plot`.
Debugging
You can debug the pipeline by starting an interactive session with `ploomber interact`, or by enabling verbose logging during the build:

```shell
ploomber build --log info --log-file my.log
```
Where To Go Next
- Tutorial: Developing Maintainable Data Pipelines
- Basic concepts
- Database client configuration
- SQL pipelines
- pipeline.yaml with task grids
- Writing clean notebooks for Ploomber pipelines
- Packaging and distributing pipelines
- Continuous integration for data science with Ploomber

Further reference:
- High-level tutorials covering the basics of Ploomber.
- High-level descriptions of what you can build with Ploomber.
- In-depth tutorials for developing Ploomber pipelines.
- In-depth tutorials for deployment.
- Quick reference for common patterns.
- General information about our community and the project.
Other Pipeline frameworks & libraries
- ActionChain - A workflow system for simple linear success/failure workflows.
- Adage - Small package to describe workflows that are not completely known at definition time.
- AiiDA - Workflow manager with a strong focus on provenance, performance, and extensibility.
- Airflow - Python-based workflow system created by Airbnb.
- Anduril - Component-based workflow framework for scientific data analysis.
- Antha - High-level language for biology.
- AWE - Workflow and resource management system with CWL support.
- Balsam - Python-based high-throughput task and workflow engine.
- Bds - Scripting language for data pipelines.
- BioMake - GNU-Make-like utility for managing builds and complex workflows.
- BioQueue - Explicit framework with web monitoring and resource estimation.
- Bioshake - Haskell DSL built on shake with strong typing and EDAM support.
- Bistro - Library to build and execute typed scientific workflows.
- Bpipe - Tool for running and managing bioinformatics pipelines.
- Briefly - Python meta-programming library for job flow control.
- Cluster Flow - Command-line tool which uses common cluster managers to run bioinformatics pipelines.
- Clusterjob - Automated reproducibility and hassle-free submission of computational jobs to clusters.
- Compi - Application framework for portable computational pipelines.
- Compss - Programming model for distributed infrastructures.
- Conan2 - Light-weight workflow management application.
- Consecution - A Python pipeline abstraction inspired by Apache Storm topologies.
- Cosmos - Python library for massively parallel workflows.
- Couler - Unified interface for constructing and managing workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.
- Cromwell - Workflow management system geared towards scientific workflows, from the Broad Institute.
- Cuneiform - Advanced functional workflow language and framework, implemented in Erlang.
- Cylc - A workflow engine for cycling systems, originally developed for operational environmental forecasting.
- Dagobah - Simple DAG-based job scheduler in Python.
- Dagr - A Scala-based DSL and framework for writing and executing bioinformatics pipelines as directed acyclic graphs.
- Dagster - Python-based API for defining DAGs that interfaces with popular workflow managers for building data applications.
- DataJoint - An open-source relational framework for scientific data pipelines.
- Dask - Flexible parallel computing library for analytics.
- Dbt - Framework for writing analytics workflows entirely in SQL; the T part of ETL, focused on analytics engineering.
- Dockerflow - Workflow runner that uses Dataflow to run a series of tasks in Docker.
- Doit - Task management & automation tool.
- Drake - Robust DSL akin to Make, implemented in Clojure.
- Drake R package - Reproducibility and high-performance computing with an easy R-focused interface. Unrelated to Factual's Drake. Succeeded by Targets.
- Dray - An engine for managing the execution of container-based workflows.
- eHive - System for creating and running pipelines on a distributed compute resource.
- Fission Workflows - A fast, lightweight workflow engine for serverless/FaaS functions.
- Flex - Language-agnostic framework for building flexible data science pipelines (Python/Shell/Gnuplot).
- Flowr - Robust and efficient workflows using a simple language-agnostic approach (R package).
- Gc3pie - Python libraries and tools for running applications on diverse Grids and clusters.
- Guix Workflow Language - A workflow management language extension for GNU Guix.
- Gwf - Make-like utility for submitting workflows via qsub.
- HyperLoom - Platform for defining and executing workflow pipelines in large-scale distributed environments.
- Joblib - Set of tools to provide lightweight pipelining in Python.
- Jug - A task-based parallelization framework for Python.
- Kedro - Workflow development tool that helps you build data pipelines.
- Ketrew - Embedded DSL in the OCaml language alongside a client-server management application.
- Kronos - Workflow assembler for cancer genome analytics and informatics.
- Loom - Tool for running bioinformatics workflows locally or in the cloud.
- Longbow - Job proxying tool for biomolecular simulations.
- Luigi - Python module that helps you build complex pipelines of batch jobs.
- Maestro - YAML-based HPC workflow execution tool.
- Makeflow - Workflow engine for executing large complex workflows on clusters.
- Mara - A lightweight, opinionated ETL framework, halfway between plain scripts and Apache Airflow.
- Mario - Scala library for defining data pipelines.
- Martian - A language and framework for developing and executing complex computational pipelines.
- MD Studio - Microservice-based workflow engine.
- MetaFlow - Open-sourced framework from Netflix for DAG generation for data scientists; Python and R APIs.
- Mistral - Python-based workflow engine by the OpenStack project.
- Moa - Lightweight workflows in bioinformatics.
- Nextflow - Flow-based computational toolkit for reproducible and scalable bioinformatics pipelines.
- NiPype - Workflows and interfaces for neuroimaging packages.
- OpenGE - Accelerated framework for manipulating and interpreting high-throughput sequencing data.
- Pachyderm - Distributed and reproducible data pipelining and data management, built on the container ecosystem.
- Parsl - Parallel scripting library.
- PipEngine - Ruby-based launcher for complex biological pipelines.
- Pinball - Python-based workflow engine by Pinterest.
- Popper - YAML-based container-native workflow engine supporting Docker, Singularity, Vagrant VMs with Docker daemon in VM, and local host.
- Porcupine - Haskell workflow tool to express and compose tasks (optionally cached) whose data sources and sinks are known ahead of time and rebindable, and which can expose arbitrary sets of parameters to the outside world.
- Prefect Core - Python-based workflow engine powering Prefect.
- Pydra - Lightweight, DAG-based Python dataflow engine for reproducible and scalable scientific pipelines.
- PyFlow - Lightweight parallel task engine.
- PypeFlow - Lightweight workflow engine for data analysis scripting.
- pyperator - Simple push-based Python workflow framework using asyncio, supporting recursive networks.
- pyppl - A Python lightweight pipeline framework.
- pypyr - Automation task-runner for sequential steps defined in a pipeline yaml, with AWS and Slack plug-ins.
- Pwrake - Parallel workflow extension for Rake.
- Qdo - Lightweight high-throughput queuing system for workflows with many small tasks to perform.
- Qsubsec - Simple tokenized template system for SGE.
- Rabix - Python-based workflow toolkit based on the Common Workflow Language and Docker.
- Rain - Framework for large distributed task-based pipelines, written in Rust with a Python API.
- Ray - Flexible, high-performance distributed Python execution framework.
- Redun - Yet another redundant workflow engine.
- Reflow - Language and runtime for distributed, incremental data processing in the cloud.
- Remake - Make-like declarative workflows in R.
- Rmake - Wrapper for the creation of Makefiles, enabling massive parallelization.
- Rubra - Pipeline system for bioinformatics workflows.
- Ruffus - Computation pipeline library for Python.
- Ruigi - Pipeline tool for R, inspired by Luigi.
- Sake - Self-documenting build automation tool.
- SciLuigi - Helper library for writing flexible scientific workflows in Luigi.
- SciPipe - Library for writing scientific workflows in Go.
- Scoop - Scalable Concurrent Operations in Python.
- Seqtools - Python library for lazy evaluation of pipelined transformations on indexable containers.
- Snakemake - Tool for running and managing bioinformatics pipelines.
- Spiff - Based on the Workflow Patterns initiative and implemented in Python.
- Stolos - Directed acyclic graph task dependency scheduler that simplifies distributed pipelines.
- Steppy - Lightweight, open-source Python 3 library for fast and reproducible experimentation.
- Stpipe - File processing pipelines as a Python library.
- StreamFlow - Container-native workflow management system focused on hybrid workflows.
- Sundial - Job system on AWS ECS or AWS Batch managing dependencies and scheduling.
- Suro - Java-based distributed pipeline from Netflix.
- Swift - Fast easy parallel scripting on multicores, clusters, clouds, and supercomputers.
- Targets - Dynamic, function-oriented Make-like reproducible pipelines at scale in R.
- TaskGraph - A library to help manage complicated computational software pipelines consisting of long-running individual tasks.
- Temporal
- Tibanna - Tool that helps you run genomic pipelines on Amazon cloud.
- Toil - Distributed pipeline workflow manager (mostly for genomics).
- Yap - Extensible parallel framework, written in Python using OpenMPI libraries.
- Wallaroo - Framework for streaming data applications and algorithms that react to real-time events.
- WorldMake - Easy collaborative reproducible computing.
- Zenaton - Workflow engine for orchestrating jobs, data, and events across your applications and third-party services.
- ZenML - Extensible open-source MLOps framework to create reproducible pipelines for data scientists.
Thanks for reading the post every day; I'll bring you another exciting topic tomorrow.