Scaling ML Dreams: A Journey Through Distributed MLOps
Jhon Jairo Murillo Giraldo


It was 2 AM, and the soft glow of computer screens illuminated tired faces in our startup's cramped office. We had just pushed our machine learning model into production, feeling a mix of excitement and trepidation. Little did we know, this was the beginning of a transformative journey into the world of distributed MLOps.

The Awakening

Our first ML model was a recommendation engine for our e-commerce platform. It worked flawlessly in our local environment, but as soon as it hit production, things went sideways. Data skew, model drift, and scaling issues plagued us. We realized we needed a more robust, distributed approach to our ML operations.

This is a story many ML teams can relate to. The transition from a local Jupyter notebook to a distributed, production-ready ML system is often a rude awakening. But fear not, for this is where the magic of MLOps in distributed systems comes into play.

The Tools of Our Trade

As we embarked on our MLOps journey, we discovered a treasure trove of tools designed to make our lives easier. Here are some that became indispensable to us:

1. Kubeflow: This became our go-to platform for deploying ML workflows on Kubernetes. Its ability to handle the entire ML lifecycle in a distributed environment was a game-changer.

2. MLflow: For experiment tracking and model versioning across our distributed setup, MLflow proved invaluable. It allowed us to compare experiments run on different nodes and reproduce results easily.

3. Apache Airflow: This powerful workflow orchestration tool helped us manage complex ML pipelines across our distributed infrastructure.

4. Feast: As our feature store, Feast enabled consistent feature management across training and serving in our distributed environment (see the sketch after this list).

5. Seldon Core: For model deployment and serving in our Kubernetes cluster, Seldon Core provided the flexibility and scalability we needed.
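
To make the feature-store idea concrete, here is a minimal sketch of how online features might be fetched from Feast at serving time. It assumes a Feast repository has already been configured with feast apply; the feature view, feature names, and entity key are purely illustrative.

from feast import FeatureStore

# Point Feast at the repository configured with "feast apply"
store = FeatureStore(repo_path=".")

# Fetch the latest online feature values for a single user
# (the "user_stats" feature view and its features are illustrative)
online_features = store.get_online_features(
    features=["user_stats:user_age", "user_stats:avg_click_duration"],
    entity_rows=[{"user_id": 42}],
).to_dict()

print(online_features)

Because Feast serves the same feature definitions offline for training and online for inference, this is how training/serving consistency is maintained.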

The Chronicles of Scaling

Chapter 1: Data Wrangling at Scale

Our first challenge was dealing with data at scale. We had data coming in from multiple sources, in various formats, and at different velocities. Enter Apache Spark:


from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

# Initialize Spark session
spark = SparkSession.builder.appName("DistributedFeatureEngineering").getOrCreate()

# Read data from various sources (inferSchema keeps numeric CSV columns typed)
clicks = spark.read.parquet("hdfs:///clicks_data")
user_profiles = spark.read.json("hdfs:///user_profiles")
product_catalog = spark.read.csv("hdfs:///product_catalog", header=True, inferSchema=True)

# Join datasets
joined_data = clicks.join(user_profiles, "user_id").join(product_catalog, "product_id")

# Feature engineering
assembler = VectorAssembler(inputCols=["user_age", "product_price", "click_duration"], outputCol="features")
final_data = assembler.transform(joined_data)

# Save processed data
final_data.write.parquet("hdfs:///processed_features")

This Spark job allowed us to process terabytes of data across our cluster, preparing it for model training.

Chapter 2: Distributed Training

With our data prepared, we moved on to distributed training. We leveraged Kubeflow for this, using TensorFlow's distributed training capabilities:

import tensorflow as tf
import tensorflow_datasets as tfds

# MultiWorkerMirroredStrategy replicates the model across every worker
# in the cluster and keeps their gradients in sync.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Build and compile the model inside the strategy scope so its variables
# are mirrored across workers.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

# Load the training data; Keras shards the input pipeline across workers automatically.
dataset = tfds.load('my_dataset', split='train', as_supervised=True)
dataset = dataset.batch(32).repeat()

model.fit(dataset, epochs=10, steps_per_epoch=100)

This allowed us to train our model across multiple GPUs and even multiple nodes, significantly reducing training time.
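
For context, MultiWorkerMirroredStrategy discovers its peers through the TF_CONFIG environment variable, which Kubeflow's TFJob operator injects into each worker pod. Here is a minimal sketch of what one worker's configuration looks like; the hostnames and ports are purely illustrative.

import json
import os

# Illustrative TF_CONFIG for worker 0 of a two-worker cluster; in a real
# Kubeflow deployment the TFJob operator sets this for every pod.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["trainer-worker-0:2222", "trainer-worker-1:2222"]
    },
    "task": {"type": "worker", "index": 0}
})

Every worker runs the same training script; only the task index differs from pod to pod.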

Chapter 3: Model Versioning and Experiment Tracking

As our team grew and experiments multiplied, keeping track of everything became crucial. MLflow came to our rescue:

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("https://mlflow.our-company.com")
mlflow.set_experiment("recommendation_engine")

# Log parameters, metrics, and the trained model (a scikit-learn estimator in this example)
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.85)
    mlflow.sklearn.log_model(model, "model")

# Retrieve the best model by sorting the experiment's runs on the logged accuracy metric
client = MlflowClient()
experiment = client.get_experiment_by_name("recommendation_engine")
best_run = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.accuracy DESC"],
    max_results=1,
)[0]
best_model = mlflow.sklearn.load_model(f"runs:/{best_run.info.run_id}/model")

This allowed us to track experiments, compare results, and easily retrieve the best performing models.
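
If you also want named, versioned models rather than raw run IDs, MLflow's Model Registry can register the best run under a model name. Here is a minimal sketch continuing from the snippet above, assuming the registry is enabled on the same tracking server; the registered-model name is illustrative.

import mlflow

# Register the best run's model under a named entry in the Model Registry
# ("recommendation_engine" is an illustrative registered-model name).
model_uri = f"runs:/{best_run.info.run_id}/model"
result = mlflow.register_model(model_uri, "recommendation_engine")
print(f"Registered version {result.version}")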

Chapter 4: Automated ML Pipelines

To tie everything together, we used Apache Airflow to create automated, distributed ML pipelines:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'mlops_team',
    'depends_on_past': False,
    'start_date': datetime(2023, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'ml_pipeline',
    default_args=default_args,
    description='A DAG for our ML pipeline',
    schedule_interval=timedelta(days=1),
)

def data_preprocessing():
    # Spark job for data preprocessing
    pass

def model_training():
    # Distributed training job
    pass

def model_evaluation():
    # Evaluate model performance
    pass

def model_deployment():
    # Deploy model if performance is satisfactory
    pass

t1 = PythonOperator(
    task_id='data_preprocessing',
    python_callable=data_preprocessing,
    dag=dag,
)

t2 = PythonOperator(
    task_id='model_training',
    python_callable=model_training,
    dag=dag,
)

t3 = PythonOperator(
    task_id='model_evaluation',
    python_callable=model_evaluation,
    dag=dag,
)

t4 = PythonOperator(
    task_id='model_deployment',
    python_callable=model_deployment,
    dag=dag,
)

t1 >> t2 >> t3 >> t4

This DAG automated our entire ML pipeline, from data preprocessing to model deployment, ensuring smooth operations across our distributed infrastructure.
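
In practice, the placeholder callables above can be swapped for dedicated operators. For example, the preprocessing task could submit the Chapter 1 Spark job directly; here is a minimal sketch, assuming the Airflow apache-spark provider is installed and a spark_default connection is configured (the application path is an illustrative placeholder).

from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Replaces the PythonOperator preprocessing task: submits the Chapter 1
# feature-engineering script to the cluster via spark-submit.
preprocess = SparkSubmitOperator(
    task_id='data_preprocessing',
    application='/opt/jobs/distributed_feature_engineering.py',
    conn_id='spark_default',
    dag=dag,
)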

The Happy Ending?

As we implemented these tools and practices, our ML operations transformed. We went from nightly fire-fighting to smooth, automated workflows. Our models were more accurate, our systems more robust, and our team... well, they finally got some sleep!

But in the world of ML and distributed systems, there's no real "ending." Technology evolves, new challenges arise, and we continue to adapt and improve. The journey of MLOps in distributed systems is ongoing, filled with continuous learning and optimization.

So, fellow ML enthusiasts, as you embark on your own MLOps journey, remember: the path may be challenging, but with the right tools and practices, you can turn your ML dreams into distributed realities. Happy scaling!
