Machine Learning Workflows in Production

To serve the needs of 50 million customers, Zalando uses a distributed organizational structure with ML expertise across 100+ product teams. These teams consist of software engineers and applied scientists who use a mix of third-party and internal tools. An additional central ML productivity team develops and supports the internal tools.

The machine learning project lifecycle consists of several phases. This article discusses in detail two of these phases: experimentation and production. Each one comes with specific requirements, and we will see how these requirements have influenced technology choices at Zalando.

The ML Journey

The ML lifecycle usually encompasses the conceptual phase, data discovery, experimentation, production, and operation. Moving from experimentation to production is one of the key steps. In practice, we frequently turn a Jupyter notebook into a machine-learning pipeline that needs to follow best engineering practices. Let’s look at this step more closely.

[Figure: ML Project Lifecycle]

Experimentation Requirements

An experiment typically validates or falsifies a hypothesis, for example, that one algorithm performs better than the others. Experiments require running jobs and tracking parameters, metrics, and metadata. A useful experimentation platform must meet several requirements. Here are the most important ones:

  • Quick start. Creating an experiment should be fast and easy.
  • Iteration speed, e.g., rapid feature engineering, model fitting, and parameter tuning. The experimentation environment should come with “batteries included,” i.e., provide access to commonly used libraries and visualization tools.
  • Easy access to data needed by the experiments.
  • Access to high-performance computing (HPC). Working with big data and large models is not uncommon, so easy access to Spark and powerful GPUs is essential even in the experimental phase.
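As a rough illustration of what tracking parameters and metrics involves, here is a minimal sketch in plain Python. It is not how Datalab or any Zalando tool works; the `Run` class, the `best_run` helper, and all numbers are made up for the example.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One experiment run (hypothetical record layout)."""
    name: str
    params: Dict[str, float]
    metrics: Dict[str, float] = field(default_factory=dict)
    started_at: float = field(default_factory=time.time)


def best_run(runs: List[Run], metric: str) -> Run:
    """Return the run with the highest value of the given metric."""
    return max(runs, key=lambda run: run.metrics[metric])


# Illustrative values only, not measurements from a real experiment.
runs = [
    Run("baseline", {"max_depth": 3}, {"accuracy": 0.81}),
    Run("deeper_trees", {"max_depth": 8}, {"accuracy": 0.86}),
]

winner = best_run(runs, "accuracy")
print(json.dumps(asdict(winner), indent=2))
```

Comparing runs on a single metric is the simplest way to check a hypothesis such as "algorithm B beats algorithm A"; a real platform would also record data versions and environment metadata.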

Production Requirements

Jupyter notebooks don’t meet the requirements for large-scale production deployment. Critical requirements for production include:

  • Code versioning and four-eyes principle. Any code deployed to production must be versioned in Git, be readable, follow best engineering practices, and be reviewed by two or more people.
  • Reproducibility. It should be possible to replicate past deployments.
  • Performance and scalability to ensure low response times and to automatically scale to traffic peaks, e.g., during events such as Cyber Week.
  • Privacy and compliance to maintain user trust and meet regulatory requirements.
  • Security and access control, e.g., ensuring that data is protected.
  • Observability. Ability to monitor model performance, be notified about problems, and debug them if they occur.

Technology Choices at Zalando

While assembling an end-to-end ML platform, Zalando has made several technology choices. It uses open-source and commercial software where available and builds in-house tools where doing so provides a significant speedup or where no satisfying alternative exists on the market.

Here are the main components of Zalando's internal ML platform:

  • Datalab, a hosted version of the open-source JupyterHub, provides a notebook environment available via a web browser. The tool is actively used by over 500 people, mainly analysts and applied scientists. With Datalab, anyone can start a fully functional workspace in 30 seconds, with preinstalled libraries, access to common data sources, and an option to use SageMaker CPU or GPU instances.
  • High-performance computing (HPC) cluster. Zalando has a cluster of high-performance Nvidia GPUs for selected use cases.
  • Databricks for experimentation with notebooks and big data processing in Spark.
  • GitHub Enterprise for version control, collaboration, and code reviews.
  • Continuous Delivery Platform (CDP), an in-house CI/CD software.
  • Backstage, an open-source platform for building developer portals. At Zalando, we developed our own plugins for CDP and the ML platform.
  • AWS is the main provider of cloud services. Some services we use for machine learning projects are Amazon SageMaker, AWS Lambda, S3, Step Functions, and CloudFormation.
  • zflow, an in-house library providing a domain-specific language (DSL) for ML pipelines. zflow pipelines are written as Python scripts that generate CloudFormation templates. The templates describe the infrastructure necessary to deploy and execute ML pipelines, most notably the Step Functions state machine for pipeline orchestration. Individual pipeline stages can be implemented by services such as Databricks, AWS Lambda, and Amazon SageMaker.

ML Pipeline Orchestration

Let's dive deeper into ML pipeline orchestration in production. If a hypothesis is validated in an experimentation environment (a notebook or a custom script running in the HPC cluster), the recommended way to prepare it for production is to implement it as a zflow pipeline. Pipelines must meet all previously mentioned requirements for production deployment. The entire process consists of several steps outlined below.

First, we define the pipeline using the zflow DSL in a Python script. This definition will include stages such as dynamic configuration (usually done by a separate AWS Lambda function), data processing (with Databricks, Lambda, or SageMaker), model training (with SageMaker), and inference (also with SageMaker). zflow offers flow control, so stages can be run in parallel and conditionally. Because a zflow script is just regular Python code annotated with type hints, users can leverage their IDE to spot mistakes early.
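zflow is an internal library and its API is not shown here, so the following is only a sketch of the general idea: a small, typed Python DSL that turns a list of stages into a Step Functions state machine definition (Amazon States Language). The names `Stage`, `Parallel`, `Pipeline`, `then`, and `to_state_machine` are hypothetical, and the ARNs are placeholders.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class Stage:
    """A single pipeline stage, rendered as a Task state."""
    name: str
    resource_arn: str  # placeholder ARN of the service that runs the stage

    def to_state(self) -> Dict[str, Any]:
        return {"Type": "Task", "Resource": self.resource_arn}


@dataclass
class Parallel:
    """A group of stages that run concurrently, rendered as a Parallel state."""
    name: str
    branches: List[Stage]

    def to_state(self) -> Dict[str, Any]:
        return {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": b.name, "States": {b.name: {**b.to_state(), "End": True}}}
                for b in self.branches
            ],
        }


Step = Union[Stage, Parallel]


@dataclass
class Pipeline:
    """An ordered sequence of steps (hypothetical DSL, not zflow itself)."""
    steps: List[Step] = field(default_factory=list)

    def then(self, step: Step) -> "Pipeline":
        self.steps.append(step)
        return self

    def to_state_machine(self) -> Dict[str, Any]:
        if not self.steps:
            raise ValueError("a pipeline needs at least one step")
        states: Dict[str, Any] = {}
        for index, step in enumerate(self.steps):
            state = step.to_state()
            if index + 1 < len(self.steps):
                state["Next"] = self.steps[index + 1].name  # chain to the next step
            else:
                state["End"] = True  # last step terminates the state machine
            states[step.name] = state
        return {"StartAt": self.steps[0].name, "States": states}


# Placeholder ARNs; a real SageMaker task would also need a Parameters block.
LAMBDA = "arn:aws:lambda:eu-central-1:123456789012:function:"
pipeline = (
    Pipeline()
    .then(Stage("Configure", LAMBDA + "configure"))
    .then(Parallel("Preprocess", [
        Stage("CleanOrders", LAMBDA + "clean-orders"),
        Stage("CleanClicks", LAMBDA + "clean-clicks"),
    ]))
    .then(Stage("Train", "arn:aws:states:::sagemaker:createTrainingJob.sync"))
)

definition = pipeline.to_state_machine()
print(json.dumps(definition, indent=2))
```

Because the stages are ordinary typed Python objects, passing a string where a `Stage` is expected is something an IDE or type checker can flag before anything is deployed.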

Second, we execute the script. This will generate a CloudFormation (CF) template. We utilize CF to define infrastructure as code. A CF template describes resources such as AWS Lambda functions and AWS IAM security policies used by the Step Functions state machine.
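To make the shape of such a template concrete, here is a minimal sketch that wraps a state machine definition in a CloudFormation template. The function `build_template` and the logical IDs `PipelineRole` and `PipelineStateMachine` are assumptions for illustration; the templates zflow actually generates are not shown in this article and likely contain more resources.

```python
import json
from typing import Any, Dict


def build_template(pipeline_name: str, definition: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a Step Functions definition in a CloudFormation template (sketch)."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"ML pipeline {pipeline_name}",
        "Resources": {
            # Role that Step Functions assumes; permission policies are omitted here.
            "PipelineRole": {
                "Type": "AWS::IAM::Role",
                "Properties": {
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Principal": {"Service": "states.amazonaws.com"},
                            "Action": "sts:AssumeRole",
                        }],
                    },
                },
            },
            # The state machine that orchestrates the pipeline stages.
            "PipelineStateMachine": {
                "Type": "AWS::StepFunctions::StateMachine",
                "Properties": {
                    "StateMachineName": pipeline_name,
                    "DefinitionString": json.dumps(definition),
                    "RoleArn": {"Fn::GetAtt": ["PipelineRole", "Arn"]},
                },
            },
        },
    }


# A trivial one-state definition stands in for a real pipeline.
definition = {"StartAt": "Train", "States": {"Train": {"Type": "Pass", "End": True}}}
template = build_template("demo-training-pipeline", definition)
print(json.dumps(template, indent=2))
```

Keeping the template as generated data means the infrastructure change shows up as a reviewable diff alongside the pipeline code.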

Third, we commit the zflow script and the generated CF template to GitHub Enterprise.

Fourth, we let the Zalando Continuous Delivery Platform deploy it to AWS. As the template is deployed, CF creates all the resources listed. The pipeline will be represented as a Step Functions state machine and can be seen in the AWS console.

ML Pipeline visualized in the AWS Step Functions console.

Fifth, the pipeline is executed either manually from the console or programmatically via a scheduler or API call.
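A programmatic start could look like the sketch below. It relies on the `start_execution` operation of the Step Functions API, which takes `stateMachineArn`, an optional `name`, and a JSON string `input`. The helper `start_pipeline`, the ARN, and the parameters are hypothetical, and a fake client is used so the example runs without AWS credentials; in practice one would pass `boto3.client("stepfunctions")` instead.

```python
import json
from typing import Any, Dict, List, Optional


def start_pipeline(
    client: Any,
    state_machine_arn: str,
    params: Dict[str, Any],
    run_name: Optional[str] = None,
) -> str:
    """Start one pipeline execution and return its execution ARN."""
    kwargs: Dict[str, Any] = {
        "stateMachineArn": state_machine_arn,
        "input": json.dumps(params),  # the API expects the input as a JSON string
    }
    if run_name is not None:
        kwargs["name"] = run_name
    response = client.start_execution(**kwargs)
    return response["executionArn"]


class FakeStepFunctionsClient:
    """Offline stand-in for boto3.client("stepfunctions"), for illustration only."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def start_execution(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        base = kwargs["stateMachineArn"].replace(":stateMachine:", ":execution:")
        return {"executionArn": base + ":" + kwargs.get("name", "generated-id")}


fake = FakeStepFunctionsClient()
arn = start_pipeline(
    fake,
    "arn:aws:states:eu-central-1:123456789012:stateMachine:demo-training-pipeline",
    {"learning_rate": 0.1},
    run_name="run-001",
)
print(arn)
```

Passing the client in as an argument keeps the function easy to test and lets a scheduler or an API handler reuse the same code path.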

Sixth, we store the execution history in a database, including detailed information about each stage, e.g., status messages, errors, and stack traces.
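The article does not say which database or schema is used, so the following is only a sketch of the idea using SQLite from the Python standard library. The table `stage_runs`, its columns, and the helper functions are assumptions made for the example.

```python
import sqlite3
import traceback
from typing import List, Optional, Tuple

# Hypothetical schema: one row per stage of each pipeline execution.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_runs (
    execution_id TEXT NOT NULL,
    stage        TEXT NOT NULL,
    status       TEXT NOT NULL,
    message      TEXT,
    stack_trace  TEXT,
    PRIMARY KEY (execution_id, stage)
)
"""


def record_stage(
    conn: sqlite3.Connection,
    execution_id: str,
    stage: str,
    status: str,
    message: Optional[str] = None,
    stack_trace: Optional[str] = None,
) -> None:
    """Insert or update the outcome of one stage."""
    conn.execute(
        "INSERT OR REPLACE INTO stage_runs VALUES (?, ?, ?, ?, ?)",
        (execution_id, stage, status, message, stack_trace),
    )
    conn.commit()


def failed_stages(conn: sqlite3.Connection, execution_id: str) -> List[Tuple[str, str]]:
    """Return (stage, message) pairs for every failed stage of an execution."""
    cursor = conn.execute(
        "SELECT stage, message FROM stage_runs "
        "WHERE execution_id = ? AND status = 'FAILED' ORDER BY stage",
        (execution_id,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(SCHEMA)

record_stage(conn, "run-001", "Preprocess", "SUCCEEDED")
try:
    raise ValueError("label column missing")  # simulated stage failure
except ValueError as exc:
    record_stage(conn, "run-001", "Train", "FAILED", str(exc), traceback.format_exc())

failures = failed_stages(conn, "run-001")
print(failures)
```

Storing the stack trace next to the status makes it possible to debug a failed run later without having to re-execute the pipeline.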

Seventh, our platform provides authors with near real-time pipeline monitoring and visualization in the Backstage developer portal. For example, you can see how metrics evolve across multiple runs of training pipelines and can view these changes on a graph. Any error messages are reported in the UI.

Going Forward

To help product teams at Zalando move faster and use ML more effectively, a few central teams operate and maintain JupyterHub, the HPC cluster, zflow, and Backstage. ML consultants support the sharing of best practices through training and ad-hoc collaboration. The Zalando ML platform is constantly evolving, and some of our current areas of focus are:

  • Lead time reduction, e.g., by simplifying the transition from experimentation to production.
  • Cost reduction and optimization.
  • Improved observability, e.g., providing in-context information on ML models and pipelines.

Machine learning is an essential part of the Zalando technology landscape. By improving our ML and MLOps practices, we keep up with rapid developments in the field and use them to improve customer experience and the daily work of builders within the company.

About the author

Krzysztof (Chris) Szafranek is a Senior Software Engineer on the Zalando ML Productivity team. He has been with Zalando for 5+ years and has been a software engineer for 17. He holds a master's degree in Computer Science.

AI Guild Announcements

Still waiting to be a member of the AI Guild?

Apply online at https://www.theguild.ai/.
