Machine Learning Workflows in Production
To serve the needs of 50 million customers, Zalando uses a distributed organizational structure with ML expertise across 100+ product teams. These teams consist of software engineers and applied scientists who use a mix of third-party and internal tools. An additional central ML productivity team develops and supports the internal tools.
The machine learning project lifecycle consists of several phases. This article discusses in detail two of these phases: experimentation and production. Each one comes with specific requirements, and we will see how these requirements have influenced technology choices at Zalando.
The ML Journey
The ML lifecycle usually encompasses the conceptual phase, data discovery, experimentation, production, and operation. Moving from experimentation to production is one of the key steps. In practice, we frequently turn a Jupyter notebook into a machine-learning pipeline that needs to follow best engineering practices. Let’s look at this step more closely.
Experimentation Requirements
An experiment typically validates or falsifies a hypothesis, for example that one algorithm performs better than another. A useful experimentation platform must therefore meet several requirements, chief among them the ability to run experiment jobs and to track their parameters, metrics, and other metadata.
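To make this concrete, here is a minimal sketch of experiment tracking. The article does not name a tracking tool, so the open-source MLflow library stands in as a generic example; the experiment name, parameters, and metric are hypothetical.

```python
import mlflow

# Illustration only: MLflow is used here as a generic, open-source example
# of experiment tracking; all names below are hypothetical.
mlflow.set_experiment("ranking-algorithm-comparison")

with mlflow.start_run(run_name="gradient-boosting"):
    # Parameters of the algorithm under test.
    mlflow.log_param("learning_rate", 0.1)
    mlflow.log_param("n_estimators", 200)
    # ... train and evaluate the model here ...
    # Metric that supports or refutes the hypothesis.
    mlflow.log_metric("ndcg_at_10", 0.42)
    # Arbitrary metadata attached to the run.
    mlflow.set_tag("dataset_version", "2024-01-15")
```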
Production Requirements
Jupyter notebooks don't meet the requirements of large-scale production deployment, which imposes a stricter set of engineering requirements of its own.
Technology Choices at Zalando
While assembling an end-to-end ML platform, Zalando has made several technology choices, opting for open-source and commercial software where available and building in-house tools where doing so provided a significant speedup or no satisfying alternative existed on the market.
The main components of Zalando's internal ML platform are JupyterHub and an HPC cluster for experimentation, zflow for ML pipeline orchestration, and the Backstage developer portal for pipeline monitoring and visualization.
ML Pipeline Orchestration
Let's dive deeper into ML pipeline orchestration in production. If a hypothesis is validated in an experimentation environment (a notebook or a custom script running in the HPC cluster), the recommended way to prepare it for production is to implement it as a zflow pipeline. Pipelines must meet all previously mentioned requirements for production deployment. The entire process consists of several steps outlined below.
First, we define the pipeline using the zflow DSL in a Python script. This definition includes stages such as dynamic configuration (usually handled by a separate AWS Lambda function), data processing (with Databricks, Lambda, or SageMaker), model training (with SageMaker), and inference (also with SageMaker). zflow offers flow control, so stages can run in parallel or conditionally. Because a zflow script is just regular Python code annotated with type hints, users can leverage their IDE to spot mistakes early.
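zflow is an internal tool and its API is not public, so the following is a purely hypothetical sketch of what a pipeline definition in such a Python DSL might look like. Every class and name below is invented, with minimal stand-ins so the snippet runs.

```python
from dataclasses import dataclass, field
from typing import List

# Stand-in classes invented for illustration; the real zflow DSL is internal
# and its API may look entirely different.
@dataclass
class Stage:
    name: str
    backend: str                      # e.g. "lambda", "databricks", "sagemaker"
    depends_on: List[str] = field(default_factory=list)

@dataclass
class Pipeline:
    name: str
    stages: List[Stage] = field(default_factory=list)

    def add(self, stage: Stage) -> Stage:
        self.stages.append(stage)
        return stage

# Mirrors the stages described above: dynamic configuration (Lambda),
# data processing (Databricks), training and inference (SageMaker).
pipeline = Pipeline(name="demand-forecast")
pipeline.add(Stage("configure", backend="lambda"))
pipeline.add(Stage("preprocess", backend="databricks", depends_on=["configure"]))
pipeline.add(Stage("train", backend="sagemaker", depends_on=["preprocess"]))
pipeline.add(Stage("batch-inference", backend="sagemaker", depends_on=["train"]))
```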
Second, we execute the script, which generates a CloudFormation (CF) template. We use CF to define infrastructure as code: the template describes resources such as AWS Lambda functions and the AWS IAM security policies used by the Step Functions state machine.
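For a rough idea of the output, here is a hand-written fragment of the kind of template such a script could generate. The resource types and properties (AWS::StepFunctions::StateMachine, AWS::IAM::Role) are standard CloudFormation; the resource names and the single-state pipeline definition are illustrative.

```python
import json

# A single-state Step Functions definition standing in for a full pipeline.
state_machine_definition = {
    "StartAt": "Train",
    "States": {
        "Train": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {},  # real training-job parameters would go here
            "End": True,
        }
    },
}

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        # IAM role the state machine assumes at run time.
        "PipelineRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [{
                        "Effect": "Allow",
                        "Principal": {"Service": "states.amazonaws.com"},
                        "Action": "sts:AssumeRole",
                    }],
                }
            },
        },
        # The pipeline itself, represented as a state machine.
        "PipelineStateMachine": {
            "Type": "AWS::StepFunctions::StateMachine",
            "Properties": {
                "DefinitionString": json.dumps(state_machine_definition),
                "RoleArn": {"Fn::GetAtt": ["PipelineRole", "Arn"]},
            },
        },
    },
}

print(json.dumps(template, indent=2))
```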
Third, we commit the zflow script and the generated CF template to GitHub Enterprise.
Fourth, we let the Zalando Continuous Delivery Platform deploy the template to AWS. As it is deployed, CF creates all the resources it lists. The pipeline is represented as a Step Functions state machine and can be seen in the AWS console.
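At Zalando this step is handled by the Continuous Delivery Platform. As a sketch of what such a deployment amounts to, creating the same stack directly with boto3 would look roughly like this (stack and file names are placeholders):

```python
import boto3

cf = boto3.client("cloudformation")

with open("pipeline-template.json") as f:  # placeholder file name
    template_body = f.read()

cf.create_stack(
    StackName="ml-pipeline-demand-forecast",  # placeholder stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],  # needed because the template creates IAM roles
)

# Block until CloudFormation has created every listed resource,
# including the Step Functions state machine.
cf.get_waiter("stack_create_complete").wait(StackName="ml-pipeline-demand-forecast")
```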
Fifth, the pipeline is executed either manually from the console or programmatically via a scheduler or API call.
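Programmatic execution comes down to a single Step Functions API call. A minimal boto3 sketch, with a placeholder state machine ARN and input payload:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# ARN and input payload are placeholders.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-central-1:123456789012:"
                    "stateMachine:ml-pipeline-demand-forecast",
    input=json.dumps({"run_date": "2024-01-15"}),
)
print(response["executionArn"])  # handle for tracking this run
```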
Sixth, we store the execution history in a database, including detailed information about each stage, e.g., status messages, errors, and stack traces.
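The article does not describe the storage layer, but the per-stage details are available from the Step Functions execution history. A sketch of collecting them with boto3, with persistence left as a stub:

```python
import boto3

sfn = boto3.client("stepfunctions")
execution_arn = "arn:aws:states:eu-central-1:123456789012:execution:..."  # placeholder

# Step Functions records one event per stage transition; failed stages carry
# error messages and causes in their event details.
paginator = sfn.get_paginator("get_execution_history")
for page in paginator.paginate(executionArn=execution_arn):
    for event in page["events"]:
        record = {
            "id": event["id"],
            "type": event["type"],                 # e.g. "TaskFailed"
            "timestamp": event["timestamp"].isoformat(),
        }
        # store(record)  # writing to the database is left as a stub
```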
Seventh, our platform provides pipeline authors with near real-time monitoring and visualization in the Backstage developer portal. For example, you can see how metrics evolve across multiple runs of a training pipeline and view these changes on a graph. Any error messages are reported in the UI.
Going forward
To help product teams at Zalando move faster and use ML more effectively, a few central teams operate and maintain JupyterHub, the HPC cluster, zflow, and Backstage. ML consultants support best-practice sharing through training and ad-hoc collaboration. The Zalando ML Platform is constantly evolving as we continue to invest in several focus areas.
Machine learning is an essential part of the Zalando technology landscape. By improving our ML and MLOps practices, we keep up with rapid developments in the field and use them to improve customer experience and the daily work of builders within the company.
About the author
Krzysztof (Chris) Szafranek is a Senior Software Engineer on the Zalando ML Productivity team. He has been with Zalando for 5+ years and a software engineer for 17. He holds a master's degree in Computer Science.
AI Guild Announcements
Still waiting to be a member of the AI Guild?
Apply online at https://www.theguild.ai/.