A Metaflow serverless Story

Joining a modern ML framework and a Serverless Engine

Overview

Metaflow is an open-source framework for building and deploying data science projects. Its mission is to make it easier for data scientists to build and deploy production-ready machine learning workflows, by providing a high-level abstraction framework for common data science tasks such as data preparation, model training, and deployment.

"Everything you need to develop data science and ML apps"

The goal of Metaflow is to help data scientists focus on the actual data science tasks, rather than getting bogged down in the details of infrastructure, data management and deployment. This is achieved by providing a set of tools and abstractions that simplify the development process and enable data scientists to create complex workflows with ease.

Running complex ML projects requires compute capabilities, which are often challenging to configure and maintain when such workflows need to be submitted at production scale.

Metaflow has made it easy for data scientists and machine learning engineers to run ML workflows both locally and utilising cloud resources (along with rapidly shifting between the two!). However, cloud resources have their own quirks and subtleties that aren’t always ideal for data science and ML:

  • @batch: slow and reliant on the quirks of AWS. In some cases, if the number of parallel tasks exceeds the available resources, the system may slow down or even crash. It works, but it breaks "the great developer experience" that is one of the Metaflow dogmas (a minimal sketch of remote execution follows this list).
  • @kube (Metaflow's @kubernetes decorator): while it provides many benefits, there are some limitations to its use:
    - Complexity: running workflows on Kubernetes clusters requires a certain level of technical expertise and knowledge of Kubernetes.
    - Scalability: while Kubernetes clusters are designed for scalability, running workflows may not always lead to linear performance improvements. Depending on the complexity of the workflow and the size of the cluster, it may not be possible to run tasks in parallel or at the desired scale.
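
For reference, here is a minimal sketch of what remote execution looks like with these decorators today. The flow below uses @batch (the resource values are purely illustrative; @kubernetes works analogously):

from metaflow import FlowSpec, step, batch

class RemoteFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    # @batch ships this single step to AWS Batch with the requested
    # resources (memory in MB); the other steps still run locally.
    @batch(cpu=4, memory=16000)
    @step
    def train(self):
        self.model = 'trained'
        self.next(self.end)

    @step
    def end(self):
        print(self.model)

if __name__ == '__main__':
    RemoteFlow()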

Is there still room to further improve the assignment of compute resources when running large and complex data science projects, for example by lowering the latency of executing tasks in the cloud?


Serverless is here…to help

Serverless platforms are, on paper, a perfect candidate to have your cake and eat it too: code execution is fast (vs @batch), and the underlying computation is fully managed and abstracted away (vs @kube).

As the wise man said, “in theory there is no difference between theory and practice - but in practice, there is”: when using cloud vendor serverless solutions, there are some tradeoffs that should be taken into account:

  • Cold start: the most common performance issue, referring to the instantiation of the container that will execute the function code, as well as its initialization phase. This typically happens when a function is called for the first time or after its configuration has changed, when the function scales out, or simply when the function has not been invoked for a while. The difference between "cold" and "warm" starts makes it difficult to consistently predict performance (a small timing sketch follows this list).
  • Tooling limitations, deployments/packaging tools: deployment tools usually interact with the platform via an API, and since these APIs are not standardised it is difficult to handle dependencies and deployments in a consistent way from one platform to another.
  • Tooling limitations, execution environments: functions execute with limited CPU, memory, disk and I/O resources and, unlike legacy server processes, they cannot run indefinitely. It is true that, as the underlying hardware platforms get more powerful, these resource limits are becoming less and less restrictive.
  • Vendor lock-in: being heavily dependent on a specific cloud service provider makes it difficult or costly to switch to another provider or bring the workload in-house, which creates various challenges and risks:
    - Increased switching costs: re-architecting applications, migrating data and training staff on new tools and services.
    - Limited flexibility: relying on a specific cloud provider limits the possibility to customise services or infrastructure, impacting the ability to innovate, optimise costs or adapt to changing business needs.
    - Dependency on proprietary services: cloud providers offer services that are tightly coupled with their platform, so it can be difficult to switch to another provider or replicate these services in-house.
    - Price and contractual risks: being locked to a specific cloud provider exposes you to pricing and/or contractual risks. For example, cloud providers may increase prices, change SLAs or terminate services without sufficient notice.
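
Cold starts are easy to observe empirically. Below is a minimal timing sketch, assuming a hypothetical HTTP-triggered function at a placeholder URL: the first call after an idle period typically pays the cold-start penalty, while subsequent calls hit a warm instance.

import time
import requests

# Placeholder endpoint of an HTTP-triggered serverless function.
URL = 'https://example.com/api/my-function'

def timed_call():
    start = time.perf_counter()
    requests.get(URL, timeout=60)
    return time.perf_counter() - start

# The first call is likely cold (container start + runtime init);
# the following ones should hit a warm instance and be much faster.
latencies = [timed_call() for _ in range(5)]
print('first: %.3fs, warm: %s' % (
    latencies[0], ['%.3fs' % t for t in latencies[1:]]))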

Overall, Metaflow is a powerful and flexible tool for managing the entire data science workflow, from data ingestion to model deployment. It has been designed with portability and flexibility in mind to minimise the above-mentioned issues and risks, but on the compute side it still needs to rely on specific cloud provider services (AWS and/or Azure), and these are typically the components that lack flexibility and customisation the most.

We saw a crazy opportunity: can open source serverless provide a new backbone for open source ML pipelines? With Nuvolaris, we believe it can.


Nuvolaris 101

Nuvolaris started with the idea of implementing a portable and open platform, based on the Apache OpenWhisk serverless engine, to simplify the process of building and deploying cloud-native applications.

Nuvolaris provides solutions and tools to solve some of the above issues:

  • Cold start: Nuvolaris leverages OpenWhisk's ability to pre-warm a given number of runtime instances with a predefined resource allocation (currently limited to memory requirements). This significantly speeds up the execution of a serverless action: by the time a request to execute a function arrives, the execution environment is already up and waiting for requests to process. This is achieved without additional components to keep a function warm (such as a cron job). Nuvolaris also provides ad-hoc AI/ML-enabled runtimes that extend OpenWhisk to AI/ML scenarios (the contract these runtimes implement is sketched after this list).
  • Tooling limitations, deployments/packaging tools: the Nuvolaris CLI makes it possible to deploy a full serverless API by leveraging the OpenWhisk package capability, which essentially bundles together one or more serverless functions based on a folder structure. Dependencies can be added by including them in the action folder (sometimes via a package manager, as in the case of JavaScript actions) and zipping everything together. For relatively simple cases there is no need to introduce CI/CD pipelines at all, as the deployment can be fully controlled via the nuv CLI tool; CI/CD pipelines remain the recommended approach for complex scenarios.
  • Tooling limitations, execution environments: Nuvolaris makes it possible to customise the OpenWhisk configuration, allowing the deployment of serverless functions with specific memory and timeout settings whose values can go beyond the typical limitations of other serverless platforms (such as AWS Lambda). CPU/GPU customisation is yet to come, but it is already on the Nuvolaris feature roadmap.
  • Vendor lock-in: Nuvolaris is built on a large and solid suite of open source software, starting from its most important component, the OpenWhisk serverless engine, and it applies the principle of write once, deploy everywhere. To a certain extent this can still be considered a sort of vendor lock-in, because applications need to be developed against the provided components; the difference is that the produced artefacts can be deployed on any supported Kubernetes runtime and moved from one cloud provider to another, including on-premise data centres, without any additional development effort.
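
The building block underneath all of this is the OpenWhisk action: a function with a fixed contract that the platform can deploy, pre-warm and invoke. For the Python runtime, the contract is simply a main function that takes a dict of parameters and returns a JSON-serialisable dict. A minimal example (the file and parameter names are illustrative):

# each.py - a minimal OpenWhisk Python action.
# OpenWhisk calls main() with the invocation parameters and
# expects a JSON-serialisable dict as the result.

def main(args):
    title = args.get('title', 'unknown')
    return {'title': '%s processed' % title}

Whatever code @nuvolaris executes remotely ultimately has to satisfy this same contract.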

While OpenWhisk, at the core of Nuvolaris, is usually associated with microservices, there is no principled reason to stop there. And that’s why, after we started using Metaflow, we asked ourselves:

"If a serverless platform is capable of executing functions, what about executing a Metaflow @step as a function?"

It turns out we can, and we will explain how in the following section.


The @nuvolaris decorator

Our @nuvolaris prototype implements a step decorator that can be used by the Metaflow "scheduler" to deploy and execute a serverless function inside an OpenWhisk runtime, as an alternative to @batch and @kube for remote execution.

The user experience is as seamless as every other Metaflow feature: it is sufficient to add the @nuvolaris decorator to a @step function in a flow! For example, the following small pipeline runs a step in parallel through Nuvolaris:

from metaflow import FlowSpec, step, nuvolaris

class ForeachFlow(FlowSpec):
    @step
    def start(self):
        self.titles = ['Stranger Things',
                       'House of Cards',
                       'Narcos',
                       'Suburra',
                       'Star Trek',
                       'Mission Impossible',
                       'Mission Impossible 2',
                       'Mission Impossible 3',
                       'Rogue']
        # Fan out: one parallel task per title.
        self.next(self.a, foreach='titles')

    # Each fanned-out task runs as an OpenWhisk action in the given
    # namespace, with memory in MB and timeout in milliseconds.
    @nuvolaris(namespace="nuvolaris", action="each", memory=256, timeout=120000)
    @step
    def a(self):
        self.title = '%s processed' % self.input
        self.next(self.join)

    @step
    def join(self, inputs):
        # Fan in: collect the results of all parallel tasks.
        self.results = [input.title for input in inputs]
        self.next(self.end)

    @step
    def end(self):
        print('\n'.join(self.results))

if __name__ == '__main__':
    ForeachFlow()
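
Assuming the forked Metaflow with the @nuvolaris plugin is installed and the OpenWhisk credentials are configured, the flow is launched like any other Metaflow flow (the file name is illustrative):

python foreach_flow.py run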


When the above code runs, Metaflow launches the parallel execution of the @nuvolaris steps, deploying behind the scenes an “action” with the specified parameters in OpenWhisk, and starts polling for completion:

2023-03-21 19:40:01.386 [1679427598783956/start/1 (pid 579034)] Foreach yields 9 child steps.
...
2023-03-21 19:40:04.932 [1679427598783956/a/9 (pid 579231)] creating action each with memory=256 and timeout=120000
...
2023-03-21 19:40:05.448 [1679427598783956/a/9 (pid 579231)] checking completion of nuvolaris activation a731a8fd14924f95b1a8fd14927f9527

As a video is worth a thousand log traces, we captured the complete execution of the above example in a video!
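
Under the hood, the completion check visible in the logs maps to the OpenWhisk activations API. Here is a rough sketch of such a polling loop; the host, credentials and error handling are placeholders, and the actual plugin code may differ:

import time
import requests

APIHOST = 'https://openwhisk.example.com'  # placeholder host
AUTH = ('user', 'password')                # placeholder credentials
NAMESPACE = 'nuvolaris'

def wait_for_activation(activation_id, poll_interval=1.0, timeout=120.0):
    # Poll until the activation record exists (OpenWhisk returns
    # 404 until the action has completed), then return its result.
    url = '%s/api/v1/namespaces/%s/activations/%s' % (
        APIHOST, NAMESPACE, activation_id)
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(url, auth=AUTH)
        if resp.status_code == 200:
            return resp.json()['response']['result']
        time.sleep(poll_interval)
    raise TimeoutError('activation %s did not complete' % activation_id)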

While preliminary, this prototype already shows the fantastic experience provided by OpenWhisk: in particular, the engine keeps accepting requests and executing them, even when there are not enough resources for full parallel execution, without affecting the overall execution flow.

Summing up our experience, we believe that Nuvolaris is a good fit for Metaflow, offering out of the box desirable features that particularly fit ML scenarios:

  • It makes it possible to assign remote computational resources effortlessly, in a similar fashion to the existing @batch and @kube plugins.
  • Nuvolaris will maintain AI/ML-specific runtimes:
    - They will provide a great development experience when experimenting with Metaflow, as execution time is not impacted by cold-start issues, thanks to OpenWhisk's ability to pre-warm and configure such runtimes.
    - They will simplify the dependency management burden.
  • It is possible to customise timeout and memory settings, bypassing the typical limits of other serverless platforms such as AWS Lambda.
  • OpenWhisk's backpressure capabilities offer a reliable way to guarantee that long-running and parallel ML tasks can be completed without affecting the overall flow execution.


Where we are now and what’s next

Currently the Nuvolaris Metaflow plugin supports OpenWhisk customisation parameters for namespace, action name, memory and timeout; it is implemented as a fork of Metaflow 2.7.14 and executes the actions via a custom Python OpenWhisk runtime with the required demo dependencies.
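
For reference, creating an action with those parameters boils down to a single call against the OpenWhisk REST API. A hedged sketch, with placeholder host and credentials (the plugin internals may differ):

import requests

APIHOST = 'https://openwhisk.example.com'  # placeholder host
AUTH = ('user', 'password')                # placeholder credentials

def create_action(namespace, action, code, memory=256, timeout=120000):
    # PUT an OpenWhisk action with the limits exposed by @nuvolaris:
    # memory in MB, timeout in milliseconds.
    url = '%s/api/v1/namespaces/%s/actions/%s?overwrite=true' % (
        APIHOST, namespace, action)
    payload = {
        'exec': {'kind': 'python:3', 'code': code},
        'limits': {'memory': memory, 'timeout': timeout},
    }
    return requests.put(url, json=payload, auth=AUTH)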

Of course, this is not the last word on serverless @step functions, but it is a very good first step. The next steps we are already thinking about include:

  • Add support for additional OpenWhisk customisation parameters, e.g. the number of CPUs.
  • Add support to execute Metaflow code on GPU-enabled hardware (via Kubernetes plugins, for instance).
  • Provide and support ML/AI-enabled OpenWhisk Python runtimes.
  • Improve the interaction with the OpenWhisk API for launching and monitoring the execution of each function.
  • Integrate @nuvolaris as an official Metaflow plugin. OK, this is very ambitious, but as an open source company Nuvolaris fully embraces the open source initiative, and contributing to other successful projects does not seem a bad idea at all.

If you want to try our prototype yourself, clone the repo, check the video and don’t be shy: get in touch with us for anything, including taking some of the above next steps together.
