MLflow Puts the “Science” in Data Science
The Scientific Method has been practiced for hundreds of years to advance human knowledge and our understanding of the universe around us. It is the process of objectively establishing facts through observation, experimentation, and analysis.
The phrase “Scientific Method” may conjure up the names of famous scientists like Isaac Newton and Marie Curie, but today’s Data Scientists follow the same process in their everyday work when training machine learning (ML) models. The Data Science Lifecycle closely resembles the Scientific Method in that it starts with an observation, or really an understanding of the business. From that understanding comes a hypothesis about how the business can be optimized. Data is then collected and prepared from one or more sources while it is explored and analyzed for quality and patterns. Next, modeling begins: features are created and selected by importance. Various parameters are tried while training and evaluating the model to see their impact on the target variable.
Machine Learning vs Software Development
Developing an ML model is different from developing software. Software is developed against a functional specification and can be tested to prove whether or not the application meets that specification. ML development, however, is focused on optimizing metrics by applying a variety of algorithms to a problem that is intertwined with the data. A Data Scientist must run many experiments with slightly different configurations and measure the impact on the results. One configuration is selected to become a “model” and is promoted to production, where it runs inference on new data as it arrives. The performance of this model in production will degrade over time to the point where it needs to be retrained, and the cycle continues.
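The experiment loop described above can be sketched in plain Python. This is a minimal illustration, not a real training job: the `evaluate` function, the parameter names, and the search space are all hypothetical stand-ins for actual model training and scoring.

```python
from itertools import product


def evaluate(learning_rate: float, depth: int) -> float:
    """Toy stand-in for training and scoring a model; higher is better."""
    return 1.0 - (learning_rate - 0.1) ** 2 - 0.01 * abs(depth - 4)


# Hypothetical search space: every combination is one experiment run.
learning_rates = [0.01, 0.1, 0.5]
depths = [2, 4, 8]

runs = []
for lr, depth in product(learning_rates, depths):
    runs.append({"learning_rate": lr, "depth": depth, "score": evaluate(lr, depth)})

# The best-scoring configuration is the candidate to promote as the "model".
best = max(runs, key=lambda run: run["score"])
print(f"{len(runs)} runs; best configuration: {best}")
```

Even this tiny grid produces nine runs to compare, which hints at why keeping track of configurations and results quickly becomes a problem of its own.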
Introducing MLflow
The Data Science Lifecycle has been partially implemented by individual ML libraries, like scikit-learn and TensorFlow. However, there was no de facto project to unify all of these libraries across the entire lifecycle. This was the motivation behind the creation of MLflow, which was released as open source by Databricks in 2018. Its aim is to create an open framework that supports the ML lifecycle for any library used in the training and deployment of ML models.
One of the first areas of focus for MLflow was to allow Data Scientists to easily track all of the different parameters used to train a model, along with the resulting metrics from each run of an “experiment”. This helps with comparing the different runs to find the best model fit. MLflow tracks more than just the parameters used to train a model. It also tracks the entire environment, including the libraries used to train a model and their exact versions. When combined with Delta Lake’s time travel feature, it can track the exact version of the dataset used without needing to copy the data. Just as researchers must independently reproduce experiment results, Data Scientists need to independently reproduce the results of a model. This is particularly true when the Data Scientist who trained a model moves on to another project, or another company. A new Data Scientist is left to reproduce the model before retraining it. MLflow tracks the exact version of the environment and data so that a model can easily be reproduced.
Simplifying Discovery
The number of configuration permutations used to train a model can easily run into the hundreds, if not thousands. Before MLflow, it was not uncommon for Data Scientists to track this information in spreadsheets. MLflow allows experiment tracking to be centralized for an entire organization. This helps not only the individual Data Scientist who trains a model, but also the entire team, who need to reproduce models and see which permutations have already been attempted. This has been a big enabler for the biopharmaceutical company Amgen, which uses Databricks to support 240 different data science projects. One such project aims to decrease the time it takes to enroll in clinical trials. Accelerating any part of the drug discovery process can result in savings of tens of millions of dollars. They combine a variety of clinical data, including purchased data and real-world evidence, to improve the likelihood of success. They are able to do this by streamlining cross-team collaboration and standardizing the full lifecycle from experimentation to production.
Looking Ahead
As MLflow passes its four-year anniversary, it continues to lead the way in simplifying the Data Science Lifecycle. In June, MLflow 2.0 was announced, including support for the new MLflow Pipelines. The early releases of MLflow focused on organizing projects, packaging models, and tracking model training. This latest release with MLflow Pipelines focuses on combining modular ML code with software engineering best practices to make model deployment fast and scalable. This deserves a deeper dive in a future blog post.
Method to the Madness
The Scientific Method has been propelling the human race forward through its use of experimentation, observation, analysis, and reproducibility. This is true whether discovering the laws of physics, radiation, a new drug, or even why customers are likely to churn. Data Science has built upon this foundation and modeled itself on the same principles, and MLflow has unified this flow to help take machine learning from prototype to production faster. Whether you are attempting to understand the universe or the business, remember to heed Sir Isaac Newton’s advice that “No great discovery was ever made without a bold guess.”
Sr. Specialist Solutions Architect at Databricks
1 year ago: Great perspective on how an ML experiment is, after all, just a kind of scientific experiment, albeit a new one. Fond memories of past readings in the philosophy of science, Popper etc. ;)
Sr Partner Technical Manager at Databricks
2 years ago: Great article, Jason Pohl. Thanks for writing it.