Machine Learning - MLflow for managing the end-to-end machine learning lifecycle
Image By Author

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. The main advantages of including MLflow in your ML lifecycle are the transparency and standardisation it brings to training, tuning and deploying ML models. It lets you train, reuse, and deploy models with any library and package them into reproducible steps that other data scientists can use as a “black box,” without even having to know which library you are using.

Photo By MLflow

MLflow Setup

In the first step, we will set up the central backend store database where we will log all our tracking information, and we will also create an artifacts folder to store our models and other relevant files about the models.

# Terminal Command

mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./artifacts \
    --host 0.0.0.0 \
    --port 8080
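Once the server is running, the MLflow tracking UI will be reachable in your browser at http://localhost:8080 (given the host and port flags above).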

MLflow Tracking

MLflow Tracking is an API and UI for logging parameters, code versions, metrics and output files when running your machine learning code, so that you can visualise them later. With a few simple lines of code, you can track parameters, metrics, and artifacts:

Photo By MLflow
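As a minimal sketch of what that looks like (the parameter, metric, and file names here are illustrative, not from the model we build below):

import mlflow

# Everything logged inside the run context is attached to that run
with mlflow.start_run():
    mlflow.log_param("eta", 0.05)           # a hyperparameter
    mlflow.log_metric("val_auc", 0.91)      # a scalar metric
    mlflow.log_artifact("roc_curve.png")    # any local file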

Set Tracking URI and Create Experiment

Then, in our notebook, we will point to our tracking server. In this example we are using localhost, but you can direct this to any other server as well by passing the appropriate host and port in the set_tracking_uri call. Initially, you will have a default experiment, which you can fetch with the get_experiment call, or you can create your own as shown below.

mlflow.set_tracking_uri("http://127.0.0.1:8080/")
experiment = mlflow.get_experiment('0')
print("Name: {}".format(experiment.name))
print("Artifact Location: {}".format(experiment.artifact_location))
print("Lifecycle_stage: {}".format(experiment.lifecycle_stage))
print("Experiment ID: {}".format(experiment.experiment_id))        

Import Packages

Let's import all the packages that we will be using in this example.

import os
import sys
import itertools
import pandas as pd
import warnings as w
import numpy as np
import datetime
import time
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix, precision_score, matthews_corrcoef, recall_score, f1_score, plot_confusion_matrix, roc_auc_score, classification_report
import mlflow
import mlflow.sklearn
import mlflow.xgboost
from mlflow.models.signature import infer_signature

w.filterwarnings('ignore', category=Warning)
sns.set(style="whitegrid")
%matplotlib inline

Create Dataset

In this example, we will create our own dataset through the sklearn.datasets package.

Package: sklearn.datasets.make_classification

from sklearn.datasets import make_classification
X, y = make_classification(
    n_classes=2, class_sep=0.5, weights=[0.6, 0.4],
    n_informative=3, n_redundant=1, flip_y=0.3,
    n_features=20, n_clusters_per_class=3,
    n_samples=80000, random_state=11
)
model_dataset = pd.DataFrame(X)
model_dataset['Class'] = y
model_dataset = sklearn.utils.shuffle(model_dataset)

Split Dataset

After shuffling the dataset in the previous step, we will now split it into train, test and validation sets.

n = len(model_dataset)
train_df = model_dataset[0:int(n*0.8)]
val_df = model_dataset[int(n*0.8):int(n*0.9)]
test_df = model_dataset[int(n*0.9):]
X_train, y_train = train_df.iloc[:,:-1], train_df.iloc[:,-1]
X_val, y_val = val_df.iloc[:,:-1], val_df.iloc[:,-1]
X_test, y_test = test_df.iloc[:,:-1], test_df.iloc[:,-1]        

Turn MLflow Auto-Logging On

We will then turn on the autolog functionality in MLflow to record all the relevant information about the model run automatically. However, in the final training of the model we will also look at how to log information and artifacts manually.

# Enable autologging
mlflow.sklearn.autolog(log_models=True)
mlflow.xgboost.autolog(log_models=True)

Hyperparameter Tuning

Next, we can define the search procedure with the elements listed below and run RandomizedSearchCV after starting an MLflow run through the mlflow.start_run method. We pass two arguments to mlflow.start_run, experiment_id and run_name, which will be saved in the MLflow front end, i.e. the UI.

  • Define Search Space: A dictionary with parameter names (str) as keys and distributions or lists of parameter values to try. Distributions must provide an rvs method for sampling.
  • Define Search: “estimator” is the model to tune, here xgb_reg; a fresh clone of it is instantiated for each sampled parameter setting. The “n_iter” argument sets the number of parameter settings to sample from the search space; in this case, we will set it to 100. “n_jobs=-1” means using all processors. “cv” determines the cross-validation splitting strategy. “scoring” determines the strategy used to evaluate the performance of the cross-validated model.
  • Execute Search: Finally, fit with all the set parameters. In this example we will also use an evaluation set and set early_stopping_rounds to 10 to avoid overfitting.

Finally, we can perform the optimization and report the results.

xgb_reg = xgb.XGBClassifier()
params = {
    'num_boost_round': [5, 10, 15, 25],
    'eta': [0.05, 0.001, 0.1, 0.3],
    'max_depth': [3, 6, 5, 8],
    'subsample': [0.9, 1, 0.8],
    'colsample_bytree': [0.9, 1, 0.8],
    'alpha': [0.1, 0.3, 0]
}

with mlflow.start_run(experiment_id=experiment.experiment_id, run_name='debt_probability_model') as run:
    random_search = RandomizedSearchCV(xgb_reg, params, cv=5, n_iter=100, verbose=1)
    start = time.time()
    random_search.fit(X_train,
                      y_train,
                      eval_set=[(X_train, y_train), (X_val, y_val)],
                      early_stopping_rounds=10,
                      verbose=True)
    best_parameters = random_search.best_params_
    print('RandomizedSearchCV Results: ')
    print(random_search.best_score_)
    print('Best Parameters: ')
    for param_name in sorted(best_parameters.keys()):
        print("%s: %r" % (param_name, best_parameters[param_name]))
    end = time.time()
    print('time elapsed: ' + str(end - start))
    print(' ')
    print('Best Estimator: ')
    print(random_search.best_estimator_)
    y_pred = random_search.predict(X_test)

Logging Information Manually

You can also log information manually in MLflow, as shown below:

mlflow.log_param() logs a single key-value param in the currently active run. The key and value are both strings. Use mlflow.log_params() to log multiple params at once.

mlflow.log_metric() logs a single key-value metric. The value must always be a number. MLflow remembers the history of values for each metric. Use mlflow.log_metrics() to log multiple metrics at once.

mlflow.log_artifact() logs a local file or directory as an artifact, optionally taking an artifact_path to place it within the run’s artifact URI. Run artifacts can be organised into directories, so you can place the artifact in a directory this way.

mlflow.log_artifacts() logs all the files in a given directory as artifacts, again taking an optional artifact_path.
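Putting these together, a brief sketch (the keys, values, and file paths here are illustrative):

with mlflow.start_run():
    mlflow.log_param('max_depth', 6)                            # single param
    mlflow.log_params({'eta': 0.05, 'alpha': 0.1})              # several params at once
    mlflow.log_metric('val_f1', 0.87)                           # single metric
    mlflow.log_metrics({'val_auc': 0.91, 'val_mcc': 0.60})      # several metrics at once
    mlflow.log_artifact('report.txt', artifact_path='reports')  # one local file
    mlflow.log_artifacts('plots', artifact_path='plots')        # every file in a directory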

Photo By MLflow

Final Data Split

Next, we will create our final train and test sets to train the final model.

n = len(model_dataset)
train_df = model_dataset[0:int(n*0.8)]
test_df = model_dataset[int(n*0.8):]
X_train, y_train = train_df.iloc[:,:-1], train_df.iloc[:,-1]
X_test, y_test = test_df.iloc[:,:-1], test_df.iloc[:,-1]        

Final Model

Let's train our final model after starting an MLflow run through the mlflow.start_run method. We pass two arguments to mlflow.start_run, experiment_id and run_name, which will be saved in the MLflow front end, i.e. the UI. We will train the model by loading the best parameters found by the RandomizedSearchCV method in the previous step:

xgb.XGBClassifier(**random_search.best_params_)

In this example, we will log the parameters manually in MLflow as shown below:

  • mlflow.log_param('subsample', xgb_dict['subsample'])
  • mlflow.log_param('num_boost_round', xgb_dict['num_boost_round'])
  • mlflow.log_param('max_depth', xgb_dict['max_depth'])
  • mlflow.log_param('eta', xgb_dict['eta'])
  • mlflow.log_param('colsample_bytree', xgb_dict['colsample_bytree'])
  • mlflow.log_param('alpha', xgb_dict['alpha'])

We will also save the model manually with the help of the log_model method in MLflow; all the files will be saved in the ./artifacts folder. Finally, we calculate some metrics on the test set (i.e. Accuracy, F1, MCC) and create some plots to visualise the results of the model (i.e. Feature Importance, ROC, Confusion Matrix), which are saved in a folder in the working directory (i.e. www/xgb_results). You can then pass that directory to the log_artifacts method, which will log all the visualisations with the model run and make them visible in the front end of the MLflow UI.

Lastly, we also log the feature names with the model run, to make sure that when we use this model in the future we have the names of all the features that were used in training. This is a useful technique in case you are using one-hot encoding on your dataset and your features are dynamic depending on the values in the dataset.

# Final Model
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name='roi_xgb') as run:
    xgb_reg_main = xgb.XGBClassifier(**random_search.best_params_)
    xgb_reg_main.fit(X_train, y_train)
    xgb_dict = random_search.best_params_
    mlflow.set_tag('model_name', 'roi_xgb')

    # Log Parameters
    mlflow.log_param('subsample', xgb_dict['subsample'])
    mlflow.log_param('num_boost_round', xgb_dict['num_boost_round'])
    mlflow.log_param('max_depth', xgb_dict['max_depth'])
    mlflow.log_param('eta', xgb_dict['eta'])
    mlflow.log_param('colsample_bytree', xgb_dict['colsample_bytree'])
    mlflow.log_param('alpha', xgb_dict['alpha'])

    # Save Model
    signature = infer_signature(X_train, xgb_reg_main.predict(X_train))
    mlflow.xgboost.log_model(xgb_reg_main, "xgb_roi", signature=signature)
    y_preds = xgb_reg_main.predict(X_test)
    y_preds_proba = xgb_reg_main.predict_proba(X_test)

    # Calculating Metrics
    acc_xgb_main = (y_preds == y_test).sum().astype(float) / len(y_preds) * 100
    f1_xgb_main = f1_score(y_test, y_preds, average='micro')
    mcc_xgb_main = matthews_corrcoef(y_test, y_preds)
    features = X_train.columns

    # Make sure the output folder for the plots exists
    os.makedirs('www/xgb_results', exist_ok=True)

    # Feature Importance
    xgb_importances_main = pd.DataFrame({'Feature': features, 'Importance': xgb_reg_main.feature_importances_})
    xgb_importances_main = xgb_importances_main.sort_values(by='Importance', ascending=False)
    xgb_importances_main = xgb_importances_main.set_index('Feature')
    imp_xgb_main = xgb_importances_main[:25].plot.bar(figsize=(15, 8))
    fig = imp_xgb_main.get_figure()
    fig.savefig('www/xgb_results/xgb_main_imp.png', dpi=100, bbox_inches='tight')

    # Test ROC (metric_graph is a custom plotting helper, not shown here)
    roc_xgb_main = metric_graph(y_test, y_preds_proba[:, 1], metric='roc', figsize=(15, 8),
                                filename='www/xgb_results/xgb_main_roc.png')

    # Test Confusion Matrix
    class_names = np.unique(y_test)
    disp_xgb = plot_confusion_matrix(xgb_reg_main, X_test, y_test,
                                     display_labels=class_names,
                                     cmap=plt.cm.Blues)
    disp_xgb.ax_.set_title("Model: XGBoost")
    plt.savefig('www/xgb_results/xgb_main_cm.png', dpi=100, bbox_inches='tight')

    # Test Log Metrics
    mlflow.log_metric('test_accuracy', acc_xgb_main)
    mlflow.log_metric('test_f1_score', f1_xgb_main)
    mlflow.log_metric('test_mcc_score', mcc_xgb_main)
    mlflow.log_artifacts('www/xgb_results')

    # Log Features
    pd.DataFrame(columns=X_train.columns).to_csv('roi_features.csv', index=False)
    mlflow.log_artifact('roi_features.csv', artifact_path='features')

Note that MLflow's XGBoost autologging only captures models trained through the native XGBoost API (xgb.train), not the scikit-learn wrapper used above, which is why we logged the parameters and metrics manually. Let's also look at an example using the native implementation of XGBoost.

params = random_search.best_params_
params['eval_metric'] = 'mae'
# num_boost_round is passed to xgb.train directly, not inside params
del params['num_boost_round']
num_boost_round = 200

# X_train_undersample / y_train_undersample and their test equivalents are an
# undersampled variant of the train/test split (created outside this snippet)
dtrain = xgb.DMatrix(X_train_undersample, label=y_train_undersample)
dtest = xgb.DMatrix(X_test_undersample, label=y_test_undersample)

# Final Model
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name='roi_xgb') as run:
    xgb_reg_main = xgb.train(params, dtrain, num_boost_round=num_boost_round,
                             evals=[(dtest, "Test")], early_stopping_rounds=20)
    mlflow.set_tag('model_name', 'roi_xgb')

    # Log Parameters
    mlflow.log_param('subsample', params['subsample'])
    mlflow.log_param('max_depth', params['max_depth'])
    mlflow.log_param('eta', params['eta'])
    mlflow.log_param('colsample_bytree', params['colsample_bytree'])
    mlflow.log_param('alpha', params['alpha'])

    # The native API returns probabilities; threshold at 0.5 for class labels
    y_xgb_preds_main_proba = xgb_reg_main.predict(dtest)
    y_xgb_preds_main = [1 if n >= 0.5 else 0 for n in y_xgb_preds_main_proba]

    # Calculating Metrics
    acc_xgb_main = (y_xgb_preds_main == y_test_undersample).sum().astype(float) / len(y_xgb_preds_main) * 100
    f1_xgb_main = f1_score(y_test_undersample, y_xgb_preds_main, average='micro')
    mcc_xgb_main = matthews_corrcoef(y_test_undersample, y_xgb_preds_main)
    features = X_train_undersample.columns

    # Feature Importance
    ax = xgb.plot_importance(xgb_reg_main, max_num_features=25, height=0.5, importance_type='weight')
    fig = ax.figure
    fig.set_size_inches(15, 8)
    fig.savefig('www/xgb_results/xgb_main_imp.png', dpi=100, bbox_inches='tight')

    # Test ROC (metric_graph is the same custom plotting helper as above)
    roc_xgb_main = metric_graph(y_test_undersample, y_xgb_preds_main_proba, metric='roc', figsize=(15, 8),
                                filename='www/xgb_results/xgb_main_roc.png')

    # Test Confusion Matrix
    class_names = np.unique(y_test_undersample)
    matrix = confusion_matrix(y_test_undersample, y_xgb_preds_main)
    plt.clf()
    # place labels at the top
    plt.gca().xaxis.tick_top()
    plt.gca().xaxis.set_label_position('top')
    # plot the matrix per se
    plt.imshow(matrix, interpolation='nearest', cmap=plt.cm.Blues)
    # plot colorbar to the right
    plt.colorbar()
    fmt = 'd'
    # write the number of predictions in each bucket
    thresh = matrix.max() / 2.
    for i, j in itertools.product(range(matrix.shape[0]), range(matrix.shape[1])):
        # if background is dark, use a white number, and vice-versa
        plt.text(j, i, format(matrix[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if matrix[i, j] > thresh else "black")
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names, rotation=45)
    plt.yticks(tick_marks, class_names)
    plt.tight_layout()
    plt.ylabel('True label', size=14)
    plt.xlabel('Predicted label', size=14)
    plt.title('Technique: Undersample | Model: XGBoost')
    plt.savefig('www/xgb_results/xgb_main_cm.png', dpi=100, bbox_inches='tight')

    # Test Log Metrics
    mlflow.log_metric('test_accuracy', acc_xgb_main)
    mlflow.log_metric('test_f1_score', f1_xgb_main)
    mlflow.log_metric('test_mcc_score', mcc_xgb_main)
    mlflow.log_artifacts('www/xgb_results')

    # Log Features
    pd.DataFrame(columns=X_train_undersample.columns).to_csv('roi_features.csv', index=False)
    mlflow.log_artifact('roi_features.csv', artifact_path='features')

MLflow Models

MLflow Models is a convention for packaging machine learning models in multiple formats called “flavors”. MLflow offers a variety of tools to help you deploy different flavors of models. Each MLflow Model is saved as a directory containing arbitrary files and an MLmodel descriptor file that lists the flavors it can be used in.

artifact_path: xgb_roi
flavors:
  python_function:
    data: model.xgb
    env: conda.yaml
    loader_module: mlflow.xgboost
    python_version: 3.6.10
  xgboost:
    data: model.xgb
    xgb_version: 1.3.3
run_id: 1c39ab98054340a5b14eebf975ae52b0

MLflow Model Flavours

In this example, the model can be used with tools that support either the xgboost or python_function model flavors.
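For instance, the python_function flavor lets any downstream tool load the model generically, without knowing it is an XGBoost model underneath. A brief sketch, assuming run_id holds the ID of the training run above (e.g. run.info.run_id):

import mlflow.pyfunc

# Load the model through the generic python_function flavor
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/xgb_roi")
predictions = model.predict(X_test)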

Photo By MLflow

Diverse Platform

MLflow provides tools to deploy many common model types to diverse platforms.
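For example, a registered model can be served as a local REST endpoint with the MLflow CLI (a sketch; the model name is from our example and the port is arbitrary):

# Terminal Command
mlflow models serve -m "models:/roi_xgboost/Production" -p 5001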

Photo By MLflow

Compare Model Runs

You can compare and visualise your model runs to see which version of the model is performing better.

Image By Author

Visualise Model Plots

You can visualise the plots we created in the final model run in the MLflow UI.


Register Model and Deploy in Production

You can register your models in the MLflow UI and deploy them to production. A registered model can then be loaded in another Python script by specifying its name and stage, as we will demonstrate in the next step.

Image By MLflow
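If you prefer to do this from code rather than the UI, a sketch using the model registry API (the model name matches our example; run_id is assumed to come from the training run above):

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model logged in the training run, then promote it to Production
result = mlflow.register_model(f"runs:/{run_id}/xgb_roi", "roi_xgboost")
client.transition_model_version_stage(name="roi_xgboost",
                                      version=result.version,
                                      stage="Production")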

Load Feature Names and Model from Production

Once our model is in the Production stage, we open another Python notebook containing our prediction dataset. We look up the model version that is in production with the search_model_versions method, and use its run ID to build the path to the feature file we saved in the artifacts folder, so that the feature names in our prediction set match those used in training.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Find the model version currently in the Production stage
xgb_dict = None
for mv in client.search_model_versions("name='roi_xgboost'"):
    model_version = dict(mv)
    if model_version['current_stage'] == 'Production':
        xgb_dict = model_version
        break

if xgb_dict is None:
    print('No model version is deployed to the production stage..')
else:
    features_filepath = os.path.join('./artifacts/0/', xgb_dict['run_id'], 'artifacts/features/roi_features.csv')
    model_features = pd.read_csv(features_filepath)

# Get columns that are missing from the prediction set
missing_cols = set(model_features.columns) - set(X_pred.columns)
# Add each missing column to the prediction set with a default value of 0
for i in missing_cols:
    X_pred[i] = 0
    X_pred[i] = X_pred[i].astype('uint8')
# Ensure the columns in the prediction set are in the same order as in the training set
X_pred = X_pred[model_features.columns]

Let's start loading our model so we can make some predictions. First, we will set the tracking URI and get the experiment.

mlflow.set_tracking_uri("http://127.0.0.1:8080/")
experiment = mlflow.get_experiment('0')
print("Name: {}".format(experiment.name))
print("Artifact Location: {}".format(experiment.artifact_location))
print("Lifecycle_stage: {}".format(experiment.lifecycle_stage))
print("Experiment ID: {}".format(experiment.experiment_id))        

We can load the model based on the stage it is in, i.e. None, Staging, Production or Archived. In our scenario we deployed the model to production through the MLflow UI in the previous steps, so we will load it from the Production stage with the load_model method, which takes one argument, model_uri.

xgb_model_name = "roi_xgboost"
stage = 'Production'

xgb_reg_main = mlflow.xgboost.load_model(
    model_uri=f"models:/{xgb_model_name}/{stage}")

Predictions on New Dataset

Finally, we can make predictions on our prediction set.

XGBoost = xgb_reg_main.predict(xgb.DMatrix(X_pred))  # the native Booster expects a DMatrix
debt_prob = pd.DataFrame(XGBoost, columns=['XGB_DEBT_PROBABILITY'])

Automate the Life Cycle

There are two open source approaches I recommend for automating your model lifecycle: Airflow and cron jobs. You can schedule your training script to run weekly or monthly, and your prediction script to run daily, as sketched below.

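A minimal cron sketch (the schedule and script paths are illustrative):

# Terminal Command: crontab -e
# Retrain every Monday at 02:00, predict every day at 06:00
0 2 * * 1 /usr/bin/python3 /home/user/mlflow_project/train.py
0 6 * * * /usr/bin/python3 /home/user/mlflow_project/predict.py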

Summary

In this article, we covered the basics of MLflow and how to use MLflow for managing the end-to-end machine learning lifecycle.

MLflow provides a powerful way to simplify the deployment of machine learning models within an organisation by tracking, managing and deploying models. Further, MLflow facilitates reproducibility: the same training or production machine learning code is designed to execute with the same results regardless of environment, whether in the cloud, on a local machine, or in a notebook.

Framework: Jupyter Notebook, Language: Python, Libraries: os, sys, datetime, time, sklearn, pandas, numpy, xgboost, matplotlib, seaborn and mlflow.

Follow me on Medium — TechFitLab
