Machine Learning - MLflow for managing the end-to-end machine learning lifecycle
Image By Author

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. The main advantages of including MLflow in your ML lifecycle are the transparency and standardisation it brings to training, tuning and deploying ML models. It lets you train, reuse, and deploy models with any library and package them into reproducible steps that other data scientists can use as a “black box,” without even having to know which library you are using.

Photo By MLflow

MLflow Setup

In the first step, we will set up the central backend store database where we will log all our tracking information, and we will also create an artifacts folder to store our models and other relevant files about the models.

# Terminal Command

mlflow server \
    --backend-store-uri sqlite:///mlflow.db \
    --default-artifact-root ./artifacts \
    --host 0.0.0.0 \
    --port 8080
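Once the server is running, the MLflow tracking UI will be reachable in your browser at http://localhost:8080 (given the host and port flags above).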

MLflow Tracking

MLflow Tracking is an API and UI for logging parameters, code versions, metrics and output files when running your machine learning code, so that you can visualise them later. With a few simple lines of code, you can track parameters, metrics, and artifacts:

Photo By MLflow
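As a minimal sketch of what that looks like (the parameter, metric, and file names here are illustrative, not from the model we build below):

import mlflow

# Everything logged inside the run context is attached to that run
with mlflow.start_run():
    mlflow.log_param("eta", 0.05)           # a hyperparameter
    mlflow.log_metric("val_auc", 0.91)      # a scalar metric
    mlflow.log_artifact("roc_curve.png")    # any local file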

Set Tracking URI and Create Experiment

Then, in our notebook, we will point to our tracking server. In this example we are using localhost, but you can direct this to any other server as well by passing the appropriate host and port in the set_tracking_uri call. Initially, you will have a default experiment, which you can fetch with the get_experiment call, or you can create your own as shown below.

mlflow.set_tracking_uri("http://127.0.0.1:8080/")
experiment = mlflow.get_experiment('0')
print("Name: {}".format(experiment.name))
print("Artifact Location: {}".format(experiment.artifact_location))
print("Lifecycle_stage: {}".format(experiment.lifecycle_stage))
print("Experiment ID: {}".format(experiment.experiment_id))        

Import Packages

Let's import all the packages that we will be using in this example.

import os
import sys
import itertools
import pandas as pd
import warnings as w
import numpy as np
import datetime
import time
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix, precision_score, matthews_corrcoef, recall_score, f1_score, plot_confusion_matrix, roc_auc_score, classification_report
import mlflow
import mlflow.sklearn
import mlflow.xgboost
from mlflow.models.signature import infer_signature

w.filterwarnings('ignore', category=Warning)
sns.set(style="whitegrid")
%matplotlib inline

Create Dataset

In this example, we will create our own dataset through the sklearn.datasets package.

Package: sklearn.datasets.make_classification

from sklearn.datasets import make_classification
X, y = make_classification(
    n_classes=2, class_sep=0.5, weights=[0.6, 0.4],
    n_informative=3, n_redundant=1, flip_y=0.3,
    n_features=20, n_clusters_per_class=3,
    n_samples=80000, random_state=11
)
model_dataset = pd.DataFrame(X)
model_dataset['Class'] = y
model_dataset = sklearn.utils.shuffle(model_dataset)

Split Dataset

After shuffling the dataset in the previous step, we will now split it into train, test and validation sets.

n = len(model_dataset)
train_df = model_dataset[0:int(n*0.8)]
val_df = model_dataset[int(n*0.8):int(n*0.9)]
test_df = model_dataset[int(n*0.9):]
X_train, y_train = train_df.iloc[:,:-1], train_df.iloc[:,-1]
X_val, y_val = val_df.iloc[:,:-1], val_df.iloc[:,-1]
X_test, y_test = test_df.iloc[:,:-1], test_df.iloc[:,-1]        

Turn MLflow Auto-Logging On

We will then turn on the autolog functionality in MLflow to record all the relevant information about the model run automatically. However, in the final training of the model we will also look at how to log information and artifacts manually.

# Enable autologging
mlflow.sklearn.autolog(log_models=True)
mlflow.xgboost.autolog(log_models=True)

Hyperparameter Tuning

Next, we can define the search procedure with the elements listed below and run RandomizedSearchCV after starting an MLflow run through the mlflow.start_run method. We pass two arguments to mlflow.start_run, experiment_id and run_name, which will be saved in the MLflow front end, i.e. the UI.

  • Define Search Space: A dictionary with parameter names (str) as keys and distributions or lists of parameter values to try. Distributions must provide an rvs method for sampling.
  • Define Search: “estimator” is the model to tune, here xgb_reg; a fresh clone of it is instantiated for each sampled parameter setting. The “n_iter” argument sets the number of parameter settings to sample from the search space; in this case, we will set it to 100. “n_jobs=-1” means using all processors. “cv” determines the cross-validation splitting strategy. “scoring” determines the strategy used to evaluate the performance of the cross-validated model.
  • Execute Search: Finally, fit with all the set parameters. In this example we will also use an evaluation set and set early_stopping_rounds to 10 to avoid overfitting.

Finally, we can perform the optimization and report the results.

xgb_reg = xgb.XGBClassifier()
params = {
    'num_boost_round': [5, 10, 15, 25],
    'eta': [0.05, 0.001, 0.1, 0.3],
    'max_depth': [3, 6, 5, 8],
    'subsample': [0.9, 1, 0.8],
    'colsample_bytree': [0.9, 1, 0.8],
    'alpha': [0.1, 0.3, 0]
}

with mlflow.start_run(experiment_id=experiment.experiment_id, run_name='debt_probability_model') as run:
    random_search = RandomizedSearchCV(xgb_reg, params, cv=5, n_iter=100, verbose=1)
    start = time.time()
    random_search.fit(X_train,
                      y_train,
                      eval_set=[(X_train, y_train), (X_val, y_val)],
                      early_stopping_rounds=10,
                      verbose=True)
    best_parameters = random_search.best_params_
    print('RandomizedSearchCV Results: ')
    print(random_search.best_score_)
    print('Best Parameters: ')
    for param_name in sorted(best_parameters.keys()):
        print("%s: %r" % (param_name, best_parameters[param_name]))
    end = time.time()
    print('time elapsed: ' + str(end - start))
    print(' ')
    print('Best Estimator: ')
    print(random_search.best_estimator_)
    y_pred = random_search.predict(X_test)

Logging Information Manually

You can also log information manually in MLflow, as shown below:

mlflow.log_param() logs a single key-value param in the currently active run. The key and value are both strings. Use mlflow.log_params() to log multiple params at once.

mlflow.log_metric() logs a single key-value metric. The value must always be a number. MLflow remembers the history of values for each metric. Use mlflow.log_metrics() to log multiple metrics at once.

mlflow.log_artifact() logs a local file or directory as an artifact, optionally taking an artifact_path to place it within the run’s artifact URI. Run artifacts can be organised into directories, so you can place the artifact in a directory this way.

mlflow.log_artifacts() logs all the files in a given directory as artifacts, again taking an optional artifact_path.
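Putting these together, a brief sketch (the keys, values, and file paths here are illustrative):

with mlflow.start_run():
    mlflow.log_param('max_depth', 6)                            # single param
    mlflow.log_params({'eta': 0.05, 'alpha': 0.1})              # several params at once
    mlflow.log_metric('val_f1', 0.87)                           # single metric
    mlflow.log_metrics({'val_auc': 0.91, 'val_mcc': 0.60})      # several metrics at once
    mlflow.log_artifact('report.txt', artifact_path='reports')  # one local file
    mlflow.log_artifacts('plots', artifact_path='plots')        # every file in a directory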

Photo By MLflow

Final Data Split

Next, we will create our final train and test sets to train the final model.

n = len(model_dataset)
train_df = model_dataset[0:int(n*0.8)]
test_df = model_dataset[int(n*0.8):]
X_train, y_train = train_df.iloc[:,:-1], train_df.iloc[:,-1]
X_test, y_test = test_df.iloc[:,:-1], test_df.iloc[:,-1]        

Final Model

Let's train our final model after starting an MLflow run through the mlflow.start_run method. We pass two arguments to mlflow.start_run, experiment_id and run_name, which will be saved in the MLflow front end, i.e. the UI. We will train the model by loading the best parameters found by the RandomizedSearchCV method in the previous step:

xgb.XGBClassifier(**random_search.best_params_)

In this example, we will log the parameters manually in MLflow as shown below:

  • mlflow.log_param('subsample', xgb_dict['subsample'])
  • mlflow.log_param('num_boost_round', xgb_dict['num_boost_round'])
  • mlflow.log_param('max_depth', xgb_dict['max_depth'])
  • mlflow.log_param('eta', xgb_dict['eta'])
  • mlflow.log_param('colsample_bytree', xgb_dict['colsample_bytree'])
  • mlflow.log_param('alpha', xgb_dict['alpha'])

We will also save the model manually with the help of the log_model method in MLflow; all the files will be saved in the ./artifacts folder. Finally, we calculate some metrics on the test set (i.e. Accuracy, F1, MCC) and create some plots to visualise the results of the model (i.e. Feature Importance, ROC, Confusion Matrix), which are saved in a folder in the working directory (i.e. www/xgb_results). You can then pass that directory to the log_artifacts method, which will log all the visualisations with the model run and make them visible in the front end of the MLflow UI.

Lastly, we also log the feature names with the model run, to make sure that when we use this model in the future we have the names of all the features that were used in training. This is a useful technique in case you are using one-hot encoding on your dataset and your features are dynamic depending on the values in the dataset.

# Final Model
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name='roi_xgb') as run:
    xgb_reg_main = xgb.XGBClassifier(**random_search.best_params_)
    xgb_reg_main.fit(X_train, y_train)
    xgb_dict = random_search.best_params_
    mlflow.set_tag('model_name', 'roi_xgb')

    # Log Parameters
    mlflow.log_param('subsample', xgb_dict['subsample'])
    mlflow.log_param('num_boost_round', xgb_dict['num_boost_round'])
    mlflow.log_param('max_depth', xgb_dict['max_depth'])
    mlflow.log_param('eta', xgb_dict['eta'])
    mlflow.log_param('colsample_bytree', xgb_dict['colsample_bytree'])
    mlflow.log_param('alpha', xgb_dict['alpha'])

    # Save Model
    signature = infer_signature(X_train, xgb_reg_main.predict(X_train))
    mlflow.xgboost.log_model(xgb_reg_main, "xgb_roi", signature=signature)
    y_preds = xgb_reg_main.predict(X_test)
    y_preds_proba = xgb_reg_main.predict_proba(X_test)

    # Calculating Metrics
    acc_xgb_main = (y_preds == y_test).sum().astype(float) / len(y_preds) * 100
    f1_xgb_main = f1_score(y_test, y_preds, average='micro')
    mcc_xgb_main = matthews_corrcoef(y_test, y_preds)
    features = X_train.columns

    # Make sure the output folder for the plots exists
    os.makedirs('www/xgb_results', exist_ok=True)

    # Feature Importance
    xgb_importances_main = pd.DataFrame({'Feature': features, 'Importance': xgb_reg_main.feature_importances_})
    xgb_importances_main = xgb_importances_main.sort_values(by='Importance', ascending=False)
    xgb_importances_main = xgb_importances_main.set_index('Feature')
    imp_xgb_main = xgb_importances_main[:25].plot.bar(figsize=(15, 8))
    fig = imp_xgb_main.get_figure()
    fig.savefig('www/xgb_results/xgb_main_imp.png', dpi=100, bbox_inches='tight')

    # Test ROC (metric_graph is a custom plotting helper, not shown here)
    roc_xgb_main = metric_graph(y_test, y_preds_proba[:, 1], metric='roc', figsize=(15, 8),
                                filename='www/xgb_results/xgb_main_roc.png')

    # Test Confusion Matrix
    class_names = np.unique(y_test)
    disp_xgb = plot_confusion_matrix(xgb_reg_main, X_test, y_test,
                                     display_labels=class_names,
                                     cmap=plt.cm.Blues)
    disp_xgb.ax_.set_title("Model: XGBoost")
    plt.savefig('www/xgb_results/xgb_main_cm.png', dpi=100, bbox_inches='tight')

    # Test Log Metrics
    mlflow.log_metric('test_accuracy', acc_xgb_main)
    mlflow.log_metric('test_f1_score', f1_xgb_main)
    mlflow.log_metric('test_mcc_score', mcc_xgb_main)
    mlflow.log_artifacts('www/xgb_results')

    # Log Features
    pd.DataFrame(columns=X_train.columns).to_csv('roi_features.csv', index=False)
    mlflow.log_artifact('roi_features.csv', artifact_path='features')

Note that MLflow's XGBoost autologging only captures models trained through the native XGBoost API (xgb.train), not the scikit-learn wrapper used above, which is why we logged the parameters and metrics manually. Let's also look at an example using the native implementation of XGBoost.

params = random_search.best_params_
params['eval_metric'] = 'mae'
# num_boost_round is passed to xgb.train directly, not inside params
del params['num_boost_round']
num_boost_round = 200

# X_train_undersample / y_train_undersample and their test equivalents are an
# undersampled variant of the train/test split (created outside this snippet)
dtrain = xgb.DMatrix(X_train_undersample, label=y_train_undersample)
dtest = xgb.DMatrix(X_test_undersample, label=y_test_undersample)

# Final Model
with mlflow.start_run(experiment_id=experiment.experiment_id, run_name='roi_xgb') as run:
    xgb_reg_main = xgb.train(params, dtrain, num_boost_round=num_boost_round,
                             evals=[(dtest, "Test")], early_stopping_rounds=20)
    mlflow.set_tag('model_name', 'roi_xgb')

    # Log Parameters
    mlflow.log_param('subsample', params['subsample'])
    mlflow.log_param('max_depth', params['max_depth'])
    mlflow.log_param('eta', params['eta'])
    mlflow.log_param('colsample_bytree', params['colsample_bytree'])
    mlflow.log_param('alpha', params['alpha'])

    # The native API returns probabilities; threshold at 0.5 for class labels
    y_xgb_preds_main_proba = xgb_reg_main.predict(dtest)
    y_xgb_preds_main = [1 if n >= 0.5 else 0 for n in y_xgb_preds_main_proba]

    # Calculating Metrics
    acc_xgb_main = (y_xgb_preds_main == y_test_undersample).sum().astype(float) / len(y_xgb_preds_main) * 100
    f1_xgb_main = f1_score(y_test_undersample, y_xgb_preds_main, average='micro')
    mcc_xgb_main = matthews_corrcoef(y_test_undersample, y_xgb_preds_main)
    features = X_train_undersample.columns

    # Feature Importance
    ax = xgb.plot_importance(xgb_reg_main, max_num_features=25, height=0.5, importance_type='weight')
    fig = ax.figure
    fig.set_size_inches(15, 8)
    fig.savefig('www/xgb_results/xgb_main_imp.png', dpi=100, bbox_inches='tight')

    # Test ROC (metric_graph is the same custom plotting helper as above)
    roc_xgb_main = metric_graph(y_test_undersample, y_xgb_preds_main_proba, metric='roc', figsize=(15, 8),
                                filename='www/xgb_results/xgb_main_roc.png')

    # Test Confusion Matrix
    class_names = np.unique(y_test_undersample)
    matrix = confusion_matrix(y_test_undersample, y_xgb_preds_main)
    plt.clf()
    # place labels at the top
    plt.gca().xaxis.tick_top()
    plt.gca().xaxis.set_label_position('top')
    # plot the matrix per se
    plt.imshow(matrix, interpolation='nearest', cmap=plt.cm.Blues)
    # plot colorbar to the right
    plt.colorbar()
    fmt = 'd'
    # write the number of predictions in each bucket
    thresh = matrix.max() / 2.
    for i, j in itertools.product(range(matrix.shape[0]), range(matrix.shape[1])):
        # if background is dark, use a white number, and vice-versa
        plt.text(j, i, format(matrix[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if matrix[i, j] > thresh else "black")
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names, rotation=45)
    plt.yticks(tick_marks, class_names)
    plt.tight_layout()
    plt.ylabel('True label', size=14)
    plt.xlabel('Predicted label', size=14)
    plt.title('Technique: Undersample | Model: XGBoost')
    plt.savefig('www/xgb_results/xgb_main_cm.png', dpi=100, bbox_inches='tight')

    # Test Log Metrics
    mlflow.log_metric('test_accuracy', acc_xgb_main)
    mlflow.log_metric('test_f1_score', f1_xgb_main)
    mlflow.log_metric('test_mcc_score', mcc_xgb_main)
    mlflow.log_artifacts('www/xgb_results')

    # Log Features
    pd.DataFrame(columns=X_train_undersample.columns).to_csv('roi_features.csv', index=False)
    mlflow.log_artifact('roi_features.csv', artifact_path='features')

MLflow Models

MLflow Models is a convention for packaging machine learning models in multiple formats called “flavors”. MLflow offers a variety of tools to help you deploy different flavors of models. Each MLflow Model is saved as a directory containing arbitrary files and an MLmodel descriptor file that lists the flavors it can be used in.

artifact_path: xgb_roi
flavors:
  python_function:
    data: model.xgb
    env: conda.yaml
    loader_module: mlflow.xgboost
    python_version: 3.6.10
  xgboost:
    data: model.xgb
    xgb_version: 1.3.3
run_id: 1c39ab98054340a5b14eebf975ae52b0

MLflow Model Flavours

In this example, the model can be used with tools that support either the xgboost or python_function model flavors.
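For instance, the python_function flavor lets any downstream tool load the model generically, without knowing it is an XGBoost model underneath. A brief sketch, assuming run_id holds the ID of the training run above (e.g. run.info.run_id):

import mlflow.pyfunc

# Load the model through the generic python_function flavor
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/xgb_roi")
predictions = model.predict(X_test)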

Photo By MLflow

Diverse Platform

MLflow provides tools to deploy many common model types to diverse platforms.
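For example, a registered model can be served as a local REST endpoint with the MLflow CLI (a sketch; the model name is from our example and the port is arbitrary):

# Terminal Command
mlflow models serve -m "models:/roi_xgboost/Production" -p 5001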

Photo By MLflow

Compare Model Runs

You can compare and visualise your model runs to see which version of the model is performing better.

Image By Author

Visualise Model Plots

You can visualise the plots we created in the final model run in the MLflow UI.


Register Model and Deploy in Production

You can register your models in the MLflow UI and deploy them to production. A registered model can then be loaded in another Python script by specifying its name and stage, as we will demonstrate in the next step.

Image By MLflow
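If you prefer to do this from code rather than the UI, a sketch using the model registry API (the model name matches our example; run_id is assumed to come from the training run above):

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model logged in the training run, then promote it to Production
result = mlflow.register_model(f"runs:/{run_id}/xgb_roi", "roi_xgboost")
client.transition_model_version_stage(name="roi_xgboost",
                                      version=result.version,
                                      stage="Production")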

Load Feature Names and Model from Production

Once our model is in the Production stage, we open another Python notebook containing our prediction dataset. We look up the model version that is in production with the search_model_versions method, and use its run ID to build the path to the feature file we saved in the artifacts folder, so that the feature names in our prediction set match those used in training.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Find the model version currently in the Production stage
xgb_dict = None
for mv in client.search_model_versions("name='roi_xgboost'"):
    model_version = dict(mv)
    if model_version['current_stage'] == 'Production':
        xgb_dict = model_version
        break

if xgb_dict is None:
    print('No model version is deployed to the production stage..')
else:
    features_filepath = os.path.join('./artifacts/0/', xgb_dict['run_id'], 'artifacts/features/roi_features.csv')
    model_features = pd.read_csv(features_filepath)

# Get columns that are missing from the prediction set
missing_cols = set(model_features.columns) - set(X_pred.columns)
# Add each missing column to the prediction set with a default value of 0
for i in missing_cols:
    X_pred[i] = 0
    X_pred[i] = X_pred[i].astype('uint8')
# Ensure the columns in the prediction set are in the same order as in the training set
X_pred = X_pred[model_features.columns]

Let's start loading our model so we can make some predictions. First, we will set the tracking URI and get the experiment.

mlflow.set_tracking_uri("http://127.0.0.1:8080/")
experiment = mlflow.get_experiment('0')
print("Name: {}".format(experiment.name))
print("Artifact Location: {}".format(experiment.artifact_location))
print("Lifecycle_stage: {}".format(experiment.lifecycle_stage))
print("Experiment ID: {}".format(experiment.experiment_id))        

We can load the model based on the stage it is in, i.e. None, Staging, Production or Archived. In our scenario we deployed the model to production through the MLflow UI in the previous steps, so we will load it from the Production stage with the load_model method, which takes one argument, model_uri.

xgb_model_name = "roi_xgboost"
stage = 'Production'

xgb_reg_main = mlflow.xgboost.load_model(
    model_uri=f"models:/{xgb_model_name}/{stage}")

Predictions on New Dataset

Finally, we can make predictions on our prediction set.

XGBoost = xgb_reg_main.predict(xgb.DMatrix(X_pred))  # the native Booster expects a DMatrix
debt_prob = pd.DataFrame(XGBoost, columns=['XGB_DEBT_PROBABILITY'])

Automate the Life Cycle

There are two open source approaches I recommend for automating your model lifecycle: Airflow and cron jobs. You can schedule your training script to run weekly or monthly, and your prediction script to run daily, as sketched below.

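A minimal cron sketch (the schedule and script paths are illustrative):

# Terminal Command: crontab -e
# Retrain every Monday at 02:00, predict every day at 06:00
0 2 * * 1 /usr/bin/python3 /home/user/mlflow_project/train.py
0 6 * * * /usr/bin/python3 /home/user/mlflow_project/predict.py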

Summary

In this article, we covered the basics of MLflow and how to use MLflow for managing the end-to-end machine learning lifecycle.

MLflow provides a powerful way to simplify the deployment of machine learning models within an organisation by tracking, managing and deploying models. Further, MLflow facilitates reproducibility: the same training or production machine learning code is designed to execute with the same results regardless of environment, whether in the cloud, on a local machine, or in a notebook.

Framework: Jupyter Notebook, Language: Python, Libraries: os, sys, datetime, time, sklearn, pandas, numpy, xgboost, matplotlib, seaborn and mlflow.

Follow me on Medium — TechFitLab
