Machine Learning Model Comparison, Visualization & Deployment using Apache Spark, Plotly & Flask (Part 2 of 3)
In part 1 of this series we saw how to train various models using 70% of the available data. In this part we will explore how to compare model performance on the remaining 30%. The main goal of this part is to select the model that performs best, which we will deploy as a web service for real-time scoring in the final part of the series.
Model Comparison & Selection
For binary classification model comparison and selection, below are a few metrics and graphs/curves that help visualize model performance and select the winning model.
Accuracy: Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to total observations. One may think that if we have high accuracy then our model is the best. Accuracy is a great measure, but only on symmetric datasets where the counts of false positives and false negatives are roughly the same.
Precision and Recall: While recall expresses the ability to find all relevant instances in a dataset, precision expresses the proportion of the data points our model flags as relevant that actually are relevant. Precision is a good measure to consider when the cost of a false positive is high (e.g., email spam detection). In contrast, recall is a good measure to consider when the cost of a false negative is extremely high (e.g., cancer detection). The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).
F-Measure: The F-measure is the harmonic mean of precision and recall, which takes both metrics into account: F1 = 2 x (precision x recall) / (precision + recall).
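To make these definitions concrete, here is a minimal sketch in plain Python that computes accuracy, precision, recall and F1 from confusion-matrix counts; the counts are made up for illustration and are not taken from the churn dataset.

# Hypothetical confusion-matrix counts -- illustrative only, not from the churn dataset
tp, fp, tn, fn = 80, 20, 880, 20
accuracy = (tp + tn) / (tp + tn + fp + fn)            # 0.96 -- looks great, but the classes are imbalanced
precision = tp / (tp + fp)                            # 0.80 -- of everything flagged as churned, how much really churned
recall = tp / (tp + fn)                               # 0.80 -- of everything that actually churned, how much we caught
f1 = 2 * precision * recall / (precision + recall)    # 0.80 -- harmonic mean of precision and recall
print(accuracy, precision, recall, f1)

With only 100 churners out of 1,000 customers in this made-up example, a model that never predicts churn would already score 0.90 accuracy, which is exactly why precision, recall and the F-measure matter on imbalanced data.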
ROC (Receiver Operating Characteristic): For binary classification models, a useful evaluation metric is the area under the ROC (Receiver Operating Characteristic) curve. An ROC curve is created by taking a binary classification predictor that uses a threshold value to assign labels given predicted continuous values. As you vary the threshold for a model you move between two extremes: when the true positive rate (TPR) and the false positive rate (FPR) are both 0, everything is labeled "not churned", and when both the TPR and FPR are 1, everything is labeled "churned". A random predictor that labels a customer as churned half the time and not churned the other half would have an ROC curve that is a straight diagonal line. This line cuts the unit square into two equally sized triangles, so the area under the curve is 0.5. An AUROC value of 0.5 therefore means that your predictor is no better at discriminating between the two classes than random guessing. The closer the value is to 1.0, the better its predictions are. A value below 0.5 indicates that we could actually make our model produce better predictions by reversing the answer it gives us.
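For intuition, the short sketch below (again plain Python with hypothetical scores and labels) shows how sweeping a threshold over predicted probabilities produces the (FPR, TPR) points that trace out an ROC curve. In the Spark code that follows, BinaryClassificationMetrics performs this computation for us.

# Hypothetical predicted probabilities and true labels -- illustrative only
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,    0,   1,   0,   0  ]

def roc_point(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    tn = sum(p == 0 and l == 0 for p, l in zip(preds, labels))
    return fp / (fp + tn), tp / (tp + fn)   # (FPR, TPR)

# Each threshold yields one point on the ROC curve: threshold 1.1 gives (0, 0), threshold 0.0 gives (1, 1)
for t in [1.1, 0.75, 0.5, 0.25, 0.0]:
    print(t, roc_point(t))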
The code block below takes the models trained in part 1 of the series and exercises them on the unseen 30% test dataset. To evaluate model performance, for each model we will leverage the BinaryClassificationMetrics class available in the Spark package "org.apache.spark.mllib.evaluation".
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType
import spark.implicits._  // for .toDF on RDDs; usually already in scope in notebooks

// Build BinaryClassificationMetrics from a scored DataFrame by extracting
// (score, label) pairs, where the score is the predicted probability of the positive class.
def initClassificationMetrics(dataset: Dataset[_]): BinaryClassificationMetrics = {
  val scoreAndLabels =
    dataset.select(col("probability"), col("label").cast(DoubleType)).rdd.map {
      // Spark ML classifiers emit a probability vector; element 1 is the positive-class probability
      case Row(prediction: org.apache.spark.ml.linalg.Vector, label: Double) => (prediction(1), label)
      // Fall back to a raw double score if that is what the model produced
      case Row(prediction: Double, label: Double) => (prediction, label)
    }
  new BinaryClassificationMetrics(scoreAndLabels)
}
val randomForestPredictions = randomForestModel.transform(test)
val randomForestMetrics = initClassificationMetrics(randomForestPredictions)
val aurocRandomForest = randomForestMetrics.areaUnderROC
val auprcRandomForest = randomForestMetrics.areaUnderPR
val precisionRandomForest = randomForestMetrics.precisionByThreshold.toDF
.withColumn("model",lit("RandomForestClassifier"))
val recallRandomForest = randomForestMetrics.recallByThreshold.toDF
.withColumn("model",lit("RandomForestClassifier"))
val fMeasureRandomForest = randomForestMetrics.fMeasureByThreshold.toDF
.withColumn("model",lit("RandomForestClassifier"))
val rocRandomForest = randomForestMetrics.roc.toDF
.withColumn("model",lit("RandomForestClassifier"))
.withColumn("area",lit(aurocRandomForest))
val prRandomForest = randomForestMetrics.pr.toDF
.withColumn("model",lit("RandomForestClassifier"))
.withColumn("area",lit(auprcRandomForest))
val logisticPredictions = logisticRegressionModel.transform(test)
val logisticMetrics = initClassificationMetrics(logisticPredictions)
val aurocLogistic = logisticMetrics.areaUnderROC
val auprcLogistic = logisticMetrics.areaUnderPR
val precisionLogistic = logisticMetrics.precisionByThreshold.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
val recallLogistic = logisticMetrics.recallByThreshold.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
val fMeasureLogistic = logisticMetrics.fMeasureByThreshold.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
val rocLogistic = logisticMetrics.roc.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
.withColumn("area",lit(aurocLogistic))
val prLogistic = logisticMetrics.pr.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
.withColumn("area",lit(auprcLogistic))
val gbtPredictions = gbtModel.transform(test)
val gbtMetrics = initClassificationMetrics(gbtPredictions)
val aurocGBT = gbtMetrics.areaUnderROC
val auprcGBT = gbtMetrics.areaUnderPR
val precisionGBT = gbtMetrics.precisionByThreshold.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
val recallGBT = gbtMetrics.recallByThreshold.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
val fMeasureGBT = gbtMetrics.fMeasureByThreshold.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
val rocGBT = gbtMetrics.roc.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
.withColumn("area",lit(aurocGBT))
val prGBT = gbtMetrics.pr.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
.withColumn("area",lit(auprcGBT))
val xgBoostPredictions = xgBoostModel.transform(test)
val xgBoostMetrics = initClassificationMetrics(xgBoostPredictions)
val aurocXG = xgBoostMetrics.areaUnderROC
val auprcXG = xgBoostMetrics.areaUnderPR
val precisionXG = xgBoostMetrics.precisionByThreshold.toDF
.withColumn("model",lit("XGBoostClassifier"))
val recallXG = xgBoostMetrics.recallByThreshold.toDF
.withColumn("model",lit("XGBoostClassifier"))
val fMeasureXG = xgBoostMetrics.fMeasureByThreshold.toDF
.withColumn("model",lit("XGBoostClassifier"))
val rocXG = xgBoostMetrics.roc.toDF
.withColumn("model",lit("XGBoostClassifier"))
.withColumn("area",lit(aurocXG))
val prXG = xgBoostMetrics.pr.toDF
.withColumn("model",lit("XGBoostClassifier"))
.withColumn("area",lit(auprcXG))
val precisionAll = precisionRandomForest.union(precisionLogistic)
.union(precisionGBT).union(precisionXG)
val recallAll = recallRandomForest.union(recallLogistic)
.union(recallGBT).union(recallXG)
val fMeasureAll = fMeasureRandomForest.union(fMeasureLogistic)
.union(fMeasureGBT).union(fMeasureXG)
val rocAll = rocRandomForest.union(rocLogistic)
.union(rocGBT).union(rocXG)
val prAll = prRandomForest.union(prLogistic)
.union(prGBT).union(prXG)
precisionAll.createOrReplaceTempView("PRECISION_ALL")
recallAll.createOrReplaceTempView("RECALL_ALL")
fMeasureAll.createOrReplaceTempView("FMEASURE_ALL")
rocAll.createOrReplaceTempView("ROC_ALL")
prAll.createOrReplaceTempView("PR_ALL")
For visualizing model performance, we will switch from Scala to Python. The code block below creates some reusable Python functions to render various Plotly charts/graphs. Please note that these routines are geared towards Qubole's notebook platform; they need slight modifications to work with Jupyter or other notebook platforms (a sketch of one such modification follows the code block).
%pyspark
import plotly
import plotly.graph_objs as go
def plot(plot_dic, width="100%", height="100%", **kwargs):
    # Render the Plotly figure as an HTML div and hand it to the notebook's %angular display system
    kwargs['output_type'] = 'div'
    plot_str = plotly.offline.plot(plot_dic, **kwargs)
    print('%%angular <div style="height: %s; width: %s"> %s </div>' % (height, width, plot_str))

def visualizeModels(modelDataFrame, plotType):
    data = []
    if plotType == 'ROC':
        area_text = 'AUROC'
        title = 'Area Under Receiver Operating Characteristic Curve (AUROC)'
        xtitle = 'False Positive Rate (FPR) or 1-Specificity'
        ytitle = 'True Positive Rate (TPR) or Sensitivity'
    if plotType == 'PR':
        area_text = 'AUPRC'
        title = 'Area Under Precision Recall Curve (AUPRC)'
        xtitle = 'Recall'
        ytitle = 'Precision'
    if plotType == 'PRECISION':
        title = 'Precision By Threshold Curve'
        xtitle = 'Threshold'
        ytitle = 'Precision'
    if plotType == 'RECALL':
        title = 'Recall by Threshold Curve'
        xtitle = 'Threshold'
        ytitle = 'Recall'
    if plotType == 'F-MEASURE':
        title = 'F-measure by Threshold Curve'
        xtitle = 'Threshold'
        ytitle = 'F-Measure'
    # transform the dataframe into the trace dicts plotly expects, one line per model
    for i in range(len(modelDataFrame['model'].unique())):
        name = modelDataFrame['model'].unique()[i]
        name_text = name
        if 'area' in modelDataFrame:
            name_text = '{0} ({1} = {2:0.2f})'.format(name, area_text, modelDataFrame['area'].unique()[i])
        x_dim = modelDataFrame[modelDataFrame['model'] == name]['_1'].tolist()
        y_dim = modelDataFrame[modelDataFrame['model'] == name]['_2'].tolist()
        trace = dict(name=name_text, x=x_dim, y=y_dim, mode='lines', showlegend=True)
        data.append(trace)
    if plotType == 'ROC':
        # Reference diagonal: the ROC curve of a random classifier
        trace1 = go.Scatter(x=[0, 1], y=[0, 1], line=dict(color='navy', dash='dash'), name='Pure Chance/Coin Toss', showlegend=True)
        data.append(trace1)
    layout = dict(title=title, xaxis=dict(title=xtitle), yaxis=dict(title=ytitle),
                  legend=dict(x=0.6, y=1.119, font=dict(size=9)), showlegend=True, height=600, width=600)
    fig = dict(data=data, layout=layout)
    plot(fig, show_link=False)
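As noted earlier, the plot() helper above prints an %angular div because that is how Qubole/Zeppelin notebooks render raw HTML. If you are on Jupyter instead, a minimal variant (a sketch assuming the classic plotly.offline API) could render the figure inline:

import plotly

# One-time setup in a Jupyter notebook so Plotly figures render inline
plotly.offline.init_notebook_mode(connected=True)

def plot(plot_dic, **kwargs):
    # Display the figure directly instead of printing an %angular div
    plotly.offline.iplot(plot_dic, **kwargs)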
The code block below leverages the reusable routines above to produce the ROC and PR curves, as well as the Precision-Threshold, Recall-Threshold and F-Measure-Threshold curves. For demonstration purposes we will look at the two most important model evaluation curves here. The model with the highest area under the ROC and PR curves is the best performing model for the given dataset and the predictive problem we set out to solve.
%pyspark
rocAll = sqlContext.table("ROC_ALL").toPandas()
visualizeModels(rocAll,'ROC')
prAll = sqlContext.table("PR_ALL").toPandas()
visualizeModels(prAll,'PR')
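The remaining threshold curves registered in the Scala section (PRECISION_ALL, RECALL_ALL and FMEASURE_ALL) can be rendered with the same helper; a short sketch following the same pattern as the ROC/PR calls above:

%pyspark
precisionAll = sqlContext.table("PRECISION_ALL").toPandas()
visualizeModels(precisionAll, 'PRECISION')
recallAll = sqlContext.table("RECALL_ALL").toPandas()
visualizeModels(recallAll, 'RECALL')
fMeasureAll = sqlContext.table("FMEASURE_ALL").toPandas()
visualizeModels(fMeasureAll, 'F-MEASURE')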
Conclusion
Even with a telco churn dataset containing 5 million observations, using BinaryClassificationMetrics in combination with the Plotly open source Python library, it is easy to evaluate, plot, visualize and compare model performance. In the final part of this series, we will explore how to export the winning model and deploy it as a web service for real-time scoring. The technical content for this blog was curated using Qubole's cloud-native big data platform. Qubole offers a choice of cloud, big data engines, tools and technologies to activate big data in the cloud. Sign up for a free Qubole account now to get started.