Machine Learning Model Comparison, Visualization & Deployment using Apache Spark, Plotly & Flask (Part 2 of 3)
In part 1 of this series we saw how to train various models using 70% of the available data. In this part we will explore how to compare model performance on the remaining 30%. The main goal of this part is to select the model that performs best, which we will deploy as a web service for real-time scoring in the final part of the series.
Model Comparison & Selection
For binary classification model comparison and selection, below are a few metrics and graphs/curves that help visualize model performance and select the winning model.
Accuracy: Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to total observations. One may think that if we have high accuracy then our model is the best. Accuracy is a great measure, but only on symmetric datasets where the counts of false positives and false negatives are roughly the same.
Precision and Recall: While recall expresses the ability to find all relevant instances in a dataset, precision expresses the proportion of the data points our model flags as relevant that actually are relevant. Precision is a good measure to consider when the cost of a false positive is high (e.g., email spam detection). In contrast, recall is a good measure to consider when the cost of a false negative is extremely high (e.g., cancer detection). The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).
F-Measure: The F-measure is the harmonic mean of precision and recall, which takes both metrics into account: F1 = 2 x (precision x recall) / (precision + recall).
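To make these definitions concrete, here is a minimal sketch in plain Python that computes accuracy, precision, recall and F1 from confusion-matrix counts; the counts are made up for illustration and are not taken from the churn dataset.

# Hypothetical confusion-matrix counts -- illustrative only, not from the churn dataset
tp, fp, tn, fn = 80, 20, 880, 20
accuracy = (tp + tn) / (tp + tn + fp + fn)            # 0.96 -- looks great, but the classes are imbalanced
precision = tp / (tp + fp)                            # 0.80 -- of everything flagged as churned, how much really churned
recall = tp / (tp + fn)                               # 0.80 -- of everything that actually churned, how much we caught
f1 = 2 * precision * recall / (precision + recall)    # 0.80 -- harmonic mean of precision and recall
print(accuracy, precision, recall, f1)

With only 100 churners out of 1,000 customers in this made-up example, a model that never predicts churn would already score 0.90 accuracy, which is exactly why precision, recall and the F-measure matter on imbalanced data.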
ROC (Receiver Operating Characteristic): For binary classification models, a useful evaluation metric is the area under the ROC (Receiver Operating Characteristic) curve. An ROC curve is created by taking a binary classification predictor that uses a threshold value to assign labels given predicted continuous values. As you vary the threshold for a model you move between two extremes: when the true positive rate (TPR) and the false positive rate (FPR) are both 0, everything is labeled "not churned", and when both the TPR and FPR are 1, everything is labeled "churned". A random predictor that labels a customer as churned half the time and not churned the other half would have an ROC curve that is a straight diagonal line. This line cuts the unit square into two equally sized triangles, so the area under the curve is 0.5. An AUROC value of 0.5 therefore means that your predictor is no better at discriminating between the two classes than random guessing. The closer the value is to 1.0, the better its predictions are. A value below 0.5 indicates that we could actually make our model produce better predictions by reversing the answer it gives us.
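For intuition, the short sketch below (again plain Python with hypothetical scores and labels) shows how sweeping a threshold over predicted probabilities produces the (FPR, TPR) points that trace out an ROC curve. In the Spark code that follows, BinaryClassificationMetrics performs this computation for us.

# Hypothetical predicted probabilities and true labels -- illustrative only
scores = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,    0,   1,   0,   0  ]

def roc_point(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    tn = sum(p == 0 and l == 0 for p, l in zip(preds, labels))
    return fp / (fp + tn), tp / (tp + fn)   # (FPR, TPR)

# Each threshold yields one point on the ROC curve: threshold 1.1 gives (0, 0), threshold 0.0 gives (1, 1)
for t in [1.1, 0.75, 0.5, 0.25, 0.0]:
    print(t, roc_point(t))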
The code block below takes the models trained in part 1 of the series and exercises them on the unseen 30% test dataset. To evaluate model performance, for each model we will leverage the BinaryClassificationMetrics class available in the Spark package "org.apache.spark.mllib.evaluation".
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType
import spark.implicits._  // for .toDF on RDDs; usually already in scope in notebooks

// Build BinaryClassificationMetrics from a scored DataFrame by extracting
// (score, label) pairs, where the score is the predicted probability of the positive class.
def initClassificationMetrics(dataset: Dataset[_]): BinaryClassificationMetrics = {
  val scoreAndLabels =
    dataset.select(col("probability"), col("label").cast(DoubleType)).rdd.map {
      // Spark ML classifiers emit a probability vector; element 1 is the positive-class probability
      case Row(prediction: org.apache.spark.ml.linalg.Vector, label: Double) => (prediction(1), label)
      // Fall back to a raw double score if that is what the model produced
      case Row(prediction: Double, label: Double) => (prediction, label)
    }
  new BinaryClassificationMetrics(scoreAndLabels)
}
val randomForestPredictions = randomForestModel.transform(test)
val randomForestMetrics = initClassificationMetrics(randomForestPredictions)
val aurocRandomForest = randomForestMetrics.areaUnderROC
val auprcRandomForest = randomForestMetrics.areaUnderPR
val precisionRandomForest = randomForestMetrics.precisionByThreshold.toDF
.withColumn("model",lit("RandomForestClassifier"))
val recallRandomForest = randomForestMetrics.recallByThreshold.toDF
.withColumn("model",lit("RandomForestClassifier"))
val fMeasureRandomForest = randomForestMetrics.fMeasureByThreshold.toDF
.withColumn("model",lit("RandomForestClassifier"))
val rocRandomForest = randomForestMetrics.roc.toDF
.withColumn("model",lit("RandomForestClassifier"))
.withColumn("area",lit(aurocRandomForest))
val prRandomForest = randomForestMetrics.pr.toDF
.withColumn("model",lit("RandomForestClassifier"))
.withColumn("area",lit(auprcRandomForest))
val logisticPredictions = logisticRegressionModel.transform(test)
val logisticMetrics = initClassificationMetrics(logisticPredictions)
val aurocLogistic = logisticMetrics.areaUnderROC
val auprcLogistic = logisticMetrics.areaUnderPR
val precisionLogistic = logisticMetrics.precisionByThreshold.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
val recallLogistic = logisticMetrics.recallByThreshold.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
val fMeasureLogistic = logisticMetrics.fMeasureByThreshold.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
val rocLogistic = logisticMetrics.roc.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
.withColumn("area",lit(aurocLogistic))
val prLogistic = logisticMetrics.pr.toDF
.withColumn("model",lit("LogisticRegressionClassifier"))
.withColumn("area",lit(auprcLogistic))
val gbtPredictions = gbtModel.transform(test)
val gbtMetrics = initClassificationMetrics(gbtPredictions)
val aurocGBT = gbtMetrics.areaUnderROC
val auprcGBT = gbtMetrics.areaUnderPR
val precisionGBT = gbtMetrics.precisionByThreshold.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
val recallGBT = gbtMetrics.recallByThreshold.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
val fMeasureGBT = gbtMetrics.fMeasureByThreshold.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
val rocGBT = gbtMetrics.roc.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
.withColumn("area",lit(aurocGBT))
val prGBT = gbtMetrics.pr.toDF
.withColumn("model",lit("GradientBoostingTreesClassifier"))
.withColumn("area",lit(auprcGBT))
val xgBoostPredictions = xgBoostModel.transform(test)
val xgBoostMetrics = initClassificationMetrics(xgBoostPredictions)
val aurocXG = xgBoostMetrics.areaUnderROC
val auprcXG = xgBoostMetrics.areaUnderPR
val precisionXG = xgBoostMetrics.precisionByThreshold.toDF
.withColumn("model",lit("XGBoostClassifier"))
val recallXG = xgBoostMetrics.recallByThreshold.toDF
.withColumn("model",lit("XGBoostClassifier"))
val fMeasureXG = xgBoostMetrics.fMeasureByThreshold.toDF
.withColumn("model",lit("XGBoostClassifier"))
val rocXG = xgBoostMetrics.roc.toDF
.withColumn("model",lit("XGBoostClassifier"))
.withColumn("area",lit(aurocXG))
val prXG = xgBoostMetrics.pr.toDF
.withColumn("model",lit("XGBoostClassifier"))
.withColumn("area",lit(auprcXG))
val precisionAll = precisionRandomForest.union(precisionLogistic)
.union(precisionGBT).union(precisionXG)
val recallAll = recallRandomForest.union(recallLogistic)
.union(recallGBT).union(recallXG)
val fMeasureAll = fMeasureRandomForest.union(fMeasureLogistic)
.union(fMeasureGBT).union(fMeasureXG)
val rocAll = rocRandomForest.union(rocLogistic)
.union(rocGBT).union(rocXG)
val prAll = prRandomForest.union(prLogistic)
.union(prGBT).union(prXG)
precisionAll.createOrReplaceTempView("PRECISION_ALL")
recallAll.createOrReplaceTempView("RECALL_ALL")
fMeasureAll.createOrReplaceTempView("FMEASURE_ALL")
rocAll.createOrReplaceTempView("ROC_ALL")
prAll.createOrReplaceTempView("PR_ALL")
For visualizing model performance, we will switch from Scala to Python. The code block below creates some reusable Python functions to render various Plotly charts/graphs. Please note that these routines are geared towards Qubole's notebook platform; they need slight modifications to work with Jupyter or other notebook platforms (a sketch of one such modification follows the code block).
%pyspark
import plotly
import plotly.graph_objs as go
def plot(plot_dic, width="100%", height="100%", **kwargs):
    # Render the Plotly figure as an HTML div and hand it to the notebook's %angular display system
    kwargs['output_type'] = 'div'
    plot_str = plotly.offline.plot(plot_dic, **kwargs)
    print('%%angular <div style="height: %s; width: %s"> %s </div>' % (height, width, plot_str))

def visualizeModels(modelDataFrame, plotType):
    data = []
    if plotType == 'ROC':
        area_text = 'AUROC'
        title = 'Area Under Receiver Operating Characteristic Curve (AUROC)'
        xtitle = 'False Positive Rate (FPR) or 1-Specificity'
        ytitle = 'True Positive Rate (TPR) or Sensitivity'
    if plotType == 'PR':
        area_text = 'AUPRC'
        title = 'Area Under Precision Recall Curve (AUPRC)'
        xtitle = 'Recall'
        ytitle = 'Precision'
    if plotType == 'PRECISION':
        title = 'Precision By Threshold Curve'
        xtitle = 'Threshold'
        ytitle = 'Precision'
    if plotType == 'RECALL':
        title = 'Recall by Threshold Curve'
        xtitle = 'Threshold'
        ytitle = 'Recall'
    if plotType == 'F-MEASURE':
        title = 'F-measure by Threshold Curve'
        xtitle = 'Threshold'
        ytitle = 'F-Measure'
    # transform the dataframe into the trace dicts plotly expects, one line per model
    for i in range(len(modelDataFrame['model'].unique())):
        name = modelDataFrame['model'].unique()[i]
        name_text = name
        if 'area' in modelDataFrame:
            name_text = '{0} ({1} = {2:0.2f})'.format(name, area_text, modelDataFrame['area'].unique()[i])
        x_dim = modelDataFrame[modelDataFrame['model'] == name]['_1'].tolist()
        y_dim = modelDataFrame[modelDataFrame['model'] == name]['_2'].tolist()
        trace = dict(name=name_text, x=x_dim, y=y_dim, mode='lines', showlegend=True)
        data.append(trace)
    if plotType == 'ROC':
        # Reference diagonal: the ROC curve of a random classifier
        trace1 = go.Scatter(x=[0, 1], y=[0, 1], line=dict(color='navy', dash='dash'), name='Pure Chance/Coin Toss', showlegend=True)
        data.append(trace1)
    layout = dict(title=title, xaxis=dict(title=xtitle), yaxis=dict(title=ytitle),
                  legend=dict(x=0.6, y=1.119, font=dict(size=9)), showlegend=True, height=600, width=600)
    fig = dict(data=data, layout=layout)
    plot(fig, show_link=False)
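As noted earlier, the plot() helper above prints an %angular div because that is how Qubole/Zeppelin notebooks render raw HTML. If you are on Jupyter instead, a minimal variant (a sketch assuming the classic plotly.offline API) could render the figure inline:

import plotly

# One-time setup in a Jupyter notebook so Plotly figures render inline
plotly.offline.init_notebook_mode(connected=True)

def plot(plot_dic, **kwargs):
    # Display the figure directly instead of printing an %angular div
    plotly.offline.iplot(plot_dic, **kwargs)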
The code block below leverages the reusable routines above to produce the ROC and PR curves, as well as the Precision-Threshold, Recall-Threshold and F-Measure-Threshold curves. For demonstration purposes we will look at the two most important model evaluation curves here. The model with the highest area under the ROC and PR curves is the best performing model for the given dataset and the predictive problem we set out to solve.
%pyspark
rocAll = sqlContext.table("ROC_ALL").toPandas()
visualizeModels(rocAll,'ROC')
prAll = sqlContext.table("PR_ALL").toPandas()
visualizeModels(prAll,'PR')
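The remaining threshold curves registered in the Scala section (PRECISION_ALL, RECALL_ALL and FMEASURE_ALL) can be rendered with the same helper; a short sketch following the same pattern as the ROC/PR calls above:

%pyspark
precisionAll = sqlContext.table("PRECISION_ALL").toPandas()
visualizeModels(precisionAll, 'PRECISION')
recallAll = sqlContext.table("RECALL_ALL").toPandas()
visualizeModels(recallAll, 'RECALL')
fMeasureAll = sqlContext.table("FMEASURE_ALL").toPandas()
visualizeModels(fMeasureAll, 'F-MEASURE')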
Conclusion
Even with a telco churn dataset containing 5 million observations, using BinaryClassificationMetrics in combination with the Plotly open source Python library, it is easy to evaluate, plot, visualize and compare model performance. In the final part of this series, we will explore how to export the winning model and deploy it as a web service for real-time scoring. The technical content for this blog was curated using Qubole's cloud-native big data platform. Qubole offers a choice of cloud, big data engines, tools and technologies to activate big data in the cloud. Sign up for a free Qubole account now to get started.