Building an XGBoost Multi-class Classification Model using PySpark on Azure Databricks

Introduction

This article walks through building an XGBoost multi-class classification model on a large dataset (100M+ rows) with PySpark on Azure Databricks. Along the way, we use k-means clustering to impute missing values and CrossValidator for hyperparameter tuning.

1. XGBoost - A Brief Overview

XGBoost (eXtreme Gradient Boosting) is an efficient, scalable implementation of gradient boosting. Suited to classification, regression, and ranking tasks, its strengths include parallelized tree construction and native handling of missing values.
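
As a quick illustration of that native missing-value handling, here is a minimal single-node sketch on synthetic data (the distributed, PySpark-based pipeline is built in the sections that follow):

import numpy as np
from xgboost import XGBClassifier

# Synthetic 3-class data with ~10% of entries set to NaN; XGBoost learns
# a default branch direction for missing values instead of failing.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 3, size=1000)
X[rng.random(X.shape) < 0.1] = np.nan

clf = XGBClassifier(n_estimators=20, max_depth=3)
clf.fit(X, y)
print(clf.predict(X[:5]))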

2. K-means Clustering for Imputation

K-means-based imputation works by grouping similar rows into clusters. Once rows are assigned to clusters, a missing value can be (both options are sketched on toy data below):

  1. Imputed with its cluster's centroid value for the respective feature.
  2. Filled with a mean/median computed from the non-missing values within that cluster.
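
Here is a minimal single-node sketch of option 2 using synthetic NumPy data and scikit-learn (assumed available); it mirrors the rough-fill-then-refine strategy used in the PySpark implementation later:

import numpy as np
from sklearn.cluster import KMeans

# Toy data: two features, one missing entry (row 5, feature 1)
X = np.array([[1.0, 2.0],
              [1.1, 2.2],
              [0.9, 1.9],
              [9.8, 8.0],
              [10.2, 8.3],
              [10.0, np.nan]])

# Rough fill (column means) so k-means can run, then cluster
col_mean = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_mean, X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_filled)

# Option 2: replace each original NaN with its cluster's per-feature mean
X_imputed = X.copy()
for c in np.unique(labels):
    cluster_mean = np.nanmean(X[labels == c], axis=0)
    rows = (labels == c)
    X_imputed[rows] = np.where(np.isnan(X[rows]), cluster_mean, X[rows])
print(X_imputed[5])  # the NaN becomes ~8.15, the mean within its cluster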

PySpark Implementation on Azure Databricks

Setting Up and Importing Libraries

Ensure the cluster runs a Databricks runtime that supports PySpark; the Databricks Runtime for Machine Learning also ships with the xgboost package used below.
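
If the cluster is not running an ML runtime, the xgboost package can be installed into the notebook session first:

%pip install xgboost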

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import Imputer, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.stat import Correlation
from xgboost.spark import SparkXGBClassifier  # distributed XGBoost (xgboost >= 1.7)

spark = SparkSession.builder.appName('XGBoost with Databricks and PySpark').getOrCreate()

Loading the Dataset from Azure Databricks Delta Table

data = spark.read.format("delta").load("/mnt/delta/large_dataset_delta_table")        
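
An optional sanity check on the loaded table before any modeling (the count triggers a full scan, so it is worth skipping if the table is very large and already trusted):

# Confirm the expected schema and scale
data.printSchema()
print(f"Row count: {data.count():,}")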

K-means for Imputation of Missing Values in the Dataset

# Spark KMeans cannot handle nulls: rough-fill into temp columns, cluster,
# then refine the original nulls with per-cluster means (all non-label
# columns are assumed to be numeric features).
feature_cols = [c for c in data.columns if c != "label"]
filled_cols = [f"{c}_filled" for c in feature_cols]
rough = Imputer(inputCols=feature_cols, outputCols=filled_cols,
                strategy="mean").fit(data).transform(data)
assembled = VectorAssembler(inputCols=filled_cols,
                            outputCol="kmeans_features").transform(rough)
clusters = KMeans(featuresCol="kmeans_features", k=10, seed=1) \
    .fit(assembled).transform(assembled)

# Per-cluster feature means, joined back to replace originally-missing values
cluster_means = clusters.groupBy("prediction").agg(
    *[F.avg(c).alias(f"{c}_mean") for c in feature_cols])
data_imputed = clusters.join(cluster_means, "prediction")
for c in feature_cols:
    data_imputed = data_imputed.withColumn(
        c, F.coalesce(F.col(c), F.col(f"{c}_mean"))).drop(f"{c}_mean")
data_imputed = data_imputed.drop("kmeans_features", "prediction", *filled_cols)
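
An optional check that no feature column still contains nulls after the imputation step:

# Count remaining nulls per feature; every value should be zero
data_imputed.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in feature_cols]
).show()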

Correlation Matrix Between Response Variable and Input Variables

# Assemble label + features into one vector for the correlation matrix
corr_assembler = VectorAssembler(inputCols=["label"] + feature_cols,
                                 outputCol="corr_features")
vector_data = corr_assembler.transform(data_imputed)

# Compute the Pearson correlation matrix; row/column 0 corresponds to the label
corr_matrix = Correlation.corr(vector_data, "corr_features").head()[0]
print(str(corr_matrix))
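
To read off just the response-vs-feature correlations (the first matrix row, with the label's self-correlation dropped); note that Pearson correlation against a multi-class label is only a rough screen:

import numpy as np

# Row 0 holds the label's correlation with itself and with each feature
label_corr = np.array(corr_matrix.toArray())[0, 1:]
for name, r in zip(feature_cols, label_corr):
    print(f"{name}: {r:+.3f}")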

Train-Test Split

# Assemble the model features (label excluded) and split
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
model_data = assembler.transform(data_imputed).select("features", "label")
(train_data, test_data) = model_data.randomSplit([0.8, 0.2], seed=42)
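
Note that randomSplit is not stratified. For heavily imbalanced classes, an approximate stratified split can be built with sampleBy; a hedged sketch, assuming the distinct label values are few enough to collect to the driver:

# Sample 80% within each label value, then take the complement as the test set
with_id = model_data.withColumn("row_id", F.monotonically_increasing_id())
fractions = {r["label"]: 0.8 for r in with_id.select("label").distinct().collect()}
train_strat = with_id.stat.sampleBy("label", fractions, seed=42)
test_strat = with_id.join(train_strat.select("row_id"), "row_id", "left_anti")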

XGBoost Model Training with CrossValidator

# SparkXGBClassifier is the official distributed XGBoost estimator
# (xgboost >= 1.7; preinstalled on Databricks Runtime ML). It replaces
# the older third-party sparkxgb XGBoostEstimator wrapper.
xgb = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=4  # set to the cluster's available workers
)

paramGrid = (ParamGridBuilder()
             .addGrid(xgb.max_depth, [5, 7, 10])
             .addGrid(xgb.learning_rate, [0.01, 0.1])
             .addGrid(xgb.n_estimators, [10, 50])
             .build())

# Note: 10 folds x 12 parameter combinations = 120 fits; on 100M+ rows
# consider fewer folds or a smaller grid if runtime is a concern
crossval = CrossValidator(estimator=xgb,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=10)

cvModel = crossval.fit(train_data)
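
Once fitting completes, each candidate's cross-validated average metric can be inspected to see which hyperparameters won:

# Pair every parameter combination with its average metric across folds
for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, f"-> {metric:.4f}")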

Model Evaluation

# Predict on test set
predictions = cvModel.transform(test_data)

# Classification report: compute accuracy and other metrics
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
f1 = evaluator.evaluate(predictions, {evaluator.metricName: "f1"})
weightedPrecision = evaluator.evaluate(predictions, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(predictions, {evaluator.metricName: "weightedRecall"})

print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")
print(f"Weighted Precision: {weightedPrecision}")
print(f"Weighted Recall: {weightedRecall}")        

Conclusion

With this article, we illustrated the practical application of the XGBoost model using PySpark on the Azure Databricks platform. The combined power of Azure Databricks, PySpark, and XGBoost ensures a robust and scalable solution for large datasets. While the provided code offers a foundational framework, model validation and interpretation, supported by domain expertise, are critical for tangible outcomes.
