Building an XGBoost Multi-class Classification Model using PySpark on Azure Databricks
Introduction
This article walks through building an XGBoost multi-class classification model on a large sample dataset (100M+ rows) using PySpark on Azure Databricks. Along the way, it uses K-means clustering to impute missing values and CrossValidator for hyperparameter tuning.
1. XGBoost - A Brief Overview
XGBoost (eXtreme Gradient Boosting) is an efficient, scalable implementation of gradient boosting suited to classification, regression, and ranking tasks. Its main strengths are parallelized tree construction and built-in handling of missing values.
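XGBoost handles missing values by learning, for each tree split, a default direction that samples with a missing feature follow. The toy function below only illustrates that routing rule; it is not XGBoost's own code, and the name route and its arguments are hypothetical.

```python
def route(value, threshold, default_left):
    """Return which child a sample goes to at a split of the form `feature < threshold`."""
    if value is None:
        # Missing value: follow the default direction learned for this split
        return "left" if default_left else "right"
    return "left" if value < threshold else "right"


# Two observed values and one missing value at the same split
decisions = [route(v, threshold=2.5, default_left=False) for v in (1.0, 4.0, None)]
print(decisions)
```

Because every split carries its own default direction, a model can be trained without imputing first; imputation, as done later in this article, remains an optional modeling choice.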
2. Kmeans Clustering for Imputation
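K-means imputation works by grouping similar data points into clusters. Once each row is assigned to a cluster, a missing value can be replaced with a statistic computed from the observed values of the same feature within that cluster; the code in this article uses the cluster mean.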
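The idea can be illustrated on a few in-memory rows before moving to Spark. The sketch below assumes the cluster assignments are already known; the function name impute_with_cluster_means and the sample values are hypothetical.

```python
from collections import defaultdict


def impute_with_cluster_means(rows, cluster_ids):
    """Fill None entries with the mean of the same column within the row's cluster."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate per-(cluster, column) sums over observed values only
    for row, cid in zip(rows, cluster_ids):
        for j, value in enumerate(row):
            if value is not None:
                sums[(cid, j)] += value
                counts[(cid, j)] += 1
    # Second pass: replace missing entries with the cluster mean when one exists
    imputed = []
    for row, cid in zip(rows, cluster_ids):
        new_row = []
        for j, value in enumerate(row):
            if value is None and counts[(cid, j)] > 0:
                value = sums[(cid, j)] / counts[(cid, j)]
            new_row.append(value)
        imputed.append(new_row)
    return imputed


rows = [[1.0, 10.0], [3.0, None], [100.0, 200.0], [None, 220.0]]
cluster_ids = [0, 0, 1, 1]
imputed_rows = impute_with_cluster_means(rows, cluster_ids)
print(imputed_rows)
```

A value stays missing if its cluster has no observed value for that column, so a global fallback may still be needed in practice.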
PySpark Implementation on Azure Databricks
Setting Up and Importing Libraries
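Ensure the cluster runs a Databricks runtime with PySpark and the xgboost package available. Databricks Runtime for Machine Learning ships with XGBoost preinstalled; on other runtimes, install xgboost as a cluster library.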
from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer
from pyspark.ml.clustering import KMeans
from xgboost.spark import SparkXGBClassifier  # XGBoost's PySpark estimator (xgboost >= 1.7); pyspark.ml has no XGBoostEstimator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler
spark = SparkSession.builder.appName('XGBoost with Databricks and PySpark').getOrCreate()
Loading the Dataset from Azure Databricks Delta Table
data = spark.read.format("delta").load("/mnt/delta/large_dataset_delta_table")
Kmeans for Imputation of Missing Values in Dataset
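from pyspark.sql import functions as F
# Assumption: every column except "label" is a numeric feature, and missing values are stored as nulls
feature_cols = [c for c in data.columns if c != "label"]
filled_cols = [f"{c}_filled" for c in feature_cols]
# K-means cannot cluster rows containing nulls, so fill them with column means for clustering only
filled = Imputer(inputCols=feature_cols, outputCols=filled_cols).fit(data).transform(data)
assembled = VectorAssembler(inputCols=filled_cols, outputCol="kmeans_features").transform(filled)
# K-means clustering
kmeans = KMeans(featuresCol="kmeans_features", predictionCol="cluster", k=10, seed=1)
clusters = kmeans.fit(assembled).transform(assembled)
# Group by cluster and compute the mean of each original feature (avg ignores nulls)
cluster_means = clusters.groupBy("cluster").agg(*[F.avg(c).alias(f"{c}_mean") for c in feature_cols])
# Impute missing values by joining cluster means back and using them only where the original value is null
data_imputed = clusters.join(cluster_means, "cluster").select("label", *[F.coalesce(F.col(c), F.col(f"{c}_mean")).alias(c) for c in feature_cols])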
Correlation Matrix Between Response Variable and Input Variables
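# Convert all columns to a single vector column for the correlation calculation.
# The label is included on purpose, so its row shows the correlation with each input variable.
assembler = VectorAssembler(inputCols=data.columns, outputCol="corr_features")
vector_data = assembler.transform(data_imputed)
# Compute the Pearson correlation matrix (rows and columns follow the order of data.columns)
corr_matrix = Correlation.corr(vector_data, "corr_features").head()[0]
print(corr_matrix.toArray())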
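Each entry of the matrix is a Pearson coefficient, which is the default method of Correlation.corr. As a small illustration of what one entry means, the sketch below computes the coefficient between a label and two features in plain Python; the helper pearson and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


label = [0.0, 1.0, 2.0, 3.0]
feature_a = [1.0, 3.0, 5.0, 7.0]  # increases linearly with the label
feature_b = [3.0, 2.0, 1.0, 0.0]  # decreases linearly with the label
r_a = pearson(label, feature_a)
r_b = pearson(label, feature_b)
print(r_a, r_b)
```

Values near 1 or -1 indicate a strong linear relationship with the label, and values near 0 indicate little linear relationship.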
Train-Test Split
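model_data = VectorAssembler(inputCols=[c for c in data.columns if c != "label"], outputCol="features").transform(data_imputed)  # inputs only
(train_data, test_data) = model_data.randomSplit([0.8, 0.2], seed=42)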
XGBoost Model Training with CrossValidator
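# SparkXGBClassifier is XGBoost's PySpark estimator (xgboost >= 1.7); imported here so this cell runs on its own
from xgboost.spark import SparkXGBClassifier
# The label column is expected to hold class indices 0 .. num_classes - 1
xgb = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    prediction_col="prediction"
)
paramGrid = (ParamGridBuilder()
    .addGrid(xgb.max_depth, [5, 7, 10])
    .addGrid(xgb.learning_rate, [0.01, 0.1])  # learning_rate is XGBoost's eta
    .addGrid(xgb.n_estimators, [10, 50])  # number of boosting rounds
    .build())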
crossval = CrossValidator(estimator=xgb,
estimatorParamMaps=paramGrid,
evaluator=MulticlassClassificationEvaluator(),
numFolds=10)
cvModel = crossval.fit(train_data)
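On a dataset of this size, it helps to know how many models the search trains. CrossValidator fits every parameter combination once per fold and then refits the best combination on the full training set. The sketch below counts those fits for the grid above; the dictionary keys are descriptive labels, not estimator parameter names.

```python
from itertools import product

# Candidate values taken from the parameter grid above
grid = {
    "tree depth": [5, 7, 10],
    "learning rate": [0.01, 0.1],
    "boosting rounds": [10, 50],
}
num_folds = 10

# Every combination of one value per hyperparameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
fits_during_search = len(combinations) * num_folds
total_fits = fits_during_search + 1  # plus the final refit of the best combination

print(len(combinations), fits_during_search, total_fits)
```

If that many fits is too expensive, reducing numFolds or trimming the grid shrinks the count proportionally.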
Model Evaluation
# Predict on test set
predictions = cvModel.transform(test_data)
# Classification report: compute accuracy and other metrics
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
f1 = evaluator.evaluate(predictions, {evaluator.metricName: "f1"})
weightedPrecision = evaluator.evaluate(predictions, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(predictions, {evaluator.metricName: "weightedRecall"})
print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")
print(f"Weighted Precision: {weightedPrecision}")
print(f"Weighted Recall: {weightedRecall}")
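The weighted metrics above average per-class scores, weighting each class by its share of the true labels. A plain-Python sketch of that calculation on a toy set of labels and predictions follows; the function weighted_metrics and the sample data are hypothetical and only meant to show how the numbers are derived.

```python
from collections import Counter


def weighted_metrics(labels, predictions):
    """Accuracy plus precision, recall and F1 averaged with true-label frequencies as weights."""
    n = len(labels)
    support = Counter(labels)          # true count per class
    predicted = Counter(predictions)   # predicted count per class
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)
    result = {"accuracy": sum(correct.values()) / n,
              "weightedPrecision": 0.0, "weightedRecall": 0.0, "f1": 0.0}
    for cls, count in support.items():
        precision = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        recall = correct[cls] / count
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        weight = count / n
        result["weightedPrecision"] += weight * precision
        result["weightedRecall"] += weight * recall
        result["f1"] += weight * f1
    return result


labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]
metrics = weighted_metrics(labels, predictions)
print(metrics)
```

Weighted recall always equals accuracy, because each class's recall is weighted by exactly its number of true samples, so the two printed values are expected to match.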
Conclusion
This article illustrated a practical application of XGBoost using PySpark on the Azure Databricks platform. Together, Azure Databricks, PySpark, and XGBoost provide a scalable approach to modeling large datasets. The code here is a foundational framework; careful model validation and interpretation, supported by domain expertise, remain critical for useful outcomes.