Building an XGBoost Multi-class Classification Model using PySpark on Azure Databricks

Introduction

This article walks through building an XGBoost multi-class classification model on a large dataset (100M+ rows) with PySpark on Azure Databricks. Along the way, we use k-means clustering to impute missing values and CrossValidator for hyperparameter tuning.

1. XGBoost - A Brief Overview

XGBoost (eXtreme Gradient Boosting) is an efficient, scalable implementation of gradient boosting. Suited to classification, regression, and ranking tasks, its strengths include parallelized tree construction and native handling of missing values.
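
As a quick illustration of that native missing-value handling, here is a minimal single-node sketch on synthetic data (the distributed, PySpark-based pipeline is built in the sections that follow):

import numpy as np
from xgboost import XGBClassifier

# Synthetic 3-class data with ~10% of entries set to NaN; XGBoost learns
# a default branch direction for missing values instead of failing.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 3, size=1000)
X[rng.random(X.shape) < 0.1] = np.nan

clf = XGBClassifier(n_estimators=20, max_depth=3)
clf.fit(X, y)
print(clf.predict(X[:5]))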

2. K-means Clustering for Imputation

K-means-based imputation works by grouping similar rows into clusters. Once rows are assigned to clusters, a missing value can be (both options are sketched on toy data below):

  1. Imputed with its cluster's centroid value for the respective feature.
  2. Filled with a mean/median computed from the non-missing values within that cluster.
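
Here is a minimal single-node sketch of option 2 using synthetic NumPy data and scikit-learn (assumed available); it mirrors the rough-fill-then-refine strategy used in the PySpark implementation later:

import numpy as np
from sklearn.cluster import KMeans

# Toy data: two features, one missing entry (row 5, feature 1)
X = np.array([[1.0, 2.0],
              [1.1, 2.2],
              [0.9, 1.9],
              [9.8, 8.0],
              [10.2, 8.3],
              [10.0, np.nan]])

# Rough fill (column means) so k-means can run, then cluster
col_mean = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_mean, X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_filled)

# Option 2: replace each original NaN with its cluster's per-feature mean
X_imputed = X.copy()
for c in np.unique(labels):
    cluster_mean = np.nanmean(X[labels == c], axis=0)
    rows = (labels == c)
    X_imputed[rows] = np.where(np.isnan(X[rows]), cluster_mean, X[rows])
print(X_imputed[5])  # the NaN becomes ~8.15, the mean within its cluster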

PySpark Implementation on Azure Databricks

Setting Up and Importing Libraries

Ensure the cluster runs a Databricks runtime that supports PySpark; the Databricks Runtime for Machine Learning also ships with the xgboost package used below.
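
If the cluster is not running an ML runtime, the xgboost package can be installed into the notebook session first:

%pip install xgboost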

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import Imputer, VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.stat import Correlation
from xgboost.spark import SparkXGBClassifier  # distributed XGBoost (xgboost >= 1.7)

spark = SparkSession.builder.appName('XGBoost with Databricks and PySpark').getOrCreate()

Loading the Dataset from Azure Databricks Delta Table

data = spark.read.format("delta").load("/mnt/delta/large_dataset_delta_table")        
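
An optional sanity check on the loaded table before any modeling (the count triggers a full scan, so it is worth skipping if the table is very large and already trusted):

# Confirm the expected schema and scale
data.printSchema()
print(f"Row count: {data.count():,}")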

K-means for Imputation of Missing Values in the Dataset

# Spark KMeans cannot handle nulls: rough-fill into temp columns, cluster,
# then refine the original nulls with per-cluster means (all non-label
# columns are assumed to be numeric features).
feature_cols = [c for c in data.columns if c != "label"]
filled_cols = [f"{c}_filled" for c in feature_cols]
rough = Imputer(inputCols=feature_cols, outputCols=filled_cols,
                strategy="mean").fit(data).transform(data)
assembled = VectorAssembler(inputCols=filled_cols,
                            outputCol="kmeans_features").transform(rough)
clusters = KMeans(featuresCol="kmeans_features", k=10, seed=1) \
    .fit(assembled).transform(assembled)

# Per-cluster feature means, joined back to replace originally-missing values
cluster_means = clusters.groupBy("prediction").agg(
    *[F.avg(c).alias(f"{c}_mean") for c in feature_cols])
data_imputed = clusters.join(cluster_means, "prediction")
for c in feature_cols:
    data_imputed = data_imputed.withColumn(
        c, F.coalesce(F.col(c), F.col(f"{c}_mean"))).drop(f"{c}_mean")
data_imputed = data_imputed.drop("kmeans_features", "prediction", *filled_cols)
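
An optional check that no feature column still contains nulls after the imputation step:

# Count remaining nulls per feature; every value should be zero
data_imputed.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in feature_cols]
).show()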

Correlation Matrix Between Response Variable and Input Variables

# Assemble label + features into one vector for the correlation matrix
corr_assembler = VectorAssembler(inputCols=["label"] + feature_cols,
                                 outputCol="corr_features")
vector_data = corr_assembler.transform(data_imputed)

# Compute the Pearson correlation matrix; row/column 0 corresponds to the label
corr_matrix = Correlation.corr(vector_data, "corr_features").head()[0]
print(str(corr_matrix))
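
To read off just the response-vs-feature correlations (the first matrix row, with the label's self-correlation dropped); note that Pearson correlation against a multi-class label is only a rough screen:

import numpy as np

# Row 0 holds the label's correlation with itself and with each feature
label_corr = np.array(corr_matrix.toArray())[0, 1:]
for name, r in zip(feature_cols, label_corr):
    print(f"{name}: {r:+.3f}")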

Train-Test Split

# Assemble the model features (label excluded) and split
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
model_data = assembler.transform(data_imputed).select("features", "label")
(train_data, test_data) = model_data.randomSplit([0.8, 0.2], seed=42)
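
Note that randomSplit is not stratified. For heavily imbalanced classes, an approximate stratified split can be built with sampleBy; a hedged sketch, assuming the distinct label values are few enough to collect to the driver:

# Sample 80% within each label value, then take the complement as the test set
with_id = model_data.withColumn("row_id", F.monotonically_increasing_id())
fractions = {r["label"]: 0.8 for r in with_id.select("label").distinct().collect()}
train_strat = with_id.stat.sampleBy("label", fractions, seed=42)
test_strat = with_id.join(train_strat.select("row_id"), "row_id", "left_anti")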

XGBoost Model Training with CrossValidator

# SparkXGBClassifier is the official distributed XGBoost estimator
# (xgboost >= 1.7; preinstalled on Databricks Runtime ML). It replaces
# the older third-party sparkxgb XGBoostEstimator wrapper.
xgb = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=4  # set to the cluster's available workers
)

paramGrid = (ParamGridBuilder()
             .addGrid(xgb.max_depth, [5, 7, 10])
             .addGrid(xgb.learning_rate, [0.01, 0.1])
             .addGrid(xgb.n_estimators, [10, 50])
             .build())

# Note: 10 folds x 12 parameter combinations = 120 fits; on 100M+ rows
# consider fewer folds or a smaller grid if runtime is a concern
crossval = CrossValidator(estimator=xgb,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=10)

cvModel = crossval.fit(train_data)
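
Once fitting completes, each candidate's cross-validated average metric can be inspected to see which hyperparameters won:

# Pair every parameter combination with its average metric across folds
for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, f"-> {metric:.4f}")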

Model Evaluation

# Predict on test set
predictions = cvModel.transform(test_data)

# Classification report: compute accuracy and other metrics
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
f1 = evaluator.evaluate(predictions, {evaluator.metricName: "f1"})
weightedPrecision = evaluator.evaluate(predictions, {evaluator.metricName: "weightedPrecision"})
weightedRecall = evaluator.evaluate(predictions, {evaluator.metricName: "weightedRecall"})

print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")
print(f"Weighted Precision: {weightedPrecision}")
print(f"Weighted Recall: {weightedRecall}")        

Conclusion

With this article, we illustrated the practical application of the XGBoost model using PySpark on the Azure Databricks platform. The combined power of Azure Databricks, PySpark, and XGBoost ensures a robust and scalable solution for large datasets. While the provided code offers a foundational framework, model validation and interpretation, supported by domain expertise, are critical for tangible outcomes.
