Clustering Players Game Transactions with Amazon SageMaker

In this article, we will explore a practical application of K-Means clustering combined with Principal Component Analysis (PCA) to analyze and visualize a dataset of gambling transactions. We'll explain each step of the process, including data preprocessing, clustering, and visualization, providing detailed insights into how these techniques work and how they can be applied to real-world data. We will also discuss the potential applications of clustering in identifying patterns, detecting fraud, and managing risk in gambling transactions.

Amazon SageMaker and SageMaker Studio will be our main tools throughout this article.

Dataset Overview

Our dataset (Online Gaming Transaction History), available on Kaggle, consists of gambling transaction records with the following structure:

  • Date/Time: The date and time of the transaction.
  • Category: The category of the transaction (Banking, Bonus, Gaming, Lottery).
  • Transaction Type: The type of transaction (Stake, Win, Deposit, Activated, Expired, Purchase, Prize, Payment).
  • Wallet: The type of wallet used (Cash, Casino).
  • Amount: The amount involved in the transaction.
  • End Balance: The balance after the transaction.
  • Description: The description of the transaction.

The dataset does not include Player IDs, so we will cluster the transactions based on their features.

K-Means Clustering

The algorithm we're using here is K-Means, a popular unsupervised machine learning algorithm for partitioning a dataset into a set of distinct, non-overlapping groups (or clusters) based on their features. The main goal of K-Means is to divide the data points into clusters such that points within a cluster are more similar to each other than to points in other clusters.

Amazon SageMaker provides a built-in K-Means algorithm designed to simplify the process of clustering large datasets. With this algorithm, users can perform scalable and efficient clustering tasks without the need to manage underlying infrastructure or worry about optimization techniques. SageMaker's K-Means algorithm utilizes distributed computing to handle large volumes of data, enabling faster processing times and improved scalability. It works by first randomly selecting a predefined number of cluster centroids within the data space. Then, it iteratively assigns each data point to the nearest centroid, forming initial clusters, and recalculates each centroid as the mean of the data points assigned to it. The resulting clusters represent groups of data points that are closer to each other than to points in any other cluster.
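To make the assign-and-recompute cycle concrete, here is a minimal NumPy sketch of the plain (non-distributed) K-Means loop. It is for illustration only and is not the code behind SageMaker's built-in implementation.

# Minimal, illustrative K-Means loop (plain NumPy, not SageMaker's distributed version)
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids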

Data Preprocessing

Before applying K-Means, we need to preprocess the data to ensure it is in the right format for clustering; a code sketch of these steps follows the list below.

  1. Convert Date / Time feature to datetime: This allows us to extract time-based features.
  2. Encode Categorical Features: Convert categorical variables into numerical values using LabelEncoder.
  3. Extract Time-Based Features: Create new features such as Hour, Day, Month, and DayOfWeek from the Date / Time column.
  4. Drop the original Date/Time column: It’s no longer needed after extracting time-based features.
  5. Drop the Description column: this column does not help with the clustering.
  6. Rename Columns: the Date / Time, Transaction Type, Amount (CAD), and End Balance columns will be renamed to a more standard format.
  7. Standardize Numerical Features: Normalize the Amount and End Balance features to have zero mean and unit variance using StandardScaler.
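For reference, here is one way these steps could look with pandas and scikit-learn. This is a sketch that assumes the raw data is loaded into a DataFrame named df and that the original column names match those listed in the dataset overview.

# Preprocessing sketch (assumes a DataFrame df with the original column names)
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Rename columns to a more standard format
df = df.rename(columns={'Date / Time': 'DateTime', 'Transaction Type': 'TransactionType',
                        'Amount (CAD)': 'Amount', 'End Balance': 'EndBalance'})

# Convert to datetime and extract time-based features, then drop the original column
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['Hour'] = df['DateTime'].dt.hour
df['Day'] = df['DateTime'].dt.day
df['Month'] = df['DateTime'].dt.month
df['DayOfWeek'] = df['DateTime'].dt.dayofweek
df = df.drop(columns=['DateTime', 'Description'])

# Encode categorical features as integers
for col in ['Category', 'TransactionType', 'Wallet']:
    df[col] = LabelEncoder().fit_transform(df[col])

# Standardize the numerical features to zero mean and unit variance
df[['Amount', 'EndBalance']] = StandardScaler().fit_transform(df[['Amount', 'EndBalance']])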

After data preprocessing and prior to applying the StandardScaler, our dataset looks as follows:

Determine the Optimal Number of Clusters Using the Elbow Method

To determine the optimal number of clusters, we use the elbow method, which involves running K-Means for a range of k values and plotting the sum of squared distances (inertia) for each k. The "elbow" point in the plot indicates the optimal k, where adding more clusters does not significantly reduce the inertia.
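A possible way to produce the elbow plot with scikit-learn and matplotlib, assuming the preprocessed DataFrame df from the previous step, is sketched below.

# Elbow method sketch: run K-Means for a range of k and plot the inertia
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

k_values = range(1, 11)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(df.values)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (sum of squared distances)')
plt.title('Elbow Method')
plt.show()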

In our case, as is common with most real-world datasets, there isn't a distinct elbow inflection point to clearly identify the optimal 'K'. This ambiguity can lead to selecting an incorrect number of clusters. From the graph, we can deduce that the optimal number of clusters is likely 2, but it could also be 6 or 7. For this reason, we will conduct a preliminary exploratory analysis of K-Means using scikit-learn, which allows us to quickly obtain results and visualise them with matplotlib.

Pre-Train Clustering Experiments

We will compare the clustering result plots obtained with 2, 3, 4 and 6 clusters (see the sketch below), to get an understanding of how the points are distributed.
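One way to reproduce this comparison is to fit scikit-learn's KMeans for each candidate k and plot the points projected onto the first two principal components; the PCA projection here is an assumption made purely for display.

# Exploratory comparison sketch: fit K-Means for several k and plot the clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(df.values)
fig, axes = plt.subplots(1, 4, figsize=(20, 4))
for ax, k in zip(axes, [2, 3, 4, 6]):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(df.values)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10, cmap='viridis')
    ax.set_title(f'k = {k}')
plt.show()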

Comparing the two clustering images, it seems that the most likely optimal number of clusters is 2, and we should consider inspecting certain data points within Cluster 0, as they fall outside the standard boundaries. After completing this preliminary analysis, we are now ready to train the model using SageMaker and then deploy an endpoint for making inferences and evaluating the segmentation of our dataset.

Clustering Data with Amazon SageMaker

The first thing to do is to convert the DataFrame to a NumPy array and upload it to S3 in the Protobuf format required by SageMaker.
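Assuming the preprocessed DataFrame is named df, the conversion can be as simple as the sketch below; the record_set() call used later takes care of the RecordIO-protobuf conversion and the S3 upload.

# Convert the preprocessed DataFrame to a float32 NumPy array for SageMaker K-Means
import numpy as np

data_np = df.values.astype(np.float32)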

We now have to specify the K-Means algorithm container and configure the training job.
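A minimal configuration sketch, assuming version 2 of the SageMaker Python SDK; the default bucket, output path, and k=2 are illustrative choices based on the earlier analysis.

# Configure the built-in K-Means estimator (SageMaker Python SDK v2)
import sagemaker
from sagemaker import KMeans

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

kmeans = KMeans(role=role,
                instance_count=1,
                instance_type='ml.m4.xlarge',
                output_path=f's3://{bucket}/kmeans-output',
                k=2,
                sagemaker_session=session)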

And now, we're all set to start training the model!

# Start training
kmeans.fit(kmeans.record_set(data_np))        

Keep in mind that this operation will require some time to complete.

After the model is trained, we can deploy an endpoint to perform inferences, and this can be done with just a single line of code!

# Deploy the model to an endpoint
kmeans_predictor = kmeans.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')        
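Once the endpoint is live, we can send the same NumPy array to it and extract a cluster label for each transaction. The snippet below is a sketch based on the standard record output of SageMaker's built-in K-Means; for large datasets the calls may need to be batched to stay within the endpoint payload limit.

# Query the endpoint and extract the predicted cluster label for each transaction
result = kmeans_predictor.predict(data_np)
cluster_labels = [int(r.label['closest_cluster'].float32_tensor.values[0]) for r in result]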

What Amazon offers out of the box is simply incredible! With just a few lines of code we can quickly and efficiently develop ML models without having to worry about the underlying infrastructure, accelerating the end-to-end ML workflow, from data preparation to model deployment, and empowering us to focus more on solving business problems and less on managing infrastructure and tooling.

Plot 2D and 3D Results

To visualize the clusters, we use Principal Component Analysis (PCA) to reduce the dimensionality of the dataset to 2 or 3 principal components. PCA transforms the data into a new coordinate system where the greatest variances lie on the first few axes (principal components). Below is the code snippet to visualize the 2D and 3D representations of the K-means clustering.
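The snippet below is a minimal sketch of such a visualization, assuming scikit-learn's PCA, matplotlib, and the cluster_labels extracted from the endpoint predictions above.

# 2D projection of the clusters
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(data_np)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels, s=10, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means clusters (2D PCA)')
plt.show()

# 3D projection of the clusters
X_3d = PCA(n_components=3).fit_transform(data_np)
ax = plt.figure(figsize=(8, 6)).add_subplot(projection='3d')
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=cluster_labels, s=10, cmap='viridis')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.title('K-Means clusters (3D PCA)')
plt.show()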

Here are the two graphical representations.

Conclusion

In this article, we covered the application of K-Means clustering and Principal Component Analysis (PCA) in analysing a dataset of gambling transactions. Our journey included various stages, from data preprocessing to cluster visualization, with the final step being the interpretation of results. The visualization provided valuable insights into the underlying structure of the data, revealing distinct clusters of transactions. It also highlighted potential cases for further inspection, as they fell outside the normal boundaries of the clusters.

If you're interested in accessing the complete Jupyter Notebook, please feel free to reach out to me!
