Clustering Players Game Transactions with Amazon SageMaker

In this article, we will explore a practical application of K-Means clustering combined with Principal Component Analysis (PCA) to analyze and visualize a dataset of gambling transactions. We'll explain each step of the process, including data preprocessing, clustering, and visualization, providing detailed insights into how these techniques work and how they can be applied to real-world data. We will also discuss the potential applications of clustering in identifying patterns, detecting fraud, and managing risk in gambling transactions.

Amazon SageMaker and SageMaker Studio will be our main tools throughout this article.

Dataset Overview

Our dataset (Online Gaming Transaction History), available on Kaggle, consists of gambling transaction records with the following structure:

  • Date/Time: The date and time of the transaction.
  • Category: The category of the transaction (Banking, Bonus, Gaming, Lottery).
  • Transaction Type: The type of transaction (Stake, Win, Deposit, Activated, Expired, Purchase, Prize, Payment).
  • Wallet: The type of wallet used (Cash, Casino).
  • Amount: The amount involved in the transaction.
  • End Balance: The balance after the transaction.
  • Description: The description of the transaction.

The dataset does not include Player IDs, so we will cluster the transactions based on their features.

K-Means Clustering

The algorithm we're using here is K-Means, a popular unsupervised machine learning algorithm for partitioning a dataset into a set of distinct, non-overlapping groups (or clusters) based on their features. The main goal of K-Means is to divide the data points into clusters such that points within a cluster are more similar to each other than to points in other clusters.

Amazon SageMaker provides a built-in K-Means algorithm designed to simplify the process of clustering large datasets. With this algorithm, users can perform scalable and efficient clustering tasks without the need to manage underlying infrastructure or worry about optimization techniques. SageMaker's K-Means algorithm utilizes distributed computing to handle large volumes of data, enabling faster processing times and improved scalability. It works by first randomly selecting a predefined number of cluster centroids within the data space. Then, it iteratively assigns each data point to the nearest centroid, forming initial clusters, and recalculates each centroid as the mean of the data points assigned to it. The resulting clusters represent groups of data points that are closer to each other than to points in any other cluster.
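To make the assign-and-recompute cycle concrete, here is a minimal NumPy sketch of the plain (non-distributed) K-Means loop. It is for illustration only and is not the code behind SageMaker's built-in implementation.

# Minimal, illustrative K-Means loop (plain NumPy, not SageMaker's distributed version)
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids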

Data Preprocessing

Before applying K-Means, we need to preprocess the data to ensure it is in the right format for clustering; a code sketch of these steps follows the list below.

  1. Convert Date / Time feature to datetime: This allows us to extract time-based features.
  2. Encode Categorical Features: Convert categorical variables into numerical values using LabelEncoder.
  3. Extract Time-Based Features: Create new features such as Hour, Day, Month, and DayOfWeek from the Date / Time column.
  4. Drop the original Date/Time column: It’s no longer needed after extracting time-based features.
  5. Drop the Description column: this column does not help with the clustering.
  6. Rename Columns: the Date / Time, Transaction Type, Amount (CAD), and End Balance columns will be renamed to a more standard format.
  7. Standardize Numerical Features: Normalize the Amount and End Balance features to have zero mean and unit variance using StandardScaler.
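For reference, here is one way these steps could look with pandas and scikit-learn. This is a sketch that assumes the raw data is loaded into a DataFrame named df and that the original column names match those listed in the dataset overview.

# Preprocessing sketch (assumes a DataFrame df with the original column names)
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Rename columns to a more standard format
df = df.rename(columns={'Date / Time': 'DateTime', 'Transaction Type': 'TransactionType',
                        'Amount (CAD)': 'Amount', 'End Balance': 'EndBalance'})

# Convert to datetime and extract time-based features, then drop the original column
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['Hour'] = df['DateTime'].dt.hour
df['Day'] = df['DateTime'].dt.day
df['Month'] = df['DateTime'].dt.month
df['DayOfWeek'] = df['DateTime'].dt.dayofweek
df = df.drop(columns=['DateTime', 'Description'])

# Encode categorical features as integers
for col in ['Category', 'TransactionType', 'Wallet']:
    df[col] = LabelEncoder().fit_transform(df[col])

# Standardize the numerical features to zero mean and unit variance
df[['Amount', 'EndBalance']] = StandardScaler().fit_transform(df[['Amount', 'EndBalance']])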

After data preprocessing and prior to applying the StandardScaler, our dataset looks as follows:

Determine the Optimal Number of Clusters Using the Elbow Method

To determine the optimal number of clusters, we use the elbow method, which involves running K-Means for a range of k values and plotting the sum of squared distances (inertia) for each k. The "elbow" point in the plot indicates the optimal k, where adding more clusters does not significantly reduce the inertia.
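A possible way to produce the elbow plot with scikit-learn and matplotlib, assuming the preprocessed DataFrame df from the previous step, is sketched below.

# Elbow method sketch: run K-Means for a range of k and plot the inertia
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

k_values = range(1, 11)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(df.values)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (sum of squared distances)')
plt.title('Elbow Method')
plt.show()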

In our case, as is common with most real-world datasets, there isn't a distinct elbow inflection point to clearly identify the optimal 'K'. This ambiguity can lead to selecting an incorrect number of clusters. From the graph, we can deduce that the optimal number of clusters is likely 2, but it could also be 6 or 7. For this reason, we will conduct a preliminary exploratory analysis of K-Means using scikit-learn, which allows us to quickly obtain results and visualise them with matplotlib.

Pre-Train Clustering Experiments

We will compare the clustering result plots obtained with 2, 3, 4 and 6 clusters (see the sketch below), to get an understanding of how the points are distributed.
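One way to reproduce this comparison is to fit scikit-learn's KMeans for each candidate k and plot the points projected onto the first two principal components; the PCA projection here is an assumption made purely for display.

# Exploratory comparison sketch: fit K-Means for several k and plot the clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(df.values)
fig, axes = plt.subplots(1, 4, figsize=(20, 4))
for ax, k in zip(axes, [2, 3, 4, 6]):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(df.values)
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10, cmap='viridis')
    ax.set_title(f'k = {k}')
plt.show()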

Comparing the two clustering images, it seems that the most likely optimal number of clusters is 2, and we should consider inspecting certain data points within Cluster 0, as they fall outside the standard boundaries. After completing this preliminary analysis, we are now ready to train the model using SageMaker and then deploy an endpoint for making inferences and evaluating the segmentation of our dataset.

Clustering Data with Amazon SageMaker

The first thing to do is to convert the DataFrame to a NumPy array and upload it to S3 in the Protobuf format required by SageMaker.
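Assuming the preprocessed DataFrame is named df, the conversion can be as simple as the sketch below; the record_set() call used later takes care of the RecordIO-protobuf conversion and the S3 upload.

# Convert the preprocessed DataFrame to a float32 NumPy array for SageMaker K-Means
import numpy as np

data_np = df.values.astype(np.float32)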

We now have to specify the K-Means algorithm container and configure the training job.
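A minimal configuration sketch, assuming version 2 of the SageMaker Python SDK; the default bucket, output path, and k=2 are illustrative choices based on the earlier analysis.

# Configure the built-in K-Means estimator (SageMaker Python SDK v2)
import sagemaker
from sagemaker import KMeans

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()

kmeans = KMeans(role=role,
                instance_count=1,
                instance_type='ml.m4.xlarge',
                output_path=f's3://{bucket}/kmeans-output',
                k=2,
                sagemaker_session=session)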

And now, we're all set to start training the model!

# Start training
kmeans.fit(kmeans.record_set(data_np))        

Keep in mind that this operation will require some time to complete.

After the model is trained, we can deploy an endpoint to perform inferences, and this can be done with just a single line of code!

# Deploy the model to an endpoint
kmeans_predictor = kmeans.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')        
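Once the endpoint is live, we can send the same NumPy array to it and extract a cluster label for each transaction. The snippet below is a sketch based on the standard record output of SageMaker's built-in K-Means; for large datasets the calls may need to be batched to stay within the endpoint payload limit.

# Query the endpoint and extract the predicted cluster label for each transaction
result = kmeans_predictor.predict(data_np)
cluster_labels = [int(r.label['closest_cluster'].float32_tensor.values[0]) for r in result]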

What Amazon offers out of the box is simply incredible! With just a few lines of code we can quickly and efficiently develop ML models without having to worry about the underlying infrastructure, accelerating the end-to-end ML workflow, from data preparation to model deployment, and empowering us to focus more on solving business problems and less on managing infrastructure and tooling.

Plot 2D and 3D Results

To visualize the clusters, we use Principal Component Analysis (PCA) to reduce the dimensionality of the dataset to 2 or 3 principal components. PCA transforms the data into a new coordinate system where the greatest variances lie on the first few axes (principal components). Below is the code snippet to visualize the 2D and 3D representations of the K-means clustering.
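The snippet below is a minimal sketch of such a visualization, assuming scikit-learn's PCA, matplotlib, and the cluster_labels extracted from the endpoint predictions above.

# 2D projection of the clusters
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(data_np)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels, s=10, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means clusters (2D PCA)')
plt.show()

# 3D projection of the clusters
X_3d = PCA(n_components=3).fit_transform(data_np)
ax = plt.figure(figsize=(8, 6)).add_subplot(projection='3d')
ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=cluster_labels, s=10, cmap='viridis')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.title('K-Means clusters (3D PCA)')
plt.show()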

Here are the two graphical representations.

Conclusion

In this article, we covered the application of K-Means clustering and Principal Component Analysis (PCA) in analysing a dataset of gambling transactions. Our journey included various stages, from data preprocessing to cluster visualization, with the final step being the interpretation of results. The visualization provided valuable insights into the underlying structure of the data, revealing distinct clusters of transactions. It also highlighted potential cases for further inspection, as they fell outside the normal boundaries of the clusters.

If you're interested in accessing the complete Jupyter Notebook, please feel free to reach out to me!
