Decoding Customer Behavior: My Journey with RFM Analysis and K-Means Clustering
Venugopal Adep
AI Leader | General Manager at Reliance Jio | LLM & GenAI Pioneer | AI Evangelist
On my adventure through RFM analysis and K-Means clustering, I uncovered fascinating insights into customer behaviors, segmenting them into meaningful groups based on how recently, how often, and how much they purchase. This journey not only helped me understand customer patterns better but also paved the way for targeted marketing strategies. Next up, I plan to dive deeper into these clusters, tailoring specific approaches to engage each group effectively, enhancing customer satisfaction and loyalty.
Link to my code:
Import libraries
from datetime import datetime
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
RFM (Recency, Frequency, Monetary) analysis is an excellent method for understanding customer value in a retail context. This approach segments customers based on:
- Recency: how recently a customer made their last purchase
- Frequency: how often they purchase
- Monetary: how much they spend in total
To proceed with the RFM analysis and clustering, I'll first need to inspect the dataset to understand its structure and content. Let's start by loading the data and taking a look at the first few rows.
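A minimal sketch of that first look (using the same file path as the loading step further below); head() shows the first few rows and info() reports dtypes and non-null counts, flagging fields with missing values such as CustomerID:
import pandas as pd

# Load the raw transactions and peek at the first few rows
file_path = '/content/online_retail_data.csv'
retail_data = pd.read_csv(file_path)
print(retail_data.head())

# Summary of column dtypes and non-null counts
retail_data.info()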
The dataset contains transaction-level records; the columns used in this analysis are InvoiceNo, InvoiceDate, Quantity, UnitPrice, and CustomerID.
Before proceeding with the RFM analysis, I'll perform some data cleaning and preprocessing. This includes:
- Checking and handling missing values, especially in crucial fields like CustomerID.
- Ensuring the correctness of data types, especially for dates and numeric fields.
- Creating a new column for the total amount spent per transaction (Quantity * UnitPrice).
Let's start with these steps.
Efficient Data Preparation: The First Step in Retail Data Analysis
In this code, I begin by loading a dataset of retail transactions and then embark on cleaning and restructuring the data. This involves converting 'InvoiceDate' to a usable datetime format, removing transactions with negative quantities, calculating the total price of each transaction, and ensuring that each record has a valid 'CustomerID'.
# Load the dataset
file_path = '/content/online_retail_data.csv'
retail_data = pd.read_csv(file_path)
# Convert 'InvoiceDate' to datetime
retail_data['InvoiceDate'] = pd.to_datetime(retail_data['InvoiceDate'], format='%d/%m/%y %H:%M')
retail_data = retail_data[retail_data['Quantity'] > 0] # Remove negative quantities
retail_data['TotalPrice'] = retail_data['Quantity'] * retail_data['UnitPrice']  # total spend per transaction line
# Keep only rows with a valid CustomerID and cast it to an integer
retail_data = retail_data.dropna(subset=['CustomerID'])
retail_data['CustomerID'] = retail_data['CustomerID'].astype(int)
Unveiling Customer Insights: The Heart of RFM Analysis
In this part of my code, I calculated the RFM (Recency, Frequency, Monetary) metrics by first setting a reference date (one day after the latest purchase in the dataset) and then grouping the data by customer. For each customer, I found out how many days had passed since their last purchase (Recency), how many transactions they made (Frequency), and how much they spent in total (Monetary).
This process is key to understanding customer behavior in detail.
# RFM Calculation
reference_date = retail_data['InvoiceDate'].max() + pd.Timedelta(days=1)
rfm_data = retail_data.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (reference_date - x.max()).days,  # days since last purchase (Recency)
    'InvoiceNo': 'count',                                      # number of purchase records (Frequency)
    'TotalPrice': 'sum'                                        # total spend (Monetary)
}).rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency', 'TotalPrice': 'Monetary'})
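One design choice worth flagging: 'InvoiceNo': 'count' counts individual transaction rows, so an invoice with many line items inflates Frequency. If the intent is the number of distinct orders, a small variant (a sketch, not what the article ran) would swap in nunique:
# Variant: Frequency as the number of distinct invoices per customer (illustrative only)
rfm_alt = retail_data.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda x: (reference_date - x.max()).days),
    Frequency=('InvoiceNo', 'nunique'),   # distinct orders rather than line items
    Monetary=('TotalPrice', 'sum')
)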
Finding the Perfect Balance: Normalization and Cluster Count in Data Analysis
In my code, I first normalized the Recency, Frequency, and Monetary values using StandardScaler, making sure they are all on a comparable scale for K-Means clustering. Then, to find the ideal number of clusters, I used the Elbow Method: I plotted the within-cluster sum of squares (WCSS) against different cluster counts and looked for the 'elbow' point where the WCSS starts to plateau.
# Normalizing the RFM data for K-Means
scaler = StandardScaler()
rfm_normalized = scaler.fit_transform(rfm_data[['Recency', 'Frequency', 'Monetary']])
# Determining the optimal number of clusters using the Elbow Method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(rfm_normalized)
    wcss.append(kmeans.inertia_)
# Plotting the results to find the 'elbow'
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method to Determine Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
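Because the elbow can be ambiguous in practice, a complementary check (a small sketch I'm adding here, not part of the original run) is the silhouette score for each candidate cluster count; higher values indicate better-separated clusters:
from sklearn.metrics import silhouette_score

# Silhouette score for k = 2..10 as a second opinion on the elbow reading
for k in range(2, 11):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit_predict(rfm_normalized)
    print(f'k={k}: silhouette = {silhouette_score(rfm_normalized, labels):.3f}')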
Delving into Data: My K-Means Clustering Experience
In this part of my data journey, I applied K-Means Clustering to segment customers into four distinct groups based on their purchasing behavior. After configuring and running the K-Means algorithm on the normalized RFM data, I tagged each customer with their respective cluster, revealing intriguing patterns and groupings within the dataset.
# K-Means Clustering
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
clusters = kmeans.fit_predict(rfm_normalized)
rfm_data['Cluster'] = clusters
rfm_data
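To eyeball how the four groups separate, a quick scatter of two of the RFM dimensions coloured by cluster label helps (a minimal sketch; Recency vs. Monetary is just one possible pairing):
# Scatter of Recency vs. Monetary, coloured by cluster assignment
plt.figure(figsize=(10, 6))
plt.scatter(rfm_data['Recency'], rfm_data['Monetary'], c=rfm_data['Cluster'], cmap='viridis', alpha=0.6)
plt.title('Customer Segments by Recency and Monetary Value')
plt.xlabel('Recency (days since last purchase)')
plt.ylabel('Monetary (total spend)')
plt.colorbar(label='Cluster')
plt.show()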
Decoding Customer Clusters: My Analysis Breakdown
In this final part of my data exploration, I grouped the customers into clusters and calculated the average recency, frequency, and monetary values for each cluster. This step was like taking a closer look at each group, understanding their unique shopping patterns. Finally, I counted the number of customers in each cluster, giving me a complete picture of how these groups were distributed in my dataset.
# Analyzing the Clusters
cluster_analysis = rfm_data.groupby('Cluster').agg({
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': 'mean'
}).sort_values(by='Cluster', ascending=True)
cluster_analysis['Count'] = rfm_data.groupby('Cluster').size()
cluster_analysis
Cluster Analysis
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
Insights
This clustering provides a nuanced view of different customer behaviors, which can inform targeted marketing strategies and customer engagement initiatives.
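One way to act on these segments (a hypothetical sketch; the segment names below are mine, assigned purely by ranking clusters on average spend, not labels from the analysis above) is to attach a readable name to each cluster and export per-segment customer lists for campaign targeting:
# Rank clusters by average spend and attach illustrative names (hypothetical labels)
spend_rank = cluster_analysis['Monetary'].rank(ascending=False).astype(int)   # 1 = highest average spend
segment_names = {1: 'Top Spenders', 2: 'Steady Spenders', 3: 'Occasional Buyers', 4: 'Low Spenders'}
cluster_to_segment = spend_rank.map(segment_names)        # cluster label -> segment name
rfm_data['Segment'] = rfm_data['Cluster'].map(cluster_to_segment)

# One customer list per segment, ready for targeted campaigns
for segment, group in rfm_data.groupby('Segment'):
    group.to_csv(f"segment_{segment.lower().replace(' ', '_')}.csv")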