Customer Segmentation using RFM Analysis in Python
Capstone Project, General Assembly, London, Feb 2020
Dataset: https://archive.ics.uci.edu/ml/datasets/online+retail
Introduction
In Retail and E-Commerce (B2C), and more broadly in B2B, one of the key elements shaping a firm's business strategy is its understanding of customer behaviour. More specifically, understanding its customers through different business metrics: how much they spend (revenue), how often they spend (frequency), whether they are new or existing customers, what their favourite products are, etc... Such understanding in turn helps marketing, sales, account management and product teams better support these customers and improve the product offering.
The objective of this project has been to:
Identify customer segments based on the overall buying behaviour of the client base.
Stakeholders
Typically, the stakeholder groups within the firm that would benefit from such analysis and insights are:
- Product/Services: Knowing which products sell more than others presents an opportunity to evaluate the product offering or improve specific product features.
- Operations/Logistics: From stock management perspective, understanding which products are in demand would reduce storage costs and improve delivery/logistics operations.
- Marketing: Understanding the customer segments allows for more effective and targeted marketing to specific customer groups, by creating a base campaign with core content for the broader client base and specific variations depending on the segment.
- Sales/Account Management: Identifying which customers are the most valuable and understanding their trends would go a long way towards building a genuine relationship, thus retaining existing customers and attracting new ones with the ideal customer profile.
Dataset
Online Retail Data Set https://archive.ics.uci.edu/ml/datasets/online+retail
The dataset used is fairly popular and contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail e-commerce business. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
Outcome
The outcome of this project would be:
Create an unsupervised model that generates the optimum number of segments for the customer base.
Success Criteria
In order to make sure we understand the impact and success of the project, we've set a measurable criterion for our analysis:
Segments generated can be interpreted and transposed into business actions.
Exploratory Data Analysis - EDA
Load Data
Let's start by importing the key libraries and the dataset, which has been downloaded from the link above and stored locally in an offline directory.
# Import key libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime

# Load data from local repository
retail = pd.read_excel('../offline-datasets/Online Retail.xlsx')
Data Overview
Once we've loaded the dataset into the "retail" dataframe, we can start reviewing the data:
Following a quick inspection and scan of our data set, there are a number of issues that have been identified:
InvoiceNo
- Most Invoices appear as normal transactions with positive quantity and prices, but there are some prefixed with "C" or "A" which denote different transaction types.
- InvoiceNo needs to be broken down to Number and Code (New Feature).
- Codes will need to be renamed where appropriate (Data Processing).
CustomerID
- Empty or null values in ~135k records.
- As such, we need to identify the active population and drop the empty records (Data Processing).
Description
- Empty or null values in ~1.5k records.
- Again, we need to identify the active population and drop the empty records (Data Processing).
Quantity
- Negative values in ~10k records.
- Here, we also need to identify the active population and drop records based on specific rules (Data Processing).
UnitPrice
- Negative values in ~2 records.
- As per above, identify the active population and drop records based on specific rules (Data Processing).
Dates
- Need to extract the date from InvoiceDate as a new feature, plus month and year as separate new features (New Feature).
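For reference, a quick inspection along these lines would surface the issues summarised above (a sketch using standard pandas calls; the exact notebook output is not reproduced here):

# Shape, dtypes and non-null counts per column
retail.info()

# Null values per column (CustomerID ~135k, Description ~1.5k)
print(retail.isnull().sum())

# Negative quantities and prices
print((retail.Quantity < 0).sum())
print((retail.UnitPrice < 0).sum())

# Non-numeric invoice prefixes such as "C" or "A"
print(retail.InvoiceNo.astype(str).str[0].value_counts().head())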
Feature Engineering
Before proceeding with more EDA and processing, let's summarise the new features identified in the initial data review above, which can be derived quite straightforwardly from the existing dataset:
1) Break down InvoiceNo to
- InvoiceNumber - new feature
- InvoiceCode - new feature
C – Cancellation
A – Adjustment
Null – Rename to “N” – Normal
2) Break down InvoiceDate to
- InvoiceYearMonth - new feature
- InvoiceYear - new feature
- InvoiceMonth - new feature
- Date - Use as Index
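A minimal sketch of how these features could be derived (assuming the raw InvoiceNo and InvoiceDate columns from the dataset; the original notebook may differ in details):

# Split InvoiceNo into a numeric part and a one-letter transaction code
retail['InvoiceNo'] = retail['InvoiceNo'].astype(str)
retail['InvoiceNumber'] = retail['InvoiceNo'].str.extract(r'(\d+)', expand=False)
retail['InvoiceCode'] = retail['InvoiceNo'].str.extract(r'^([A-Za-z])', expand=False).fillna('N')

# Derive date features from InvoiceDate and use the date itself as the index
retail['InvoiceYearMonth'] = retail['InvoiceDate'].dt.year * 100 + retail['InvoiceDate'].dt.month
retail['InvoiceYear'] = retail['InvoiceDate'].dt.year
retail['InvoiceMonth'] = retail['InvoiceDate'].dt.month
retail['Date'] = pd.to_datetime(retail['InvoiceDate'].dt.date)
retail.set_index('Date', inplace=True)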
The rest of the Data Analysis and pre-processing will involve:
- Defining Active population based on feature rules
- Creating Masks for filtering out bad data (empty or Null Values, negative prices, etc...) that don't contribute towards the active population
- Visualising some of our data to get a better idea of the population
Active Population
Now that new features are in place and based on the initial data review summary, we need to derive an active population for our modelling stage:
- A tree based rule set has been defined using the features from the dataset.
- Each feature has one or more rules that define its active population.
- Rules are represented by different masks.
Masks for data processing
Creating various Masks to implement the rules per feature for narrowing down the active population:
# Valid Price
valid_price = retail.UnitPrice >= 0

# Valid Description
valid_desc = retail.Description.notnull()

# Valid CustomerID
valid_CID = retail.CustomerID.notnull()

# Invoice type-N (Normal)
inv_N = retail.InvoiceCode == "N"

# Invoice type-C (Cancellation)
inv_C = retail.InvoiceCode == "C"

# Invoice type-A (Adjustment)
inv_A = retail.InvoiceCode == "A"

# Negative Quantity
q_neg = retail.Quantity < 0

# Positive Quantity
q_pos = retail.Quantity >= 0
Finally, we apply the masks to the original dataset, navigating down the tree to arrive at the leaf nodes, which form the active population.
A new dataframe is created to save separately as "retail_pp".
# Path 1 - Filter population down to include all
# valid Customer IDs with valid Price and Description
p1 = valid_price & valid_desc & valid_CID

# Path 2 - Filter population down to include all
# Normal (type-N) transactions with positive Quantities
p2 = inv_N & q_pos

# Path 3 - Filter population down to include all
# Cancellation (type-C) or Adjustment (type-A) transactions
# with negative Quantities
p3 = (inv_A | inv_C) & q_neg

# Path to leaves: combine Paths 1, 2 and 3
# *************** CREATE A COPY ************
retail_pp = retail.loc[p1 & (p2 | p3)].copy()

# Inspect new pre-processed and clean dataset
retail_pp.head()
Number of Orders (Invoices) Visualisations
There is a separate section below on Key Performance Indicators, where we capture, analyse and visualise most of the data. In this section, we would like to present two visuals: the number of orders per country and the number of products per invoice.
As we can see, most orders are from the United Kingdom. While we could have kept only the UK data, we decided to retain all active data within the active population rather than dropping data from other countries.
In the bottom graph, there is a small tail of data around the 50-100 mark that we would like to explore further. As such, we created the top graph, which zooms in on the 100 largest orders to visualise this better.
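For reference, the two visuals can be produced along these lines (a sketch assuming the Country and StockCode columns from the dataset; the original styling may differ):

# Orders per country: unique invoices per Country
retail_pp.groupby('Country')['InvoiceNo'].nunique().sort_values().plot.barh(
    figsize=(8, 8), title='Number of Orders per Country');

# Products (line items) per invoice
items_per_invoice = retail_pp.groupby('InvoiceNo')['StockCode'].count()

# Bottom graph: full distribution of products per invoice
plt.figure()
items_per_invoice.hist(bins=50);

# Top graph: zoom into the 100 largest orders
plt.figure()
items_per_invoice.nlargest(100).reset_index(drop=True).plot.bar(title='Top 100 largest orders');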
Orders summary
Top graph
There is a small number of invoices (fewer than 200) that each contain a large number of products, e.g.:
- A single invoice has 1100 products
- Another single invoice has 800 products
- Four invoices have 700 products
- In total, up to 200 invoices have more than 100 products each
Bottom graph
Most invoices (24,000+) contain up to 100 products each, with the exception of a small population (<200) that have more than 100 products each.
Key Performance Indicators
Transaction KPIs
This involves analysis around customer transaction data grouped by month. i.e. Revenue, Revenue growth, Order numbers, revenue per order, etc...
Monthly Revenue
The formula for calculating the revenue on a monthly basis is shown below:
- Calculate Order Value for each row by multiplying Unit Price with Quantity
- Group by Year and Month using the respective elements of the index of the dataframe
- Sum Order Values grouped by Year and Month
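The first step, creating the Revenue column, is a one-liner, as shown below; the grouping and summing follow in the KPI dataframe code further down. (This Revenue column is reused throughout the KPI and RFM sections.)

# Order value (Revenue) per row = Unit Price x Quantity
retail_pp['Revenue'] = retail_pp['UnitPrice'] * retail_pp['Quantity']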
Monthly Revenue Growth - Month on month percentage comparison of Revenue change.
Monthly Order Count - Total quantity of items ordered per grouping.
Average revenue per order - Average order value per monthly grouping.
Customer KPIs
This involves analysis around number of active customers and segregation between new and old customers from a Revenue perspective:
Active Customers - Unique count of Customer IDs per grouping.
Revenue of New vs Old Customer - Define new feature to distinguish Revenue from New or Existing Customers.
KPI Dataframes
In order to capture the key monthly KPIs, the main retail_pp DataFrame has been used. The new dataframe was generated as a multi-index one by:
- Grouping retail_pp based on Month and Year and creating an index based on these
- Renaming the index levels to Year and Month
- Generating and appending each new KPI on the new dataframe
- Plotting the KPIs using Line or Graph bars as required
# 1. Revenue - Generate Monthly KPIs DataFrame by aggregating Revenue
m_kpis = pd.DataFrame(retail_pp.groupby([retail_pp.index.year, retail_pp.index.month])['Revenue'].sum())

# 2. Generate Monthly Growth rate based on previous month's revenue
m_kpis['MonthlyGrowth'] = m_kpis['Revenue'].pct_change()

# 3. Generate Active Customers
m_kpis['ActiveCustomers'] = pd.DataFrame(retail_pp.groupby([retail_pp.index.year, retail_pp.index.month])['CustomerID'].nunique())

# 4. Generate Monthly Order Count (Quantity)
m_kpis['MonthlyOrderCount'] = pd.DataFrame(retail_pp.groupby([retail_pp.index.year, retail_pp.index.month])['Quantity'].sum())

# 5. Generate Monthly Order Average
m_kpis['MonthlyOrderAverage'] = pd.DataFrame(retail_pp.groupby([retail_pp.index.year, retail_pp.index.month])['Revenue'].mean())

# Rename index levels to capture Year and Month
m_kpis.index.set_names(['Year', 'Month'], inplace=True)
m_kpis
The new multi-index dataframe generated for the monthly KPIs looks like below:
For the Customer KPIs dataframe, in order to differentiate between existing and new customers, we:
- Created a new dataframe that captures the first purchase date of each customer grouped by CustomerID
- Merged the new feature (First Purchase Date) to the existing retail_pp dataframe based on CustomerID
- Created a new feature (User Type) in the retail_pp dataframe based on whether the Invoice Date of the current month falls before or after the First Purchase Date
- Created the Customer KPIs dataframe by grouping by Year, Month and UserType and summing the Revenue per month
# Generate new dataframe based on CustomerID and its first purchase date
customer_first_purchase = retail_pp.groupby('CustomerID').InvoiceDate.min().reset_index()
customer_first_purchase.columns = ['CustomerID', 'FirstPurchaseDate']
customer_first_purchase['FirstPurchaseYearMonth'] = customer_first_purchase['FirstPurchaseDate'].map(lambda date: 100*date.year + date.month)

# Add first purchase date column to the new Customer dataframe by merging with the original retail_pp
retail_ppp = pd.merge(retail_pp, customer_first_purchase, on='CustomerID')
retail_ppp['Date'] = pd.to_datetime(retail_ppp.InvoiceDate.dt.date)
retail_ppp.set_index('Date', inplace=True)
retail_ppp.head()

# Create new column "UserType" defaulting to "New". Assign "Existing" if the
# user's FirstPurchaseYearMonth was before the current InvoiceYearMonth
retail_ppp['UserType'] = 'New'
retail_ppp.loc[retail_ppp['InvoiceYearMonth'] > retail_ppp['FirstPurchaseYearMonth'], 'UserType'] = 'Existing'

# Calculate the Revenue per month for each user type
customer_kpis = pd.DataFrame(retail_ppp.groupby([retail_ppp.index.year, retail_ppp.index.month, retail_ppp.UserType])['Revenue'].sum())
customer_kpis.index.set_names(['Year', 'Month', 'UserType'], inplace=True)
customer_kpis
Our new dataframe looks like below:
Visualisations
Structuring our data in a multi-index dataframe makes it easy to plot in various forms, such as the line plots and bar plots shown below.
As per the diagram below, during April there was a significant drop in Revenue, beyond the typical variation seen during June or July of the same year. Also worth noting is that the sudden drop at the end of the graph is due to a lack of data in December rather than an actual business trend.
# Line graph to capture the monthly revenue
m_kpis.plot.line(y='Revenue', title='Revenue per month', color='r', marker='h',
                 markersize=10, linewidth=2, markerfacecolor='white');
Reviewing the rate-of-growth bar graph confirms the drop during April as the biggest spike of negative growth.
# Horizontal bar graph to capture monthly revenue growth
m_kpis.plot.barh(y='MonthlyGrowth', title='Monthly Growth', rot=0);
Again, the monthly order count graph shows a consistent picture of the sales of goods: during April there was a significant drop in the quantities ordered, while the rest of the graph shows the organic growth of the business.
# Line graph to capture the number of quantities ordered per month
m_kpis.plot.line(y='MonthlyOrderCount', title='Quantities ordered per Month', color='b',
                 marker='h', markersize=10, linewidth=2, markerfacecolor='white');
Finally, the monthly order average graph explores the average value of an order and its variation during the research window. There seems to be: a) small variation between orders, b) a slight increase in the average order value over time, and c) a significant drop in order value during April compared with the average value in the neighbouring months.
# Horizontal bar graph to capture the average order value per month
m_kpis.plot.barh(y='MonthlyOrderAverage', title='Monthly Order Average value');
On the Customer KPIs, it's interesting to confirm the same picture: organic growth of the customer base, with a similar drop in April.
# Line graph to capture the active customers per month
m_kpis.plot.line(y='ActiveCustomers', title='Active customers per month', color='g',
                 marker='h', markersize=10, linewidth=2, markerfacecolor='white');
In the diagram below we can clearly see a healthy increase in the existing customer base. Regarding new customers, following a natural decrease after the initial customer acquisition, revenue stabilises at a healthy, steady amount (circa $100,000).
# Plot new vs existing customers
customer_kpis.unstack().plot.bar(y='Revenue', title='New vs Existing Customer Monthly Revenue', rot=0);
Modelling using RFM Analysis
In order to do Customer Segmentation, the RFM modelling technique has been used. RFM stands for Recency - Frequency - Monetary Value with the following definitions:
- Recency - Given an observation date (current or a specific date in the past), Recency captures how long ago the customer last made a transaction.
- Frequency - Given a specific time window, Frequency captures the number of transactions the customer did during that window.
- Monetary Value or Revenue - Given a specific window, Revenue captures the amount spent by the customer.
Based on the values above, one can construct segments that classify customers into:
Low value Segment
- R - Less active
- F - Not very frequent
- M - Low spend
Mid value Segment
- R - Medium or inconsistent activity
- F - Infrequent, or frequent only at specific times
- M - varied spending but overall in the mid of the scale
High value Segment
- R - Very active
- F - Very frequent
- M - Great spending profile
Modelling Approach
Going forward the approach for deriving customer segments would be:
- Calculate RFM scores individually and plot them to get an idea of their RFM values and ranges
- Use the K-Means unsupervised clustering algorithm to create the 3 segments above
- Use techniques to optimise the number of clusters based on Silhouette and Inertia scores
Recency
To calculate Recency, we first start by creating a new data frame to hold all the RFM scores per customer by grouping per CustomerID.
Then we create another dataframe based on the maximum Invoice Date grouped by CustomerID and fix an observation point (here, the last invoice day in the dataset) as a line-in-the-sand point.
Finally, we generate the Recency in days by subtracting each customer's Last Purchase Date from the Last Invoice Date (the line-in-the-sand point):
As per code below, we merge the two dataframes on CustomerID to create the Recency feature in our new Customer Dataframe.
# Generate new dataframe based on unique CustomerID to keep track of RFM scores
customer = pd.DataFrame(retail_pp['CustomerID'].unique())
customer.columns = ['CustomerID']

# Generate new dataframe based on the latest Invoice date per customer (groupby CustomerID)
recency = retail_ppp.groupby('CustomerID').InvoiceDate.max().reset_index()
recency.columns = ['CustomerID', 'LastPurchaseDate']

# Set observation point as the last invoice date in the dataset
LastInvoiceDate = recency['LastPurchaseDate'].max()

# Generate Recency in days by subtracting each customer's Last Purchase Date from the Last Invoice Date
recency['Recency'] = (LastInvoiceDate - recency['LastPurchaseDate']).dt.days

# Consolidate to customer DataFrame
customer = pd.merge(customer, recency[['CustomerID', 'Recency']], on='CustomerID')
customer.head()
Frequency
The Frequency metric reflects the number of orders per customer, so a simple count of the invoices grouped by Customer ID will do:
# Count number of invoices per CustomerID and store in new frequency DataFrame
frequency = retail_ppp.groupby('CustomerID').InvoiceDate.count().reset_index()
frequency.columns = ['CustomerID', 'Frequency']

# Consolidate Frequency to existing Customer DataFrame
customer = pd.merge(customer, frequency, on='CustomerID')
Monetary Value (Revenue)
For Revenue, the grouping goes by CustomerID, and the aggregation function is Sum.
# Revenue per transaction has already been calculated as per the KPIs section
# Grouping revenue per Customer ID
revenue = retail_ppp.groupby('CustomerID').Revenue.sum().reset_index()

# Consolidate Revenue to existing Customer DataFrame
customer = pd.merge(customer, revenue, on='CustomerID')
customer.head()
The dataframe now looks like below:
If we plot the histograms for each of the R, F and M values we get the following:
Note that there are some negative values in Revenue; these are explained by returns or cancellations of products and have been intentionally kept within the active population, as long as they meet the rest of the criteria set by the masks created earlier.
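A sketch of how these histograms can be produced from the customer dataframe built above:

# Histograms of the three RFM features
customer[['Recency', 'Frequency', 'Revenue']].hist(bins=50, figsize=(12, 4), layout=(1, 3));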
If we create a Scatter plot of the pairs we can visualise the relationship between the R, F, M variables and see if there are any obvious segments:
There's clearly some distinction between high value segments and low value ones, but the difficulty in inspecting such scatter plots visually is finding the ideal boundaries and also identifying the middle segment.
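A sketch of the pairwise scatter plots, here using seaborn's pairplot (one of several ways to produce them):

# Pairwise scatter plots of the R, F, M variables
sns.pairplot(customer[['Recency', 'Frequency', 'Revenue']]);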
K-Means Clustering
For that purpose, we will use the power of K-Means clustering, an unsupervised learning method that can help us identify a specific number of clusters, even without labels.
The steps followed are:
- Set (arbitrarily) the number of clusters: here we'll aim for K=3
- Normalise the R, F, M values of data using StandardScaler method
- Train the model and review the results (centroids and assigned clusters)
- Visualise results
- Calculate evaluation metrics (Inertia & Silhouette score)
- Optimise K based on multiple runs for a range of Ks and evaluate using the Elbow graph
- Review final customer segments
Here's the code that implements steps 1-3:
# Number of clusters (configurable) - initially we're attempting the 3 segments (Low, Mid, High)
k = 3

# Input features: the R, F, M values from the customer dataframe
X = customer[['Recency', 'Frequency', 'Revenue']]

# Scaling input using StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Xstd = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Run and train main K-Means algorithm based on all input features
from sklearn.cluster import KMeans
model = KMeans(n_clusters=k, random_state=0)
model.fit(Xstd)

# Review output cluster labels
cluster_labels = model.labels_
print("Assigned cluster labels: \n", cluster_labels)

# Review centroids of clusters
centroids = model.cluster_centers_
print("Location of centroids: ")
print(centroids)

# Append clusters to input features table
Xstd['clusters'] = cluster_labels
Xstd.head()
Now, we can visualise the results and get an initial view of the segments:
We can now clearly see:
- Yellow - High value segment
- Green - Mid value segment
- Purple - Low value segment
To evaluate the performance of K-Means clustering algorithm we use the following metrics:
- Inertia - an absolute value; the rule of thumb is: the smaller the number, the better the fit.
- Silhouette Score - ranges between -1 and 1. Typically, a high score indicates that the population is well matched to its cluster.
Onto step 6, we'll try to optimise the results by running k-means for a range of Ks:
from sklearn import metrics

# Re-use the scaled R, F, M features, excluding the cluster labels appended earlier
X_input = Xstd[['Recency', 'Frequency', 'Revenue']]

inertias = []
ss = []

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X_input)
    inertias.append(kmeans.inertia_)
    ss.append(metrics.silhouette_score(X_input, kmeans.labels_, metric='euclidean'))
The elbow graph shows that we're just about right with 3 clusters, though you can also choose 4 or 5 (as a stretch).
Plotting the Silhouette score for a range of Ks also confirms that, while beyond 5 we get more accurate matching, the model probably over-fits and is thus not optimal:
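For reference, a sketch of how the Elbow and Silhouette graphs can be plotted from the scores collected in the loop above:

# Elbow graph: inertia against number of clusters
ks = range(2, 11)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Graph')

# Silhouette score against number of clusters
plt.subplot(1, 2, 2)
plt.plot(ks, ss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores')
plt.tight_layout();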
Finally, onto step 7: the new Customer Dataframe with the assigned segments and labels looks like below.
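A minimal sketch of how the cluster labels can be mapped back to named segments on the customer dataframe (the label-to-segment mapping below is hypothetical; cluster numbering is arbitrary per K-Means run, so check it against the centroids first):

# Append the cluster labels from the k=3 model to the customer dataframe
customer['Cluster'] = model.labels_

# Map cluster ids to named segments. NOTE: this mapping is illustrative only;
# inspect the centroids to decide which cluster is Low/Mid/High
segment_map = {0: 'Low Value', 1: 'Mid Value', 2: 'High Value'}
customer['Segment'] = customer['Cluster'].map(segment_map)
customer.head()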
If we plot the 3 different segments we get the following bar plot:
Business Actions
Once we have a clear view of the segments above, specific actions can and should be planned across the business (Operations, Marketing, Product, etc...) to address each group:
- High Value: There's a small number of these clients, and a key priority for the business would be to improve Retention of these customers, as they are its most valuable asset.
- Mid Value: There's a larger proportion of these customers, who should be treated as potentially High Value ones. With the right actions, aiming to increase Retention and Frequency and bring them closer to the brand and the product, they can eventually become High Value.
- Low Value: the vast majority will be in this category and the first step in addressing this group would be to aim to increase Frequency and understand if there are any potential issues around the product or service.
Summary
To summarise our journey so far in this project, we've followed the steps below:
- Data Review & Understanding - Explored the dataset to understand the features and missing values.
- Feature Engineering & Processing - Created new features to help us uncover better insights.
- KPIs & Visualizations - Created a broad set of KPIs to get a better business view of the data.
- RFM analysis & new features - Used RFM analysis to model the data.
- Unsupervised learning K-Means clustering - Used unsupervised learning to tell us about the various data clusters.
- Optimization & Evaluation - Ran K-Means clustering for a range of Ks to optimise the algorithm.
- Business Actions & Segments Mapping - Translated the results of Machine Learning into Business Actions
Next Steps
As a next step, the work being done here can be leveraged to predict Customer Lifetime Value (CLTV).
This can be done using the RFM features created here as input to a Supervised Machine Learning model with output being the total amount spent by the customer in their lifetime (Revenue).
I hope you've found this article useful, it was certainly a pleasure working through this project.
GitHub
The full Jupyter Notebook for this project can be found here.
References
https://towardsdatascience.com/data-driven-growth-with-python-part-1-know-your-metrics-812781e66a5b
https://www.datacamp.com/community/tutorials/introduction-customer-segmentation-python
About Angelo
I’m a freelance Principal Management Consultant and applied Data Scientist with broad experience implementing Data Transformation and delivering Technology Change on behalf of Vendors and Global Banks in Financial Services and Government.
Coming from a solid academic background in Robotics Engineering, skills like mathematics, statistics and programming (luckily) came off the shelf for me, coupled with the genuinely curious, analytical and inquisitive thinking of an Engineer.
This unique blend of technical skills and industry experience has helped me work with a broad range of business and technical teams in Technology and Data, helping them bridge gaps and bring critical projects over the line, delivering great value to their organisations.
My interest going forward is to explore how to use data to drive better insights for sales and digital marketing teams, in order to help them retain their existing client base and also expand further to new ones.
Feel free to drop me a line at:
LinkedIn - https://www.dhirubhai.net/in/etzimopoulos/
GitHub - https://github.com/etzimopoulos
Medium - https://medium.com/@etzimopoulos
#customersegmentation #rfmanalysis #datascience #customerretention #digitalmarketing #capstone #machinelearning #managementconsulting #businessconsulting