Customer Segmentation using RFM Analysis in Python

Capstone Project, General Assembly, London, Feb 2020

Dataset: https://archive.ics.uci.edu/ml/datasets/online+retail


Introduction

In Retail and E-Commerce (B2C), and more broadly in B2B, one of the key elements shaping a firm's business strategy is an understanding of customer behaviour: how much customers spend (revenue), how often they spend (frequency), whether they are new or existing customers, what their favourite products are, and so on. Such understanding in turn helps marketing, sales, account management and product teams support customers better and improve the product offering.

The objective of this project is to:

Identify customer segments based on the overall buying behaviour of the customer base.

Stakeholders

Typically, the stakeholder groups within the firm that would benefit from such analysis and insights are:

  1. Product/Services: Knowing which products sell more than others is an opportunity to evaluate the product offering or improve specific product features.
  2. Operations/Logistics: From a stock management perspective, understanding which products are in demand helps reduce storage costs and improve delivery/logistics operations.
  3. Marketing: Understanding the customer segments allows for more effective and targeted marketing to specific customer groups: a base campaign with core content for the broader client base, with specific variations depending on the segment.
  4. Sales/Account Management: Identifying the most valuable customers and understanding their trends goes a long way towards building a genuine relationship, thus retaining existing customers and attracting new ones with the ideal customer profile.

Dataset

Online Retail Data Set https://archive.ics.uci.edu/ml/datasets/online+retail

The dataset is a popular one containing all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered, non-store, online retail business. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.


Outcome

The outcome of this project is to:

Create an unsupervised model that generates the optimum number of segments for the customer base.

Success Criteria

In order to make sure we understand the impact and success of the project, we've set measurable criteria with regard to our analysis:

Segments generated can be interpreted and transposed into business actions.


Exploratory Data Analysis - EDA

Load Data

Let's start by importing key libraries and the dataset, which has been downloaded from the link above and stored locally in an offline directory.

# Import key Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime

# Load data from Local Repository
retail = pd.read_excel('../offline-datasets/Online Retail.xlsx')

Data Overview

Once we've loaded the dataset into the "retail" dataframe, we can start reviewing the data:

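A quick way to reproduce this overview (a minimal sketch):

# Column types, non-null counts and a sample of rows
retail.info()
retail.head()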

Following a quick inspection and scan of our dataset, a number of issues were identified:

InvoiceNo

  • Most invoices appear as normal transactions with positive quantities and prices, but some are prefixed with "C" or "A", which denote different transaction types.
  • InvoiceNo needs to be broken down into Number and Code (New Feature).
  • Codes will need to be renamed where appropriate (Data Processing).

CustomerID

  • Empty or Null values in around 135k records.
  • As such, we need to identify the active population and drop the empty records (Data Processing).

Description

  • Empty or Null values in around 1.5k records.
  • Again, we need to identify the active population and drop the empty records (Data Processing).

Quantity

  • Negative values in around 10k records.
  • Here too, we need to identify the active population and drop records based on specific rules (Data Processing).

UnitPrice

  • Negative values present in ~2 records.
  • As above, identify the active population and drop records based on specific rules (Data Processing).

Dates

  • Extract the date from InvoiceDate as a new feature, and also month and year as separate new features (New Feature).
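These observations can be reproduced with a few quick checks (a sketch):

# Null values per column (~135k missing CustomerID, ~1.5k missing Description)
print(retail.isnull().sum())

# Negative quantities (~10k) and negative unit prices (~2)
print((retail.Quantity < 0).sum(), (retail.UnitPrice < 0).sum())

# First character of InvoiceNo: "C" and "A" prefixes denote cancellations and adjustments
print(retail.InvoiceNo.astype(str).str[0].value_counts().head())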

Feature Engineering

Before proceeding with further EDA and processing, let's summarise some of the new features identified in the initial data review above, which can be derived quite straightforwardly from the existing dataset:

1) Break down InvoiceNo into:

  • InvoiceNumber - new feature
  • InvoiceCode - new feature

C – Cancellation

A – Adjustment

Null – Rename to “N” – Normal


2) Break down InvoiceDate into:

  • InvoiceYearMonth - new feature
  • InvoiceYear - new feature
  • InvoiceMonth - new feature
  • Date - Use as Index
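Here's a minimal sketch of how these features can be derived; the article doesn't show the exact implementation, so treat this as one possible version:

# Break down InvoiceNo into a code (C/A prefix, else "N" for Normal) and a number
retail['InvoiceNo'] = retail['InvoiceNo'].astype(str)
retail['InvoiceCode'] = retail['InvoiceNo'].str.extract(r'^([A-Za-z])', expand=False).fillna('N')
retail['InvoiceNumber'] = retail['InvoiceNo'].str.replace(r'^[A-Za-z]', '', regex=True)

# Break down InvoiceDate into year/month features and a date index
retail['InvoiceYearMonth'] = retail['InvoiceDate'].dt.year * 100 + retail['InvoiceDate'].dt.month
retail['InvoiceYear'] = retail['InvoiceDate'].dt.year
retail['InvoiceMonth'] = retail['InvoiceDate'].dt.month
retail['Date'] = pd.to_datetime(retail['InvoiceDate'].dt.date)
retail = retail.set_index('Date')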


The rest of the data analysis and pre-processing involves:

  • Defining Active population based on feature rules
  • Creating Masks for filtering out bad data (empty or Null Values, negative prices, etc...) that don't contribute towards the active population
  • Visualising some of our data to get a better idea of the population


Active Population

Now that new features are in place and based on the initial data review summary, we need to derive an active population for our modelling stage:

  1. A tree based rule set has been defined using the features from the dataset.
  2. Each feature has one or more rules that define its active population.
  3. Rules are represented by different masks.
[Image: tree of rules per feature defining the active population]

Masks for data processing

Creating various Masks to implement the rules per feature for narrowing down the active population:

# Valid Price
valid_price = retail.UnitPrice >= 0


# Valid Description
valid_desc = retail.Description.notnull()


# Valid CID
valid_CID = retail.CustomerID.notnull()


# Invoice type-N (Normal)
inv_N = retail.InvoiceCode == "N"


# Invoice type-C (Cancellation)
inv_C = retail.InvoiceCode == "C"


# Invoice type-A (Adjustment)
inv_A = retail.InvoiceCode == "A"


# Quantity Negative
q_neg = retail.Quantity < 0 


# Quantity Positive
q_pos = retail.Quantity >= 0  

Finally, we apply the masks to the original dataset, navigating down the tree to arrive at the leaf nodes, which form the active population.

The result is saved separately as a new dataframe, "retail_pp".

# Path1 - Filter population down to include all 
# valid Customer IDs with Valid Price and Description
p1 = valid_price & valid_desc & valid_CID


# Path2 - Filter population down to include all 
# Normal (type-N) transactions with Positive Quantities
p2 = inv_N & q_pos


# Path3 - Filter population down to include all 
# Cancel (type-C) or Adjust (type-A) transactions 
# with Negative Quantities
p3 = (inv_A | inv_C) & q_neg


# Path to Leafs: Combine Paths 1, 2 and 3:
# *************** CREATE A COPY ************
retail_pp = retail.loc[p1 & (p2 | p3)].copy()


# Inspect new pre-processed and clean dataset

retail_pp.head()

Number of Orders (Invoices) Visualisations

There is a separate section below on Key Performance Indicators, where we capture, analyse and visualise most of the data. In this section, we present two visuals: the number of orders per country and the number of products per invoice.

[Image: number of orders per country]

As we can see, most orders come from the United Kingdom. While we could have kept only UK data, we decided to retain all countries within the active population.

[Image: number of products per invoice – top 100 orders (top) and full distribution (bottom)]

In the bottom graph, there is a small tail of data around the 50-100 mark that we would like to explore further. The top graph therefore zooms in on the 100 largest orders to visualise this better.

Orders summary

Top graph

A small number of invoices (fewer than 200) contain a large number of products ordered in a single invoice, e.g.

  1. A single invoice has 1100 products
  2. Another single invoice has 800 products
  3. Four invoices have 700 products
  4. In total, up to 200 invoices have more than 100 products each

Bottom graph

Most invoices (24,000+) contain up to 100 products each, with the exception of a small population (<200) that have more than 100 products each.
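The distribution can be reproduced with a simple group-by (a sketch, assuming order size is the count of line items per invoice):

# Number of products (line items) per invoice, largest first
order_sizes = retail_pp.groupby('InvoiceNo')['StockCode'].count().sort_values(ascending=False)

# Full distribution (bottom graph)
order_sizes.plot.hist(bins=100, title='Products per invoice')

# The 100 largest orders (top graph)
plt.figure()
order_sizes.head(100).reset_index(drop=True).plot.line(title='Top 100 largest orders');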


Key Performance Indicators

Transaction KPIs

This involves analysis of customer transaction data grouped by month: revenue, revenue growth, order counts, revenue per order, etc.

Monthly Revenue

The formula for calculating the revenue on a monthly basis is shown below:

Monthly Revenue = Σ (Quantity × UnitPrice)
  • Calculate Order Value for each row by multiplying Unit Price with Quantity
  • Group by Year and Month using the respective elements of the index of the dataframe
  • Sum Order Values grouped by Year and Month
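The Revenue value used throughout the KPIs below is a straightforward row-level product (a one-liner, assuming the retail_pp dataframe from the pre-processing step):

# Order value per row: Quantity x UnitPrice (feeds all monthly KPIs below)
retail_pp['Revenue'] = retail_pp['Quantity'] * retail_pp['UnitPrice']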

Monthly Revenue Growth - Month on month percentage comparison of Revenue change.

Monthly Order Count - Total quantity ordered per grouping.

Average revenue per order - Mean of the order values within each monthly grouping.


Customer KPIs

This involves analysis around number of active customers and segregation between new and old customers from a Revenue perspective:

Active Customers - Unique count of Customer IDs per grouping.

Revenue of New vs Existing Customers - Define a new feature to distinguish revenue from new and existing customers.


KPI Dataframes

In order to capture the key monthly KPIs, the main retail_pp dataframe has been used. The new dataframe was generated as a multi-index one by:

  1. Grouping retail_pp based on Month and Year and creating an index based on these
  2. Renaming the index levels to Year and Month
  3. Generating and appending each new KPI on the new dataframe
  4. Plotting the KPIs using line or bar graphs as required
# 1. Revenue - Generate Monthly KPIs DataFrame by aggregating Revenue
m_kpis = pd.DataFrame(retail_pp.groupby([retail_pp.index.year,retail_pp.index.month])['Revenue'].sum())


# 2. Generate Monthly Growth rate based on previous months revenue
m_kpis['MonthlyGrowth'] = m_kpis['Revenue'].pct_change()


# 3. Generate Active Customers
m_kpis['ActiveCustomers'] = pd.DataFrame(retail_pp.groupby([retail_pp.index.year,retail_pp.index.month])['CustomerID'].nunique())


# 4. Generate Monthly Order Count (total quantity ordered per month)
m_kpis['MonthlyOrderCount'] = pd.DataFrame(retail_pp.groupby([retail_pp.index.year,retail_pp.index.month])['Quantity'].sum())


# 5. Generate Monthly Order Average
m_kpis['MonthlyOrderAverage'] =  pd.DataFrame(retail_pp.groupby([retail_pp.index.year,retail_pp.index.month])['Revenue'].mean())


# Rename index to capture Year and Month
m_kpis.index.set_names(['Year', 'Month'], inplace = True)

m_kpis

The new multi-index dataframe generated for the monthly KPIs looks like below:

[Image: m_kpis multi-index dataframe]

For the Customer KPIs dataframe, and in order to differentiate between an existing and new customer:

  • Created a new dataframe that captures the first purchase date of each customer grouped by CustomerID
  • Merged the new feature (First Purchase Date) to the existing retail_pp dataframe based on CustomerID
  • Created a new feature (UserType) in the retail_ppp dataframe based on whether the invoice month falls after the customer's first purchase month
  • Created Customer KPIs dataframe grouping by Year, Month and UserType and adding the Revenue per month
# Generate new dataframe based on CustomerID and its first purchase date
customer_first_purchase = retail_pp.groupby('CustomerID').InvoiceDate.min().reset_index()
customer_first_purchase.columns = ['CustomerID','FirstPurchaseDate']
customer_first_purchase['FirstPurchaseYearMonth'] = customer_first_purchase['FirstPurchaseDate'].map(lambda date: 100*date.year + date.month)


# Add first purchase date column to the new Customer Dataframe by merging with the original retail_pp
retail_ppp = pd.merge(retail_pp, customer_first_purchase, on='CustomerID')
retail_ppp['Date'] = pd.to_datetime(retail_ppp.InvoiceDate.dt.date)
retail_ppp.set_index('Date', inplace=True)
retail_ppp.head()


# Create new column "User Type" and default it to "New" as its values. Assign "Existing" value if 
# User's "FirstPurchaseYearMonth" was before the selected "InvoiceYearMonth"
retail_ppp['UserType'] = 'New'

retail_ppp.loc[retail_ppp['InvoiceYearMonth']>retail_ppp['FirstPurchaseYearMonth'],'UserType'] = 'Existing'


# Calculate the Revenue per month for each user type
customer_kpis = pd.DataFrame(retail_ppp.groupby([retail_ppp.index.year,retail_ppp.index.month, retail_ppp.UserType])['Revenue'].sum())
customer_kpis.index.set_names(['Year', 'Month', 'UserType'], inplace = True)


customer_kpis

Our new dataframe looks like below:

[Image: customer_kpis multi-index dataframe]

Visualisations

Structuring our data in a multi-index dataframe makes it easy to plot in various forms, such as line or bar plots, as shown below.


As per the diagram below, during April there was a significant drop in revenue growth, beyond the typical variation seen during June or July of the same year. Also worth noting: the sudden drop at the end of the graph is due to the lack of data in December, rather than an actual business change.

# Line Graph to capture the monthly revenue
m_kpis.plot.line(y='Revenue', title='Revenue per month',color = 'r',
         marker='h', 
         markersize=10, linewidth=2,
         markerfacecolor='white');
[Image: Revenue per month line plot]



Reviewing the rate-of-growth bar graph confirms the April drop as the biggest spike of negative growth.


# Horizontal Bar Graph to capture monthly revenue growth
m_kpis.plot.barh(y='MonthlyGrowth',title='Monthly Growth', rot=0);
[Image: Monthly growth bar graph]



Again, the monthly order count graph shows a consistent picture: during April there was a significant drop in quantities ordered, while the rest of the graph shows organic growth of the business.

# Line graph to capture quantities ordered per month
m_kpis.plot.line(y='MonthlyOrderCount', title='Quantities ordered per Month', color='b',
         marker='h',
         markersize=10, linewidth=2,
         markerfacecolor='white');
[Image: Quantities ordered per month line graph]



Finally, the monthly order average graph explores the average value of an order and its variation during the research window. There seems to be: a) small variation between orders, b) a slight increase in the average order value over time, and c) a significant drop in order value during April compared with the average in neighbouring months.

# Horizontal bar graph to capture the average order value per month
m_kpis.plot.barh(y='MonthlyOrderAverage', title='Monthly Order Average value');
[Image: Monthly order average bar graph]



On the Customer KPIs, it is interesting to confirm the same picture: organic growth of the customer base, with a similar drop in April.

# Line graph to capture the active customers per month
m_kpis.plot.line(y='ActiveCustomers', title='Active customers per month', color='g',
         marker='h',
         markersize=10, linewidth=2,
         markerfacecolor='white');
[Image: Active customers per month line graph]



In the diagram below we can clearly see a healthy increase in the existing customer base. Regarding new customers, following a natural decrease after the initial customer acquisition, revenue stabilises at a healthy, steady amount (circa $100,000).

# Plot new vs existing customers

customer_kpis.unstack().plot.bar(y='Revenue', title='New vs Existing Customer Monthly Revenue', rot=0);
[Image: New vs existing customer monthly revenue bar chart]


Modelling using RFM Analysis

In order to do Customer Segmentation, the RFM modelling technique has been used. RFM stands for Recency - Frequency - Monetary Value with the following definitions:

  1. Recency - Given a current or specific date in the past, Recency captures how long ago the customer last made a transaction.
  2. Frequency - Given a specific time window, Frequency captures the number of transactions the customer did during that window.
  3. Monetary Value or Revenue - Given a specific window, Revenue captures the amount spent by the customer.

Based on the values above, one can construct various segments that classify the customers to:

Low value Segment

  • R - Less active
  • F - Not very frequent
  • M - Low spend

Mid value Segment

  • R - Medium or inconsistent activity
  • F - Infrequent, or frequent only at specific times
  • M - Varied spending, but overall mid-scale

High value Segment

  • R - Very active
  • F - Very frequent
  • M - Great spending profile

Modelling Approach

The approach for deriving customer segments is to:

  • Calculate RFM scores individually and plot them to get an idea of their RFM values and ranges
  • Use the K-Means unsupervised clustering algorithm to create the three clusters above
  • Use techniques to optimise the number of clusters based on Silhouette and Inertia scores


Recency

To calculate Recency, we first start by creating a new data frame to hold all the RFM scores per customer by grouping per CustomerID.

Then we create another dataframe based on the maximum invoice date grouped by CustomerID, and fix an observation point (here, the last invoice day in the dataset) as a line-in-the-sand point.

Finally, we generate Recency in days by subtracting each customer's Last Purchase Date from the Last Invoice Date (the line-in-the-sand point):

Recency (days) = LastInvoiceDate − LastPurchaseDate

As per code below, we merge the two dataframes on CustomerID to create the Recency feature in our new Customer Dataframe.

# Generate new dataframe based on unique CustomerID to keep track of RFM scores
customer = pd.DataFrame(retail_pp['CustomerID'].unique())
customer.columns = ['CustomerID']


# Generate new data frame based on latest Invoice date from retail_ppp dataframe per Customer (groupby = CustomerID)
recency = retail_ppp.groupby('CustomerID').InvoiceDate.max().reset_index()
recency.columns = ['CustomerID','LastPurchaseDate']


# Set observation point as the last invoice date in the dataset
LastInvoiceDate = recency['LastPurchaseDate'].max()


# Generate Recency in days by subtracting the Last Purchase date for each customer from the Last Invoice Date
recency['Recency'] = (LastInvoiceDate - recency['LastPurchaseDate']).dt.days


# Consolidate to customer DataFrame
customer = pd.merge(customer, recency[['CustomerID','Recency']], on='CustomerID')


customer.head()

Frequency

The Frequency metric reflects the number of transactions per customer, so a simple count grouped by CustomerID will do:

# Count invoice lines per CustomerID and store in a new frequency dataframe
# (a stricter count of distinct orders would use InvoiceNo.nunique() instead)
frequency = retail_ppp.groupby('CustomerID').InvoiceDate.count().reset_index()
frequency.columns = ['CustomerID','Frequency']


# Consolidate Frequency to existing Customer DataFrame

customer = pd.merge(customer, frequency, on='CustomerID')

Monetary Value (Revenue)

For Revenue, the grouping is by CustomerID and the aggregation function is sum.

# Revenue per transaction has already been calculated as per KPIs section
# Grouping revenue per Customer ID
revenue = retail_ppp.groupby('CustomerID').Revenue.sum().reset_index()

# Consolidate Revenue to existing Customer DataFrame
customer = pd.merge(customer, revenue, on='CustomerID')


customer.head()

The dataframe now looks like below:

[Image: customer dataframe with Recency, Frequency and Revenue]

If we plot the histograms for each of the R, F and M values, we get the following:
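A minimal way to produce these histograms:

# Histograms of the three RFM features
customer[['Recency', 'Frequency', 'Revenue']].hist(bins=50, figsize=(12, 4), layout=(1, 3));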

[Image: histograms of Recency, Frequency and Revenue]

Note that there are some negative values in Revenue, explained by returns or cancellations of products; these have been intentionally kept within the active population, as long as they fall within the criteria set by the masks created.

If we create scatter plots of the pairs, we can visualise the relationships between the R, F and M variables and see if there are any obvious segments:
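One way to produce the pairwise view (a sketch using pandas' scatter_matrix):

# Pairwise scatter plots of the R, F, M variables
pd.plotting.scatter_matrix(customer[['Recency', 'Frequency', 'Revenue']], figsize=(10, 10), alpha=0.3);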

[Image: pairwise scatter plots of R, F and M]

There is clearly some distinction between high-value and low-value segments, but the difficulty with inspecting such scatter plots visually lies in finding the ideal boundaries and identifying the middle segment.

K-Means Clustering

For that purpose, we will use the power of K-Means clustering, an unsupervised learning method that can help us identify a specific number of clusters, even without labels.

The steps followed are:

  1. Set (arbitrarily) the number of clusters: here we'll aim for K=3
  2. Normalise the R, F, M values of data using StandardScaler method
  3. Train the model and review results (centroids and assigned clusters)
  4. Visualise results
  5. Calculate evaluation metrics (Inertia & Silhouette score)
  6. Optimise K based on multiple runs for a range of Ks and evaluate using the Elbow graph
  7. Review final customer segments

Here's the code that implements steps 1-3:

# Number of clusters (Configurable) - Initially we're attempting based on the 3 Segments (Low, Mid, High)
k=3


# Scaling input using StandardScaler
from sklearn.preprocessing import StandardScaler

# Input features: the R, F, M values from the customer dataframe
X = customer[['Recency', 'Frequency', 'Revenue']]

scaler = StandardScaler()
Xstd = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)


# Run and train main K-means algorithm based on all input features
from sklearn.cluster import KMeans
model = KMeans(n_clusters=k, random_state=0)
model.fit(Xstd)


# Review output cluster labels
cluster_labels = model.labels_
print("Assigned cluster labels: \n", cluster_labels)


# Review Centroids of clusters
centroids = model.cluster_centers_
print("Location of centroids: ")
print(centroids)


# Append clusters to input features table
Xstd['clusters'] = cluster_labels


Xstd.head()

Now, we can visualise the results and get an initial view of the segments:
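One quick 2D view of the clusters (a sketch; any pair of the three features works, and the article's figure may use different pairings):

# Scatter of Frequency vs Revenue, coloured by assigned cluster
Xstd.plot.scatter(x='Frequency', y='Revenue', c='clusters', colormap='viridis');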

[Image: scatter plots of the three clusters (K=3)]

We can now clearly see:

  • Yellow - High value segment
  • Green - Mid value segment
  • Purple - Low value segment

To evaluate the performance of K-Means clustering algorithm we use the following metrics:

  • Inertia - an absolute value; the rule of thumb is: the smaller the number, the better the fit.
  • Silhouette Score - ranges between -1 and 1; typically, a high score indicates that points are well matched to their cluster.
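A sketch of how these metrics can be computed for the K=3 model trained above:

from sklearn import metrics

# Inertia: within-cluster sum of squared distances (lower = tighter fit)
print("Inertia:", model.inertia_)

# Silhouette score, computed on the scaled features only (exclude the appended 'clusters' column)
X_features = Xstd.drop(columns='clusters')
print("Silhouette:", metrics.silhouette_score(X_features, cluster_labels, metric='euclidean'))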
[Image: inertia and silhouette score for K=3]

Onto step 6, we try to optimise the results by running K-Means for a range of Ks:

from sklearn import metrics

# Use only the scaled R, F, M features - exclude the appended 'clusters' column
X_features = Xstd.drop(columns='clusters')

inertias = []
ss = []

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X_features)
    inertias.append(kmeans.inertia_)
    ss.append(metrics.silhouette_score(X_features, kmeans.labels_, metric='euclidean'))
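The two evaluation graphs below can then be produced by plotting the collected scores (a sketch):

# Elbow graph (inertia) and silhouette score across the range of K
ks = range(2, 11)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(ks, inertias, marker='o')
ax1.set(title='Inertia (Elbow graph)', xlabel='K', ylabel='Inertia')
ax2.plot(ks, ss, marker='o')
ax2.set(title='Silhouette score', xlabel='K', ylabel='Score')
plt.show()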

The elbow graph shows that 3 clusters is about right, though 4 or 5 could also be chosen (as a stretch).

[Image: elbow graph – inertia vs K]

Plotting the silhouette score for a range of Ks also confirms that while K beyond 5 gives a more accurate matching, it probably over-fits and is thus not optimal:

[Image: silhouette score vs K]

Finally, onto step 7: the new customer dataframe with the assigned segments and labels looks like below:

[Image: customer dataframe with assigned segments]
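Attaching the labels back and naming the segments takes a short mapping. Note the cluster-to-segment mapping below is hypothetical; in practice, inspect the centroids to decide which cluster corresponds to which segment:

# Attach the K=3 cluster labels to the customer dataframe
customer['Cluster'] = cluster_labels

# Hypothetical mapping - read the ordering off the centroids before using it
segment_map = {0: 'Low Value', 1: 'Mid Value', 2: 'High Value'}
customer['Segment'] = customer['Cluster'].map(segment_map)

customer.head()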

If we plot the 3 different segments we get the following bar plot:

[Image: customer count per segment bar plot]

Business Actions

Once we have a clear view of the segments above, specific actions can and should be planned across the business (Operations, Marketing, Product, etc.) to address any potential issues:

  1. High Value: There is a small number of these clients, and the key priority for the business is to improve retention of these customers, as they are its most valuable asset.
  2. Mid Value: A larger proportion of customers sit here, and with the right actions they should be treated as potentially high-value ones: aim to increase retention and frequency and bring them closer to the brand and the product, so they eventually become High Value.
  3. Low Value: The vast majority will be in this category; the first step in addressing this group is to aim to increase frequency and understand whether there are any potential issues with the product or service.

Summary

To summarise our journey so far in this project, we've followed the steps below:

  1. Data Review & Understanding - Explored the dataset to understand the features and missing values.
  2. Feature Engineering & Processing - Created new features to help us uncover better insights.
  3. KPIs & Visualizations - Created a broad set of KPIs to get a better business view of the data.
  4. RFM analysis & new features - Used RFM analysis to model the data.
  5. Unsupervised learning K-Means clustering - Used unsupervised learning to tell us about the various data clusters.
  6. Optimisation & Evaluation - Ran K-Means clustering for a range of Ks to optimise the algorithm.
  7. Business Actions & Segments Mapping - Translated the results of machine learning into business actions.

Next Steps

As a next step, the work being done here can be leveraged to predict Customer Lifetime Value (CLTV).

This can be done using the RFM features created here as input to a Supervised Machine Learning model with output being the total amount spent by the customer in their lifetime (Revenue).

I hope you've found this article useful, it was certainly a pleasure working through this project.


GitHub

The full Jupyter Notebook for this project can be found here.




About Angelo

I’m a freelance Principal Management Consultant and applied Data Scientist with broad experience implementing Data Transformation and delivering Technology Change on behalf of Vendors and Global Banks in Financial Services and Government.

Coming from a solid academic background in Robotics Engineering, skills like mathematics, statistics and programming (luckily) came off the shelf for me, coupled with the genuinely curious, analytical and inquisitive thinking of an engineer.

This unique blend of technical skills and industry experience has helped me work with a broad range of business and technical teams in Technology and Data, helping them bridge gaps and bring critical projects over the line, delivering great value to their organisations.

My interest going forward is to explore how to use data to drive better insights for sales and digital marketing teams, in order to help them retain their existing client base and also expand further to new ones.

Feel free to drop me a line at:

LinkedIn - https://www.dhirubhai.net/in/etzimopoulos/

GitHub - https://github.com/etzimopoulos

Medium - https://medium.com/@etzimopoulos

#customersegmentation #rfmanalysis #datascience #customerretention #digitalmarketing #capstone #machinelearning #managementconsulting #businessconsulting
