AI/ML : Customer Behavior Analysis to Increase Subscriptions for a Financial App... From Scratch [Part 1]
Prateek Sharma
Manager Risk & Fraud BMS || Revenue Assurance Consultant || Thinker | Solving Problems with AI ML || Ex- Airtel || Ex- Subexian || eMDP IIMK
AI/ML is without a doubt the hot topic of the town, and that makes sense as soon as we start to understand its capabilities in detail. This kind of analysis can do wonders for business owners by predicting outcomes that are hard to estimate by intuition alone. Here I would like to walk through a common problem using financial application data: predicting whether a customer will subscribe to the premium version of an app or not. Based on that prediction, the company can decide which customers should receive an offer. The data contains the customers’ behavior, and our job is to find insights in it.
So let's begin...
Problem Statement :- A fintech company has launched an application that can be used for multiple purposes such as loans, savings, payments, etc. The application has two versions (Free and Paid). The goal of the company is to sell the paid version to the right target market and save the extra cost of marketing. For that reason, they provided the premium features in the free version of the app for 24 hours to collect the customers’ behavior. After that, the company hired a Machine Learning Engineer to find insights from the collected data (customers’ behavior).
If a customer will buy the product anyway, there is no need to give that customer an offer. Only give offers to those customers who are interested in using the paid version of the app but can’t afford its cost.
We will use Python with a Jupyter notebook to perform the analysis.
1. Get your tools ready :- So let's import the essential libraries
import numpy as np # for numeric calculation
import pandas as pd # for data analysis and manipulation
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for data visualization
from dateutil import parser # convert time strings to the datetime data type
2. Let's explore the data :-
fineTech_appData = pd.read_csv("FineTech_appData.csv") # load the data
fineTech_appData.shape # get shape of dataset
Output :- (50000, 12)
So the dataset holds 50000 rows and 12 columns. Let's see what it looks like.
fineTech_appData.head(5) # show first 5 rows of the fineTech_appData DataFrame
The full content of the 6th column (screen_list) is not visible, so let's look at this column in detail.
for i in [1,2,3,4,5]:
    print(fineTech_appData.loc[i,'screen_list'],'\n') # print 5 rows (index 1 to 5) of screen_list
Great... Now let's not forget the first rule of data validation. Let's identify the null values.
fineTech_appData.isnull().sum() # check for null values
fineTech_appData.info() # get information about the dataset
So we can see that enrolled_date has 18926 null values and the rest of the columns are fine.
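The missing enrolled_date values are harmless if they all belong to users who never enrolled. A minimal sketch of that check, on a made-up toy frame (the values are illustrative, and it assumes enrolled is a 0/1 flag as the later histograms suggest):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "enrolled": [1, 0, 1, 0],
    "enrolled_date": ["2013-01-05 10:00:00", np.nan, "2013-02-01 18:30:00", np.nan],
})

# Count missing enrollment dates separately for non-enrolled (0) and enrolled (1) users.
missing_by_class = toy["enrolled_date"].isnull().groupby(toy["enrolled"]).sum()
print(missing_by_class)
```

If the count for enrolled users is zero on the real data, the nulls simply mark customers who did not subscribe and need no imputation.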
fineTech_appData.describe() # To see the distribution
Okay, now let's keep only the important columns and remove the rest.
fineTech_appData2 = fineTech_appData.drop(['user', 'first_open', 'screen_list', 'enrolled_date'], axis = 1) #To Drop non usable columns
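Oh yeah, one more thing before we forget: let's quickly change the data type of the hour column (from 02:00:00 to 2), which makes more sense for data analysis.
fineTech_appData['hour'] = fineTech_appData.hour.str.slice(1,3).astype(int) # convert the time string to an integer hour (slice(1,3) assumes a leading space in the raw value)
fineTech_appData2['hour'] = fineTech_appData['hour'] # keep the trimmed DataFrame used for the plots in sync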
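To see what the hour conversion does, here is a small sketch on made-up sample values. The leading space in each string is an assumption about the raw format; the second variant is an alternative that does not depend on it.

```python
import pandas as pd

# Made-up sample values; the leading space is an assumption about the raw format.
hours = pd.Series([" 02:00:00", " 14:00:00", " 23:00:00"])

# Option 1: fixed-position slice that picks characters 1 and 2 (the hour digits).
by_slice = hours.str.slice(1, 3).astype(int)

# Option 2: strip whitespace and take the part before the first colon,
# which works whether or not the leading space is present.
by_split = hours.str.strip().str.split(":").str[0].astype(int)

print(by_slice.tolist())
print(by_split.tolist())
```

Both variants give the same integers here; the split-based one is a little more forgiving if the raw strings are formatted differently.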
Hmm, looks better now... So it's time to see some visualizations, as the mind understands pictures better.
3. Data visualization :-
1. Heatmap with correlation matrix :-
plt.figure(figsize=(14,6)) # plot figure with given size
sns.heatmap(fineTech_appData2.corr(), annot = True, cmap ='coolwarm') # draw the correlation matrix as a heatmap
annot – if True, the data value is written in each cell; an array of the same shape as the data can be passed instead to annotate the heatmap
cmap – a matplotlib colormap name or object. This maps the data values to the color space.
In the fineTech_appData2 dataset, there is no strong correlation between any features. There is a small correlation between ‘numscreens’ and ‘enrolled’. It suggests that customers who saw more screens are more likely to take the premium app.
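When only the relationship with the target matters, the correlation matrix can be reduced to a single sorted column. A minimal sketch on a made-up toy frame (the numbers are illustrative and not taken from the real dataset):

```python
import pandas as pd

# Toy data: the values are made up so that numscreens tracks enrolled.
toy = pd.DataFrame({
    "numscreens": [5, 12, 30, 8, 25, 40],
    "age": [22, 35, 28, 41, 30, 25],
    "enrolled": [0, 0, 1, 0, 1, 1],
})

# Correlation of every feature with the target, strongest positive first.
target_corr = toy.corr()["enrolled"].drop("enrolled").sort_values(ascending=False)
print(target_corr)
```

On the real data the same line can be run on fineTech_appData2 to rank the features by their correlation with ‘enrolled’.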
2. Count Plot :-
sns.countplot(x=fineTech_appData.enrolled) # pass the column as x so it works across seaborn versions
This shows that there are more enrolled users than non-enrolled users in our data set.
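The count plot can be backed by exact numbers. A small sketch using value_counts on a made-up series (the five values are illustrative only):

```python
import pandas as pd

# Made-up enrollment flags standing in for the real 'enrolled' column.
enrolled = pd.Series([1, 0, 1, 1, 0], name="enrolled")

counts = enrolled.value_counts()                # absolute number per class
shares = enrolled.value_counts(normalize=True)  # fraction per class

print(counts)
print(shares)
```

Knowing the class shares up front helps later when judging model accuracy, since a model that always predicts the majority class already reaches that share.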
3. Histogram of each feature:-
plt.figure(figsize = (14,7))
features = fineTech_appData2.columns # list of column names
for i,j in enumerate(features):
    plt.subplot(3,3,i+1) # create subplot for histogram
    plt.title("Histogram of {}".format(j), fontsize = 12) # title of histogram
    bins = len(fineTech_appData2[j].unique()) # bins for histogram
    plt.hist(fineTech_appData2[j], bins = bins, rwidth = 0.8, edgecolor = "y", linewidth = 2) # plot histogram
plt.subplots_adjust(hspace=0.5) # vertical spacing between subplots
The histograms show that minigame, used_premium_feature, enrolled, and liked have only two values.
The histogram of ‘dayofweek’ shows that slightly fewer customers registered on the app on Tuesday and Wednesday.
The histogram of ‘hour’ shows that the fewest customers registered on the app around 10 AM.
The ‘age’ histogram shows that most customers are young.
The ‘numscreens’ histogram shows that few customers saw more than 40 screens.
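The observations read off the histograms can also be checked numerically. A minimal sketch on a made-up toy frame (column names follow the dataset, the values are illustrative):

```python
import pandas as pd

# Toy stand-in with two binary columns and two numeric ones.
toy = pd.DataFrame({
    "minigame": [0, 1, 0, 1],
    "liked": [0, 0, 1, 0],
    "age": [22, 35, 28, 41],
    "numscreens": [5, 12, 45, 8],
})

# Columns with exactly two distinct values are the binary features.
unique_counts = toy.nunique()
binary_columns = unique_counts[unique_counts == 2].index.tolist()

# Fraction of customers who saw more than 40 screens.
share_over_40 = (toy["numscreens"] > 40).mean()

print(binary_columns)
print(share_over_40)
```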
Now let's find out how much time a customer takes to enroll in the premium app after registration. For that we need the first open date and the enrolled date (which we dropped from fineTech_appData2 earlier, but which are still in fineTech_appData).
We parse the first open date and the enrolled date into datetime format for easy subtraction.
fineTech_appData['first_open'] = [parser.parse(i) for i in fineTech_appData['first_open']] # parse each string into a datetime
fineTech_appData['enrolled_date'] = [parser.parse(i) if isinstance(i, str) else i for i in fineTech_appData['enrolled_date']] # leave missing values as they are
Now subtract and plot to see the time.
fineTech_appData['time_to_enrolled'] = (fineTech_appData.enrolled_date - fineTech_appData.first_open).dt.total_seconds() / 3600 # difference in hours; astype('timedelta64[h]') is rejected by recent pandas versions
plt.hist(fineTech_appData['time_to_enrolled'].dropna(), range = (0,100))
The plot shows that most customers who enroll do so within 0 to 10 hours of registration.
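The subtraction step can be tried on a small example first. This sketch uses made-up timestamps and pd.to_datetime as an alternative to dateutil's parser; it turns missing values into NaT on its own.

```python
import numpy as np
import pandas as pd

# Made-up timestamps; the second user never enrolled.
toy = pd.DataFrame({
    "first_open": ["2013-01-01 08:00:00", "2013-01-02 09:30:00", "2013-01-03 12:00:00"],
    "enrolled_date": ["2013-01-01 11:00:00", np.nan, "2013-01-05 12:00:00"],
})

first_open = pd.to_datetime(toy["first_open"])
enrolled_date = pd.to_datetime(toy["enrolled_date"])  # missing values become NaT

# Time from registration to enrollment, expressed in hours.
toy["time_to_enrolled"] = (enrolled_date - first_open).dt.total_seconds() / 3600

# Share of enrolled users who subscribed within the first 10 hours.
share_within_10h = (toy["time_to_enrolled"].dropna() <= 10).mean()

print(toy["time_to_enrolled"].tolist())
print(share_within_10h)
```

The same share computed on the real column puts a number on what the histogram shows.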
We have enough information for our analysis, but we skipped one important piece of information that would be very helpful here: the "Screen List". This will help us identify the user's behavior and how much time he/she is spending on the screens.
We have collected all the distinct screen names in a separate CSV file. For each of these screen names we add a column to the dataset and check whether the name appears in ‘screen_list’: if it does, the new column gets the value 1, otherwise 0.
# convert screen names into 0/1 indicator columns (fineTech_app_screen_data holds the screen names from that CSV file)
for screen_name in fineTech_app_screen_data:
    fineTech_appData[screen_name] = fineTech_appData.screen_list.str.contains(screen_name).astype(int)
    fineTech_appData['screen_list'] = fineTech_appData.screen_list.str.replace(screen_name+",", "")
fineTech_appData.shape
Output :- (50000, 68)
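The indicator-column idea can be illustrated on a tiny example. This sketch is a variant rather than the exact code above: it splits each screen list on commas and tests whole names, so a screen such as "Loan2" is not counted as "Loan". The screen names and the top_screens list are hypothetical.

```python
import pandas as pd

# Made-up screen lists; "Loan2" is included to show the substring pitfall.
toy = pd.DataFrame({
    "screen_list": ["Home,Loan,Credit1", "Home,Splash", "Loan2,Splash"],
})
top_screens = ["Home", "Loan"]  # hypothetical list of screens to turn into columns

# Split each comma-separated string into a list of screen names.
tokens = toy["screen_list"].str.split(",")

for screen_name in top_screens:
    # 1 if the exact screen name was visited, else 0.
    toy[screen_name] = tokens.apply(lambda names: int(screen_name in names))

# Screens not covered by an indicator column, kept as a comma-separated string.
toy["other_screens"] = tokens.apply(
    lambda names: ",".join(n for n in names if n not in top_screens)
)

print(toy)
```

With plain substring matching, the third row would get Loan = 1 because "Loan2" contains "Loan"; matching whole names avoids that.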
So after all the cleaning, this is what the final usable data looks like.
Now this data can be used to run machine learning algorithms by splitting the data set into training and testing sets. The majority of an ML data analyst's time goes into cleaning the data set to make sure the data gives a clear picture without any problems.
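As a preview of the splitting step, here is a minimal sketch that uses only pandas on a made-up toy frame. The 80/20 ratio and the random seed are assumptions for illustration; scikit-learn's train_test_split is the more common tool for this.

```python
import pandas as pd

# Toy data standing in for the cleaned dataset.
toy = pd.DataFrame({
    "numscreens": [5, 12, 30, 8, 25, 40, 3, 18, 22, 9],
    "enrolled":   [0, 0, 1, 0, 1, 1, 0, 1, 1, 0],
})

# Randomly pick 80% of the rows for training; the rest form the test set.
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)

# Separate the features (X) from the target (y).
X_train, y_train = train.drop(columns="enrolled"), train["enrolled"]
X_test, y_test = test.drop(columns="enrolled"), test["enrolled"]

print(len(train), len(test))
```

Holding the test rows back until the end gives an honest estimate of how a model performs on customers it has not seen.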
In the next part of this blog we will try the following ML models to see which one gives the most accurate results.
- Decision Tree Classifier
- Nearest Neighbor Classifier
- Naive Bayes Classifier
- Random Forest Classifier
- Logistic Regression
- Support Vector Classifier
- XGBoost Classifier
So Stay Tuned.......