Machine Learning fast-track: Telco Customer Churn Prediction

In this tutorial, we will unleash the powerful analytical capabilities provided by the Python environment!

Our goal will be to analyze the data and use machine learning algorithms to predict customer churn. This will be done using data provided by Kaggle, the Machine Learning and Data Science community (https://www.kaggle.com/).

The Telco Customer Churn sample data set (IBM Sample Data Sets) is a well-known resource for practicing data analysis and training machine learning algorithms: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

The sample churn data will give us insight into customer behavior, and the results of our analyses can serve as a key building block of a customer retention program.

In order to do so, the session will guide you through two preparation steps:

  • data preparation, supported by the charming Python “pandas” library for data analysis, and
  • data exploration, with the help of cool visualizations provided by the “matplotlib” package.

After that, the Python machine learning library “scikit-learn” (sklearn) will greet you with open arms.

Who can use this tutorial?

Everybody is welcome. Those with more experience in Python and data analytics will get more out of the details, but please don’t be discouraged by the program code. Code is your friend. The only prerequisites are an open mind and a willingness to learn about new technology.

Hashtags: #python #dataanalytics #ai #artificialintelligence #ml #machinelearning #datascience #customerchurn #telco #pandas #sklearn #dataanalysis #anaconda


Set your data analytics environment

Install the Anaconda distribution for Python on your Windows machine

My recommendation is to use the Anaconda distribution for Python for data analysis and other purposes. Anaconda offers a free and open-source distribution of the Python programming language for scientific computing (data science, machine learning applications, etc.). The main advantage is simplified package management and deployment (over 250 packages automatically installed, and over 7,500 additional open-source packages).


  • Go to the Anaconda website (https://www.anaconda.com) and scroll down to the “Download” section.


  • Click on the link for the “64-Bit Graphical Installer” version.

This will start the download and that can take a while.


  • Start the installation by double-clicking the installation file in your download directory.


  • Install Anaconda Individual Edition with default features.


Be patient, as this can take some time depending on your system performance. A lot of cool packages for data analytics are coming your way!


  • When it is done, click “Next”, “Next” and “Finish” to end the installation process.


If you have time, I encourage you to review the documentation, but since this is a fast tutorial, we will not do that now. Let’s instead put our data analytics environment to work on some serious data processing!


Set your data analytics goal

Churn prediction is a classic machine learning use case. In telco and other subscription-based services, churn means a situation where the customer leaves the service provider. For a telco, churn is one of the most important KPIs and risks, so companies develop models to calculate the likelihood of churn and prevent it. The better the prediction, the more sustainable the business model. Knowing the drivers of churn, or its probability, is crucial for developing a defense plan and for understanding the reasons for dissatisfaction with the service.

From source: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

The Telco customer churn data contains information about a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3. It indicates which customers have left, stayed, or signed up for their service. Multiple important demographics are included for each customer, as well as a Satisfaction Score, Churn Score, and Customer Lifetime Value (CLTV) index.

Our data analytics project will deal with the following problem:

  • Use telco churn data to predict the behavior of customers.

Results of the analyses can be used to automate decision-making in customer retention programs.


Data preparation

The first step is to load the data:

import pandas as pd

#--------------#
# LOADING DATA #
#--------------#

# https://www.kaggle.com/blastchar/telco-customer-churn

df_source = pd.read_csv('TelcoCustomerChurn.csv')
df = df_source.copy()

print('\n\n-------------------------------')
print(df.info())

describe_df = df.describe()
        

Only "SeniorCitizen", "tenure", and "MonthlyCharges" are recognized as numeric data types.

The other columns are interpreted as the generic "object" type, which is not ML friendly.
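
As an optional quick check (a minimal sketch; "object_columns" is just an illustrative name), we can list exactly which columns pandas loaded as the generic "object" type - these are the candidates for conversion and encoding in the next steps:

# list the columns that pandas loaded as the generic "object" dtype
object_columns = df.select_dtypes(include='object').columns.tolist()
print(object_columns)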

Analysis of data columns to identify independent and dependent variables:

X represents the independent variables - the variables we use to make predictions:

  • customerID - unique value identifying customer
  • gender - whether the customer is a male or a female
  • SeniorCitizen - whether the customer is a senior citizen or not (1, 0)
  • Partner - whether the customer has a partner or not (Yes, No)
  • Dependents - whether the customer has dependents or not (Yes, No). A dependent is a person who relies on another as a primary source of income.
  • tenure - number of months the customer has stayed with the company
  • PhoneService - whether the customer has a phone service or not (Yes, No)
  • MultipleLines - whether the customer has multiple lines or not (Yes, No, No phone service)
  • InternetService - customer’s internet service provider (DSL, Fiber optic, No)
  • OnlineSecurity - whether the customer has online security or not (Yes, No, No internet service)
  • OnlineBackup - whether the customer has online backup or not (Yes, No, No internet service)
  • DeviceProtection - whether the customer has device protection or not (Yes, No, No internet service)
  • TechSupport - whether the customer has tech support or not (Yes, No, No internet service)
  • StreamingTV - whether the customer has streaming TV or not (Yes, No, No internet service)
  • StreamingMovies - whether the customer has streaming movies or not (Yes, No, No internet service)
  • Contract - type of contract according to duration (Month-to-month, One year, Two year)
  • PaperlessBilling - bills issued in paperless form (Yes, No)
  • PaymentMethod - payment method used by customer (Electronic check, Mailed check, Credit card (automatic), Bank transfer (automatic))
  • MonthlyCharges - amount charged for the service on a monthly basis
  • TotalCharges - cumulative charges for service during subscription (tenure) period

y is the dependent variable - the variable we are trying to predict or estimate:

  • Churn - the output value, the variable to predict (Yes, No)

Data cleaning is a recommended first step in any analysis.

Let's drop 'customerID' - no sense in using it for analysis.


# let's drop 'customerID' - no sense in using it for analysis
df = df.drop(['customerID'], axis = 1)

print('\n\n-------------------------------')
print(df.dtypes)
        

Column "TotalCharges" is an object, and we have to convert it to a numeric value (errors='coerce' - invalid parsing will be set as NaN).


# column 'TotalCharges' is object, and we have to convert it to numeric value
# errors='coerce' - invalid parsing will be set as NaN
df['TotalCharges'] = pd.to_numeric(df.TotalCharges, errors='coerce')
        

Check for null entries:


#------------------------#
# CHECK FOR NULL ENTRIES #
#------------------------#

print('\n\n------- Check for null entries ------------')
print(df.isnull().sum())

import numpy as np

# after conversion, 'TotalCharges' has 11 missing values
tempNaN = df[np.isnan(df['TotalCharges'])]
        

An interesting inconsistency is detected in these records:

  • Tenure means the number of months the customer has stayed with the company, and the "0" entry does not make sense.
  • Furthermore, in cases where the tenure column has a value of "0", entries for "TotalCharges" are NaN even though the "MonthlyCharges" column is not empty.

We have to find records with tenure = 0 values.


# we have to find records with 'tenure' = 0 values
tempTenureZero = df[df['tenure'] == 0]
        

No additional cases are found.

We generally have two strategies for dealing with null entries:

1) drop records with tenure = 0 values:


# drop records with 'tenure' = 0 values
df.drop(df[df['tenure'] == 0].index, inplace = True)
        

2) or replace the missing values with some other value - in that case, using the column average is a good strategy (a sketch follows below).
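
For completeness, here is a minimal sketch of the second strategy (not used in this tutorial); "df_filled" is just an illustrative name, and the sketch reuses the untouched df_source copy loaded at the beginning:

# alternative strategy (not used here): fill missing 'TotalCharges' with the column mean
df_filled = df_source.copy()
df_filled['TotalCharges'] = pd.to_numeric(df_filled['TotalCharges'], errors='coerce')
df_filled['TotalCharges'] = df_filled['TotalCharges'].fillna(df_filled['TotalCharges'].mean())
print(df_filled['TotalCharges'].isnull().sum())   # 0 - no missing values remain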

Let’s use the simple approach and delete records with corrupt data.


print('\n\n-------- Null entries are resolved ----------')
print(df.isnull().sum())
        

Encode categorical labels with appropriate numerical values.

Label Encode Binary data:

Independent variables for machine learning algorithms can typically only have numerical values. Label encoding is used for all categorical variables with only two unique values.

Change the data type for categorical data candidates:


#-------------------------------------------------------------#
# ENCODE CATEGORICAL LABELS WITH APPROPRIATE NUMERICAL VALUES #
#-------------------------------------------------------------#

# Label encoding for categorical data
# change data type for categorical data candidates
cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
        'PhoneService', 'PaperlessBilling', 'Churn']

df[cols] = df[cols].astype('category')
print('\n\n-------------------------------')
print(df.dtypes)

# label encoding: replace each category with its numeric code
for column in cols:
    df[column] = df[column].cat.codes
        


Data exploration

The best approach to understanding the given dataset is to explore and visualize the data. The distribution of the independent variables will give us insight into patterns in the data and help us form hypotheses.

The first step is to import the required tools for data visualization:


import matplotlib.pyplot as plt
import seaborn as sns

Histograms are a great way to explore columns with numerical data.


# HISTOGRAMS FOR COLUMNS WITH NUMERICAL DATA

ds_histograms = df[['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                    'tenure', 'PhoneService', 'PaperlessBilling',
                    'MonthlyCharges']]

fig1 = plt.figure(1, figsize=(15, 12))
plt.suptitle('Histograms for columns with numerical data\n',
             horizontalalignment="center", fontstyle="normal",
             fontsize=24, fontfamily="sans-serif")

for i in range(ds_histograms.shape[1]):
    plt.subplot(6, 3, i + 1)
    f = plt.gca()
    f.set_title(ds_histograms.columns.values[i])
    vals = np.size(ds_histograms.iloc[:, i].unique())
    if vals >= 100:
        vals = 100
    plt.hist(ds_histograms.iloc[:, i], bins=vals, color='#e20075')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        

Note: label encoding for gender (customers are 49.5% female and 50.5% male):

Female = 0, Male = 1
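
If you want to double-check how the codes were assigned, here is a small sketch; it uses the untouched df_source copy and relies on the fact that pandas orders categories alphabetically ('Female' before 'Male'):

# recover the label-to-code mapping produced by .cat.codes
gender_categories = df_source['gender'].astype('category').cat.categories
print(dict(enumerate(gender_categories)))                          # {0: 'Female', 1: 'Male'}
print(df_source['gender'].value_counts(normalize=True).round(3))   # share of female and male customers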

By reviewing the histograms for the numerical variables, the following can be concluded:

  • there is a relatively equal proportion of males and females
  • most of the customers belong to the younger population
  • almost half of the customers have a partner, but only a few customers have dependents
  • a review of the "tenure" histogram reveals a lot of new customers with less than 10 months of service usage; another loyal segment is customers with more than 70 months of service usage
  • customers mostly have phone service, and 75% of customers have paperless billing
  • monthly charges are between $18 and $118 per customer; a lot of customers pay around $20

The next good idea is to analyze the payment method.


# ANALYZE PAYMENT METHOD
countPaymentMethod = df.groupby('PaymentMethod')['PaymentMethod'].count()

# total number of customers (used to show absolute values in the pie slices)
total = int(df['PaymentMethod'].count())

# group labels follow the alphabetical order produced by groupby
mylabels = ["Bank transfer (automatic)", "Credit card (automatic)", "Electronic check", "Mailed check"]

fig2, ax2 = plt.subplots()

# show absolute values in 'PaymentMethod' groups
ax2.pie(countPaymentMethod, labels=mylabels, autopct=lambda p: '{:.0f}'.format(p * total / 100))

# set an equal aspect ratio to draw the pie as a circle
ax2.axis('equal')
plt.tight_layout()
plt.show()
        


The data show that most customers pay their bills by electronic check; the remaining customers are split fairly evenly among mailed checks, bank transfers, and credit cards.

Just as a visualization exercise, and as an optional step, we can also include subplots for the service data.

Version with bar charts:


# SUBPLOTS FOR SERVICE DATA

service_labels = ['MultipleLines', 'InternetService', 'OnlineSecurity',
                  'OnlineBackup', 'DeviceProtection',
                  'TechSupport', 'StreamingTV', 'StreamingMovies']

# bar charts
fig3, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 10))
for i, item in enumerate(service_labels):
    if i < 2:
        ax = df[item].value_counts().plot(kind='bar', ax=axes[i, 0], rot=0, color='#e20074')
        ax.set_title(item)
    elif i >= 2 and i < 4:
        ax = df[item].value_counts().plot(kind='bar', ax=axes[i - 2, 1], rot=0, color='#c8b45a')
        ax.set_title(item)
    elif i >= 4 and i < 6:
        ax = df[item].value_counts().plot(kind='bar', ax=axes[i - 4, 2], rot=0, color='#00a8e6')
        ax.set_title(item)
    elif i < 8:
        ax = df[item].value_counts().plot(kind='bar', ax=axes[i - 6, 3], rot=0, color='#ecccbf')
        ax.set_title(item)
        

Version with pie charts:


# pie charts
fig4, axes2 = plt.subplots(nrows=2, ncols=4, figsize=(16, 10))

for i, item in enumerate(service_labels):
    # take legend labels from value_counts() so they match the slice order
    if i < 2:
        ax1 = plt.subplot2grid((2, 4), (i, 0))
        labels = df[item].value_counts().index.tolist()
        plt.pie(df[item].value_counts(), autopct='%.0f%%')
        plt.legend(labels, loc='lower left', bbox_to_anchor=(0.0, -0.2))
        ax1.set_title(item)
    elif i >= 2 and i < 4:
        ax1 = plt.subplot2grid((2, 4), (i - 2, 1))
        labels = df[item].value_counts().index.tolist()
        plt.pie(df[item].value_counts(), autopct='%.0f%%')
        plt.legend(labels, loc='lower left', bbox_to_anchor=(0.0, -0.2))
        ax1.set_title(item)
    elif i >= 4 and i < 6:
        ax1 = plt.subplot2grid((2, 4), (i - 4, 2))
        labels = df[item].value_counts().index.tolist()
        plt.pie(df[item].value_counts(), autopct='%.0f%%')
        plt.legend(labels, loc='lower left', bbox_to_anchor=(0.0, -0.2))
        ax1.set_title(item)
    elif i < 8:
        ax1 = plt.subplot2grid((2, 4), (i - 6, 3))
        labels = df[item].value_counts().index.tolist()
        plt.pie(df[item].value_counts(), autopct='%.0f%%')
        plt.legend(labels, loc='lower left', bbox_to_anchor=(0.0, -0.2))
        ax1.set_title(item)
        

Another very useful data exploration method is to check the correlation between variables.

A commonly used rule of thumb for interpreting the size of a correlation coefficient (in absolute value): below 0.3 is negligible, 0.3-0.5 is low, 0.5-0.7 is moderate, 0.7-0.9 is high, and 0.9-1.0 is very high.

a) Correlation between all variables


# Correlation between all variables
plt.figure(5, figsize=(25, 10))
corr = df.apply(lambda x: pd.factorize(x)[0]).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
ax = sns.heatmap(corr, mask=mask, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, linewidths=.2, cmap='coolwarm', vmin=-1, vmax=1)
        

Not much new information can be concluded here. The relations between the data are logical and expected. The strongest correlations are at the service level, between prerequisites for a service (e.g., PhoneService and MultipleLines, InternetService and the related additional Internet-based services) and between the Internet services themselves (probably part of the same service package).

b) Correlation between churn and selected boolean and numeric variables


# Correlation between churn and selected boolean and numeric variables
plt.figure(6)
ds_corr = df[['SeniorCitizen', 'Partner', 'Dependents',
              'tenure', 'PhoneService', 'PaperlessBilling',
              'MonthlyCharges', 'TotalCharges']]

correlations = ds_corr.corrwith(df.Churn)
correlations = correlations[correlations != 1]
correlations.plot.bar(
        figsize=(18, 10),
        fontsize=15,
        color='#e20074',
        rot=45, grid=True)

plt.title('Correlation with Churn Rate \n', horizontalalignment="center", fontstyle="normal", fontsize="22", fontfamily="sans-serif")
        

Here we can make more interesting conclusions!

  • There is a positive correlation between churn and the age of customers - senior citizens churn more often. Maybe there is a campaign by competitors targeting the senior population.
  • Logically, longer tenure could also mean more loyalty and less churn risk.
  • It is also logical that higher monthly charges can result in more churn risk.
  • However, it is interesting that total charges show a negative correlation with churn. The explanation can be that total charges also depend on the time the customer has spent with the company (tenure has a negative correlation). It is also questionable whether TotalCharges is an adequate variable for understanding customer behavior and whether it is tracked by the customer at all.
  • A positive correlation between paperless billing and churn is something that needs extra exploring (it is not clear what the drivers of that behavior might be).


One-hot encoding for categorical data

Before we continue, an additional transformation of the data is needed. In the previous step, we used label encoding for binary data, which prepared those independent variables as purely numerical values for the machine learning algorithms.

As explained, machine learning algorithms require numerical values for their independent variables. We will now introduce dummy columns for the independent variables that hold categorical data with more than two unique values.


#----------------------------------------#
# ONE-HOT ENCODING FOR CATEGORICAL DATA  #
#----------------------------------------#

# first, copy the data to a new 'dataset' variable to preserve the original values
dataset = df.copy()

# one-hot encoding for categorical data
dataset = pd.get_dummies(dataset)
        

The result is a set of new columns with binary values (0 or 1) - one column per category of each multi-valued variable.


With new columns generated, we can further check correlations with churn.

Correlation: Contract type vs. Churn:


# Correlation: Contract type vs. Churn
plt.figure(7)

ds_contract_type_corr = \
    dataset[['Contract_Month-to-month', 'Contract_One year', 'Contract_Two year']]

correlations = ds_contract_type_corr.corrwith(dataset.Churn)
correlations = correlations[correlations != 1]
correlations.plot.bar(
        figsize=(18, 10),
        fontsize=15,
        color='#c8b45a',
        rot=45, grid=True)

plt.title('Correlation: Contract type vs. Churn \n')
        

The month-to-month type of subscription is the most exposed to churn risk. A longer contract duration is a good churn prevention mechanism.

Correlation: Payment method vs. Churn


# Correlation: PaymentMethod vs. Churn
plt.figure(8)

ds_payment_method_corr = \
    dataset[['PaymentMethod_Bank transfer (automatic)',
             'PaymentMethod_Credit card (automatic)',
             'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']]

correlations = ds_payment_method_corr.corrwith(dataset.Churn)
correlations = correlations[correlations != 1]

correlations.plot.bar(
        figsize=(18, 10),
        fontsize=15,
        color='#00a8e6',
        rot=45, grid=True)

plt.title('Correlation: Payment method vs. Churn \n')
        

Reasons for a positive correlation between electronic check as a payment method and Churn have to be investigated.

Multicollinearity

A high correlation between two or more independent variables leads to a phenomenon in data science called multicollinearity. In other words, an independent variable can be predicted from another independent variable. Variables with high multicollinearity are redundant, and they make it hard to interpret the model and could create an overfitting problem.

The VIF (Variance Inflation Factor) is a great tool for checking multicollinearity. The VIF measures how strongly a variable is correlated with the group of other independent variables in a dataset: for each variable i, VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing variable i on all the other independent variables. VIF starts at 1 (no correlation at all), and a value above 10 is commonly taken to indicate high multicollinearity between the independent variables.

Note: Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. As a result, the model is useful in reference only to its initial data set, and not to any other data sets (https://www.investopedia.com/terms/o/overfitting.asp).


#-------------------#
# MULTICOLLINEARITY #
#-------------------#

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif(X):
    # calculate the Variance Inflation Factor for every column of X
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["Variance Inflation Factor"] = [variance_inflation_factor(X.values, i)
                                        for i in range(X.shape[1])]
    return vif

ds_vif = dataset[['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                  'tenure', 'PhoneService', 'PaperlessBilling',
                  'MonthlyCharges', 'TotalCharges']]

vif = calculate_vif(ds_vif)
print(vif)
        

We can notice that the features "Monthly Charges" and "Total Charges" have a high VIF value. Let's use a scatter graph and plot "Monthly Charges" and "Total Charges" values to check how they correlate.


plt.figure(9)
ds_vif[['MonthlyCharges', 'TotalCharges']]\
    .plot.scatter(figsize=(15, 10),
                  x='MonthlyCharges',
                  y='TotalCharges',
                  color='#e20074')

plt.title('Monthly Charges vs. Total Charges collinearity \n')
        

From the scatter graph, we can see that the variables (features) "TotalCharges" and "MonthlyCharges" are collinear. Dropping one of them will reduce the multicollinearity between correlated features. The best approach is to drop the "TotalCharges" feature and keep the "MonthlyCharges" variable due to its positive correlation with Churn.


# we will drop 'TotalCharges' from the VIF test dataset
ds_vif2 = ds_vif.drop(columns="TotalCharges")

# check collinearity again
vif2 = calculate_vif(ds_vif2)
print(vif2)
        

Dropping the "Total Charges" variable reduced the multicollinearity between correlated features in the test dataset (including "tenure").

Next, "Total Charges" must also be dropped from the main dataset used for Machine Learning algorithms in the final stage of our analysis.


# **** drop the "Total Charges" from main dataset ****
dataset = dataset.drop(columns = "TotalCharges")

print("\n\n-------------------------- dataset -------------------------")
print(dataset.dtypes)
        

We have all numeric values, and we are ready for the next step!


Machine Learning

There are two main types of machine learning problems:

  • Supervised learning
  • Unsupervised learning

SUPERVISED LEARNING

Supervised learning is done on a set of historical data points that we want to use to predict the future. Further categorization of supervised learning:

  • Classification problems - predict a discrete set of values or categories.
  • Regression - predict a continuous scale (Linear and Nonlinear Regression)

A classification problem in machine learning is a predictive modeling problem: a class label is predicted for a given sample of input data (e.g., recognize a handwritten character and classify it as a known character, or check whether an e-mail is spam).

A regression problem, on the other hand, refers to predicting a continuous quantity for a given sample of data (e.g., predict a house price based on the size of the house, the availability of schools in the area, and other essential factors, or predict sales revenue based on historical sales data).

The target variable for churn has two states: Yes or No (1 or 0).

This is a binary classification problem!
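
Before training any model, it is worth checking how the two classes are distributed. A quick sketch on the prepared dataset (in this data set roughly a quarter of the customers churn) gives useful context for reading the accuracy scores later:

# check the class balance of the target variable - churn is the minority class,
# so plain accuracy should be read with this imbalance in mind
print(dataset['Churn'].value_counts())
print(dataset['Churn'].value_counts(normalize=True).round(3))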

UNSUPERVISED LEARNING

Unsupervised learning is used for problems where we have little or no idea what the results should look like. The algorithm looks for hidden structure in the data and clusters the data in the most sensible way based on what is available (Neural Networks - remember the chatbot LEX session).

Split the dataset into dependent and independent variables

Installing scikit-learn (Anaconda usually ships with it already; run this only if it is missing):


pip install -U scikit-learn
        

The next step is to split the dataset into X and y values.

y = ‘Churn’ column (response)

X = other independent variables in the dataset.


# Split the dataset into dependent and independent variables
X = dataset.drop(columns = ['Churn'])
y = dataset['Churn'].values

from sklearn.model_selection import train_test_split

# "train_test_split" will split arrays or matrices into random train and test subsets.

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30, random_state=42, stratify=y)
        

“train_test_split” parameters description:

You should provide either "train_size" or "test_size". The default share of the dataset used for testing is 0.25, or 25%.

"random_state" sets the seed for the random generator, making the splits deterministic. If the seed is not set, it is different each time. Why 42? No particular reason - read Douglas Adams' "The Hitchhiker's Guide to the Galaxy": 42 is the number from which all meaning ("life, the universe, and everything") could be derived.

The "stratify" parameter is used to handle imbalance in the sample. It ensures that the proportion of classes in each split is the same as the proportion of classes in the array passed to stratify. For example, if variable y is a categorical variable with 20% zeros and 80% ones, stratify=y will ensure that the random splits also have 20% zeros and 80% ones (see the sanity check below).

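A quick sanity check (a small sketch, assuming the split above) confirms that the churn proportion is preserved in all the subsets:

# verify that stratify=y preserved the churn proportion in the train and test subsets
print('Churn rate - full set : %.3f' % y.mean())
print('Churn rate - train set: %.3f' % y_train.mean())
print('Churn rate - test set : %.3f' % y_test.mean())
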
ML predictions through various models and algorithms


# Let’s introduce a DataFrame for comparison of ML algorithms.
model_comparison = pd.DataFrame(columns=['Model','Accuracy','Execution time'])
        


1) LOGISTIC REGRESSION

Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.


# LOGISTIC REGRESSION
from sklearn.linear_model import LogisticRegression

logistic_regression_model = LogisticRegression()

# measurement of execution time
import time
t0 = time.time()
logistic_regression_model.fit(X_train, y_train)
t1 = time.time()
        

Evaluation of model Accuracy:


accuracy_logistic_regression = logistic_regression_model.score(X_test,y_test)
print("\n\n-----------------------------------------------------------")
print("Accuracy of Logistic Regression: ", accuracy_logistic_regression)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")
        

-----------------------------------------------------------

Accuracy of Logistic Regression: 0.8004739336492891

Execution time: 0.12232900 seconds

-----------------------------------------------------------

Additionally, we can introduce classification_report:


from sklearn.metrics import confusion_matrix, classification_report

logistic_regression_prediction = logistic_regression_model.predict(X_test)
logistic_regression_report = classification_report(y_test, logistic_regression_prediction)

print(logistic_regression_report)
        
No alt text provided for this image

Metrics Definition

  • Precision - the ratio of true positives to the sum of true and false positives (see the sketch after this list).
  • Recall - the ratio of true positives to the sum of true positives and false negatives.
  • F1 Score - the weighted harmonic mean of precision and recall. The closer the F1 score is to 1.0, the better the expected performance of the model.
  • Support - the number of actual occurrences of the class in the dataset. It does not vary between models; it only puts the other metrics into context.
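
To make these definitions concrete, here is a minimal sketch that derives precision, recall, and F1 for the positive class (churn = 1) by hand from the confusion matrix of the logistic regression predictions above:

from sklearn.metrics import confusion_matrix

# unpack the binary confusion matrix: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_test, logistic_regression_prediction).ravel()

precision = tp / (tp + fp)                            # of all predicted churners, how many really churned
recall = tp / (tp + fn)                               # of all actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall

print('Precision: %.3f  Recall: %.3f  F1: %.3f' % (precision, recall, f1))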

Confusion matrix

A confusion matrix is a table used to describe the performance of a classification model on a test data set for which the true values are known.


plt.figure(10, figsize=(4, 3))
sns.heatmap(confusion_matrix(y_test, logistic_regression_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)

plt.title("Logistic Regression Confusion Matrix", fontsize=16)
plt.show()
        

As a final step for the Logistic Regression algorithm, the results are entered into the DataFrame for model comparison.


# note: DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
model_comparison = pd.concat([model_comparison, pd.DataFrame([
    {'Model': 'Logistic Regression',
     'Accuracy': accuracy_logistic_regression,
     'Execution time': '%0.8f seconds' % (t1 - t0)}])], ignore_index=True)
        


2) DECISION TREE

A Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for classification.

A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees. It is a fast technique but can lack accuracy.


# DECISION TREE

from sklearn.tree import DecisionTreeClassifier

decision_tree_model = DecisionTreeClassifier()

t0 = time.time()
decision_tree_model.fit(X_train, y_train)
t1 = time.time()

accuracy_decision_tree = decision_tree_model.score(X_test, y_test)
print("\n\n-----------------------------------------------------------")
print("Accuracy of Decision Tree: ", accuracy_decision_tree)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")

# the Decision Tree Classifier gives a relatively low accuracy score

decision_tree_prediction = decision_tree_model.predict(X_test)

plt.figure(11, figsize=(4, 3))
sns.heatmap(confusion_matrix(y_test, decision_tree_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)

plt.title("Decision Tree Classifier Confusion Matrix", fontsize=16)
plt.show()

model_comparison = pd.concat([model_comparison, pd.DataFrame([
    {'Model': 'Decision Tree Classifier',
     'Accuracy': accuracy_decision_tree,
     'Execution time': '%0.8f seconds' % (t1 - t0)}])], ignore_index=True)
        


3) RANDOM FOREST

The Random Forest classifier builds a number of decision trees on various subsets of the given dataset. Predictive accuracy is improved by combining the predictions of all the trees: the votes of every tree in the forest are collected, and the majority vote gives the final output.


# RANDOM FOREST

from sklearn.ensemble import RandomForestClassifier

# note: max_features="auto" was removed in newer scikit-learn versions;
# "sqrt" is the equivalent setting for classifiers
random_forest_model = RandomForestClassifier(n_estimators=500,
                                             oob_score=True, n_jobs=-1,
                                             random_state=42, max_features="sqrt",
                                             max_leaf_nodes=30)

t0 = time.time()
random_forest_model.fit(X_train, y_train)
t1 = time.time()

accuracy_random_forest = random_forest_model.score(X_test, y_test)

print("\n\n-----------------------------------------------------------")
print("Accuracy of Random Forest: ", accuracy_random_forest)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")

random_forest_prediction = random_forest_model.predict(X_test)

plt.figure(12, figsize=(4, 3))
sns.heatmap(confusion_matrix(y_test, random_forest_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)

plt.title("Random Forest Classifier Confusion Matrix", fontsize=16)
plt.show()

model_comparison = pd.concat([model_comparison, pd.DataFrame([
    {'Model': 'Random Forest Classifier',
     'Accuracy': accuracy_random_forest,
     'Execution time': '%0.8f seconds' % (t1 - t0)}])], ignore_index=True)
        


4) SUPPORT VECTOR MACHINE (SVC)

Support Vector Machine (SVM) is a popular supervised learning algorithm used for both classification and regression problems, though it is primarily used for classification. The SVM algorithm finds the best decision boundary (the hyperplane) that segregates the n-dimensional feature space into classes, so that new data points can quickly be placed in the correct category.


# SUPPORT VECTOR MACHINE

from sklearn.svm import SVC

svc_model = SVC(random_state=42)

t0 = time.time()
svc_model.fit(X_train, y_train)
t1 = time.time()

accuracy_svc = svc_model.score(X_test, y_test)

print("\n\n-----------------------------------------------------------")
print("Accuracy of Support Vector Machine: ", accuracy_svc)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")

svc_prediction = svc_model.predict(X_test)

plt.figure(13, figsize=(4, 3))
sns.heatmap(confusion_matrix(y_test, svc_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)

plt.title("Support Vector Machine Confusion Matrix", fontsize=16)
plt.show()

model_comparison = pd.concat([model_comparison, pd.DataFrame([
    {'Model': 'Support Vector Machine',
     'Accuracy': accuracy_svc,
     'Execution time': '%0.8f seconds' % (t1 - t0)}])], ignore_index=True)
        


5) K-NEAREST NEIGHBOR (KNN)

K-Nearest Neighbor (KNN) is a classification algorithm used for assigning a class to a new data point. K is an integer value specified by the user, and the classifier determines the class of a data point by the majority voting principle.

For example, for K=4, the algorithm checks classes of 4 closest points, and the majority class will determine the prediction.


# K-NEAREST NEIGHBOR (KNN)

from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=10)

t0 = time.time()
knn_model.fit(X_train, y_train)
t1 = time.time()

accuracy_knn = knn_model.score(X_test, y_test)

print("\n\n-----------------------------------------------------------")
print("Accuracy of K-Nearest Neighbor: ", accuracy_knn)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")

knn_prediction = knn_model.predict(X_test)

plt.figure(14, figsize=(4, 3))
sns.heatmap(confusion_matrix(y_test, knn_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)

plt.title("K-Nearest Neighbor Confusion Matrix", fontsize=16)
plt.show()

model_comparison = pd.concat([model_comparison, pd.DataFrame([
    {'Model': 'K-Nearest Neighbor',
     'Accuracy': accuracy_knn,
     'Execution time': '%0.8f seconds' % (t1 - t0)}])], ignore_index=True)
        


Model comparison


Our model comparison was made on only two simple measures: "Accuracy" and "Execution time." More complex metrics can be developed, and model performance can improve significantly with hyperparameter tuning of the ML models.

Even a quick look at the results shows that there is a trade-off between accuracy and execution time, and no single model is optimal for both. In some situations we need high accuracy and time is not a factor; in other cases fast decision-making is required and "false positives" are not a problem (e.g., search engines).
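
To wrap up the comparison, a short sketch prints the collected table sorted by accuracy (assuming the model_comparison DataFrame filled in above):

# print the collected results, best accuracy first
print(model_comparison.sort_values(by='Accuracy', ascending=False).to_string(index=False))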

This brings us to the end of this tutorial!

Now you have firsthand experience with some basics of machine learning, and you learned some cool Python tricks.

Start exploring the data science universe!


Neven Dujmović, April 2022


