Machine Learning fast-track: Telco Customer Churn Prediction
In this tutorial, we will unleash the powerful analytical capabilities provided by the Python environment!
Our goal will be to analyze the data and utilize machine learning algorithms to predict customer churn. This will be done using data provided by Kaggle, the machine learning and data science community (https://www.kaggle.com/).
The Telco customer churn sample data set (IBM Sample Data Sets) is a well-known and widely used resource for practicing data analysis and training machine learning algorithms: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
The sample data about customer churn will give us insight into customer behavior, and the results of our analyses will be a key building block of a customer retention program.
In order to do so, the session will guide you through two preparation steps:
1) Set your data analytics environment
2) Set your data analytics goal
After that, the machine learning library for Python, "sklearn" (scikit-learn), will greet you with open arms.
Who can use this tutorial?
Everybody is welcome. Those with more experience in Python and data analytics will get more insight into the details, but please don't be discouraged by the program code. Code is your friend. The only prerequisite is an open mind and a willingness to learn about new technology.
Hashtags: #python #dataanalytics #ai #artificialintelligence #ml #machinelearning #datascience #customerchurn #telco #pandas #sklearn #dataanalysis #anaconda
Set your data analytics environment
Install the Anaconda distribution for Python on your Windows machine
My recommendation is to use the Anaconda distribution for Python for data analysis and other purposes. Anaconda offers a free and open-source distribution of the Python programming language for scientific computing (data science, machine learning applications, etc.). The main advantage is simplified package management and deployment (over 250 packages automatically installed, and over 7,500 additional open-source packages).
This will start the download and installation, which can take a while depending on your system performance. Be patient: a lot of cool packages for data analytics are coming your way!
If you have time, I encourage you to review the documentation, but since this is a fast tutorial, we will not do it now. Let's instead use our data analytics environment to do some serious data processing!
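Before moving on, here is an optional quick check that the main packages used in this tutorial are available in your environment (a minimal sketch; if any import fails, you can install the missing package with Anaconda or pip, as we do for scikit-learn further down):
# optional sanity check: confirm that the key packages are installed
import sys
import pandas as pd
import sklearn

print('Python version:', sys.version)
print('pandas version:', pd.__version__)
print('scikit-learn version:', sklearn.__version__)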
Set your data analytics goal
Churn prediction is often used in the context of machine learning. Churn in telco and other subscription-based services means a situation where the customer leaves the service provider. For a telco, churn is one of the most important KPIs and risks, so companies develop models to prevent it and to calculate the likelihood of churn. The better the prediction, the better the sustainability of the business model. Knowing the drivers of churn and its probability is crucial for developing a defense plan and understanding the reasons for dissatisfaction with the service.
From source: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113
The Telco customer churn data contains information about a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3. It indicates which customers have left, stayed, or signed up for their service. Multiple important demographics are included for each customer, as well as a Satisfaction Score, Churn Score, and Customer Lifetime Value (CLTV) index.
Our data analytics project will deal with the following problem: predicting which customers are likely to churn, based on the available customer data.
The results of the analyses can be used to automate decision-making in customer retention programs.
Data preparation
The first step is to load the data:
import pandas as pd
#--------------#
# LOADING DATA #
#--------------#
# https://www.kaggle.com/blastchar/telco-customer-churn
df_source = pd.read_csv('TelcoCustomerChurn.csv')
df = df_source.copy()
print('\n\n-------------------------------')
print(df.info())
describe_df = df.describe()
Only "SeniorCitizen", "tenure" and "MonthlyCharges" are assigned to simple data types.
Other data is interpreted as an "object" type that is not ML friendly.
Analysis of the data columns identifies independent and dependent variables:
X are the independent variables, i.e., the variables we are using to make predictions
y is the dependent variable, i.e., the variable we are trying to predict or estimate
Data cleaning is a recommended first step in any analysis.
Let's drop 'customerID' - no sense in using it for analysis.
# let's drop 'customerID' - no sense in using it for analysis
df = df.drop(['customerID'], axis = 1)
print('\n\n-------------------------------')
print(df.dtypes)
Column "TotalCharges" is an object, and we have to convert it to a numeric value (errors='coerce' - invalid parsing will be set as NaN).
# column 'TotalCharges' is object, and we have to convert it to numeric value
# errors='coerce' - invalid parsing will be set as NaN
df['TotalCharges'] = pd.to_numeric(df.TotalCharges, errors='coerce')
Check for null entries:
#------------------------#
# CHECK FOR NULL ENTRIES #
#------------------------#
print('\n\n------- Check for null entries ------------')
print(df.isnull().sum())
import numpy as np
# after conversion 'TotalCharges' has 11 missing values
tempNaN = df[np.isnan(df['TotalCharges'])]
An interesting inconsistency is detected: the records with the missing "TotalCharges" values all have "tenure" equal to 0, which does not make much sense for active customers.
Let's find all records with tenure = 0 values.
# we have to find records with 'tenure' = 0 values
tempTenureZero = df[df['tenure'] == 0]
No additional cases are found (these are the same 11 records).
We generally have two strategies for dealing with null entries:
1) drop records with tenure = 0 values
# drop records with 'tenure' = 0 values
df.drop(df[df['tenure'] == 0].index, inplace = True)
2) or replace missing values with substitute values; in that case, using the column average is a common strategy (a minimal sketch of this alternative follows below).
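For illustration only, here is how the second strategy could look for the "TotalCharges" column, assuming we wanted to keep the 11 records instead of dropping them (we do not apply this in the tutorial, so the line is left commented out):
# ALTERNATIVE strategy (not used in this tutorial): impute missing 'TotalCharges'
# with the column mean instead of dropping the affected records
# df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())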
Let’s use the simple approach and delete records with corrupt data.
print('\n\n-------- Null entries are resolved ----------')
print(df.isnull().sum())
Encode categorical labels with appropriate numerical values.
Label Encode Binary data:
Independent variables for machine learning algorithms can typically only have numerical values. Label encoding is used for all categorical variables with only two unique values.
Change the data type for categorical data candidates:
#-------------------------------------------------------------#
# ENCODE CATEGORICAL LABELS WITH APPROPRIATE NUMERICAL VALUES #
#-------------------------------------------------------------#
# Label-Encoding for Categorical Data
# change data type for categorical data candidates
cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
        'PhoneService', 'PaperlessBilling', 'Churn']
df[cols] = df[cols].astype('category')
print('\n\n-------------------------------')
print(df.dtypes)
# label encoding for categorical data candidates
for column in cols:
    df[column] = df[column].cat.codes
Data exploration
The best approach to understanding the given dataset is to explore and visualize the data. The distribution of the independent variables will give us insight into the patterns in the data and potentially help us form some hypotheses.
The first step is to import the required tools for data visualization:
import matplotlib.pyplot as plt
import seaborn as sns
Histograms are a great way to explore columns with numerical data.
# HISTOGRAMS FOR COLUMNS WITH NUMERICAL DATA
ds_histograms = df[['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                    'tenure', 'PhoneService', 'PaperlessBilling',
                    'MonthlyCharges']]
fig1 = plt.figure(1, figsize=(15, 12))
plt.suptitle('Histograms for columns with numerical data\n',
             horizontalalignment="center", fontstyle="normal",
             fontsize=24, fontfamily="sans-serif")
for i in range(ds_histograms.shape[1]):
    plt.subplot(6, 3, i + 1)
    f = plt.gca()
    f.set_title(ds_histograms.columns.values[i])
    vals = np.size(ds_histograms.iloc[:, i].unique())
    if vals >= 100:
        vals = 100
    plt.hist(ds_histograms.iloc[:, i], bins=vals, color='#e20075')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
Note: Label-Encoding for gender (customers are 49.5 % female and 50.5 % male).
Female = 0, Male = 1
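If you want to verify this split, a quick check of the encoded column can be run (a small sketch; after label encoding, 0 = Female and 1 = Male):
# proportions of the label-encoded 'gender' column (0 = Female, 1 = Male)
print(df['gender'].value_counts(normalize=True).round(3))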
By reviewing the histograms for the numerical variables, we get a first overview of how the customer base is distributed (gender, senior citizens, tenure, monthly charges, and so on).
The next good idea is to analyze the payment method.
# ANALYZE PAYMENT METHOD
countPaymentMethod = df.groupby('PaymentMethod')['PaymentMethod'].count()
# total number of records, used to convert pie-chart percentages back to absolute values
total = int(np.sum(df['PaymentMethod'].count()))
mylabels = ["Bank transfer (automatic)", "Credit card (automatic)", "Electronic check", "Mailed check"]
fig2, ax2 = plt.subplots()
# show absolute values in 'PaymentMethod' groups
ax2.pie(countPaymentMethod, labels=mylabels, autopct=lambda p: '{:.0f}'.format(p * total / 100))
# we will set an equal aspect ratio to place pie in a circle
ax2.axis('equal')
plt.tight_layout()
plt.show()
The data show that most customers pay their bills by electronic check, followed by mailed checks, bank transfers, and credit cards.
Just as a visualization exercise and an optional step, we can also include subplots for the service data.
Version with bar charts:
# SUBPLOTS FOR SERVICE DATA
service_labels = ['MultipleLines', 'InternetService', 'OnlineSecurity',
                  'OnlineBackup', 'DeviceProtection',
                  'TechSupport', 'StreamingTV', 'StreamingMovies']
# bar charts
fig3, axes = plt.subplots(nrows=2, ncols=4, figsize=(16, 10))
for i, item in enumerate(service_labels):
    if i < 2:
        ax = df[item].value_counts().plot(kind='bar', ax=axes[i, 0], rot=0, color='#e20074')
        ax.set_title(item)
    elif i >= 2 and i < 4:
        ax = df[item].value_counts().plot(kind='bar', ax=axes[i - 2, 1], rot=0, color='#c8b45a')
        ax.set_title(item)
    elif i >= 4 and i < 6:
        ax = df[item].value_counts().plot(kind='bar', ax=axes[i - 4, 2], rot=0, color='#00a8e6')
        ax.set_title(item)
    elif i < 8:
        ax = df[item].value_counts().plot(kind='bar', ax=axes[i - 6, 3], rot=0, color='#ecccbf')
        ax.set_title(item)
Version with pie charts:
# pie charts
fig4, axes2 = plt.subplots(nrows=2, ncols=4, figsize=(16, 10))
for i, item in enumerate(service_labels):
    # place the chart in the proper grid cell (two rows, four columns)
    if i < 2:
        ax1 = plt.subplot2grid((2, 4), (i, 0))
    elif i < 4:
        ax1 = plt.subplot2grid((2, 4), (i - 2, 1))
    elif i < 6:
        ax1 = plt.subplot2grid((2, 4), (i - 4, 2))
    else:
        ax1 = plt.subplot2grid((2, 4), (i - 6, 3))
    counts = df[item].value_counts()
    plt.pie(counts, autopct='%.0f%%')
    # take the legend labels from the same value_counts() result so they match the slice order
    plt.legend(counts.index.tolist(), loc='lower left', bbox_to_anchor=(0.0, -0.2))
    ax1.set_title(item)
Another very useful data exploration method is to check the correlation between variables.
A common rule of thumb for interpreting the size of a correlation coefficient (in absolute value): 0.90 to 1.00 very high, 0.70 to 0.90 high, 0.50 to 0.70 moderate, 0.30 to 0.50 low, and 0.00 to 0.30 negligible correlation.
a) Correlation between all variables
# Correlation between all variables
plt.figure(5, figsize=(25, 10))
corr = df.apply(lambda x: pd.factorize(x)[0]).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
ax = sns.heatmap(corr, mask=mask, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, linewidths=.2, cmap='coolwarm', vmin=-1, vmax=1)
Not much new information can be concluded here. The relations between the data are logical and expected. The strongest correlations are on the services level, between prerequisites for a service and the service itself (e.g., PhoneService and MultipleLines, InternetService and the related additional Internet-based services) and between the Internet services themselves (probably part of the same service package).
b) Correlation between churn and selected boolean and numeric variables
# Correlation between churn and selected boolean and numeric variables
plt.figure(6)
ds_corr = df[['SeniorCitizen', 'Partner', 'Dependents',
              'tenure', 'PhoneService', 'PaperlessBilling',
              'MonthlyCharges', 'TotalCharges']]
correlations = ds_corr.corrwith(df.Churn)
correlations = correlations[correlations!=1]
correlations.plot.bar(
        figsize=(18, 10),
        fontsize=15,
        color='#e20074',
        rot=45, grid=True)
plt.title('Correlation with Churn Rate \n', horizontalalignment="center", fontstyle="normal", fontsize=22, fontfamily="sans-serif")
Here we can make more interesting conclusions: churn correlates negatively with tenure (long-standing customers churn less) and positively with monthly charges and paperless billing.
Hot encoding for categorical data
Before we continue, an additional transformation of the data is needed. In the previous step, we used label encoding for binary data. That prepared the independent variables with only two unique values as numerical values for the machine learning algorithms.
As we explained, Machine Learning algorithms require numerical values for their independent variables. We will introduce dummy columns for independent variables that have categorical data with more than two unique values.
#-----------------------------------#
# HOT ENCODING FOR CATEGORICAL DATA #
#-----------------------------------#
# First we will copy data to new 'dataset' variable to conserve original values
dataset = df.copy()
# Hot-Encoding for categorical data
dataset = pd.get_dummies(dataset)
The resulting dummy columns contain binary values (0 or 1).
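To see exactly which dummy columns get_dummies generated, you can list the columns that were not present before (a quick inspection sketch):
# list the newly created dummy columns (present in 'dataset' but not in the original df)
new_dummy_columns = [col for col in dataset.columns if col not in df.columns]
print(new_dummy_columns)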
With new columns generated, we can further check correlations with churn.
Correlation: Contract type vs. Churn:
# Correlation: Contract type vs. Churn
plt.figure(7)
ds_contract_type_corr = \
    dataset[['Contract_Month-to-month', 'Contract_One year', 'Contract_Two year']]
correlations = ds_contract_type_corr.corrwith(dataset.Churn)
correlations = correlations[correlations!=1]
correlations.plot.bar(
        figsize=(18, 10),
        fontsize=15,
        color='#c8b45a',
        rot=45, grid=True)
plt.title('Correlation: Contract type vs. Churn \n')
Month-to-month type of subscription is most exposed to a churn risk. Longer contract duration is a good churn prevention mechanism.
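To make this concrete, here is an optional quick check of the raw churn rate per contract type, using the original df that still contains the categorical 'Contract' column (an illustrative sketch):
# share of churned customers per contract type (label-encoded 'Churn': 1 = churn)
print(df.groupby('Contract')['Churn'].mean().round(3))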
Correlation: Payment method vs. Churn
# Correlation: PaymentMethod vs. Churn
plt.figure(8)
ds_payment_method_corr = \
    dataset[['PaymentMethod_Bank transfer (automatic)',
             'PaymentMethod_Credit card (automatic)',
             'PaymentMethod_Electronic check', 'PaymentMethod_Mailed check']]
correlations = ds_payment_method_corr.corrwith(dataset.Churn)
correlations = correlations[correlations!=1]
correlations.plot.bar(
        figsize=(18, 10),
        fontsize=15,
        color='#00a8e6',
        rot=45, grid=True)
plt.title('Correlation: Payment method vs. Churn \n')
Reasons for a positive correlation between electronic check as a payment method and Churn have to be investigated.
Multicollinearity
A high correlation between two or more independent variables leads to a phenomenon in data science called multicollinearity. In other words, an independent variable can be predicted from another independent variable. Variables with high multicollinearity are redundant, and they make it hard to interpret the model and could create an overfitting problem.
VIF (Variance Inflation Factor) is a great tool to check multicollinearity. VIF determines the strength of the correlation of a variable with a group of other independent variables in a dataset. VIF starts at 1, and values above 10 indicate high multicollinearity between the independent variables.
Note: Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. As a result, the model is useful in reference only to its initial data set, and not to any other data sets (https://www.investopedia.com/terms/o/overfitting.asp).
#--------------------#
# MULTICOLLINEARITY  #
#--------------------#
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(X):
    # calculate the Variance Inflation Factor for every column of X
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["Variance Inflation Factor"] = [variance_inflation_factor(X.values, i)
                                        for i in range(X.shape[1])]
    return vif
ds_vif = dataset[['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                  'tenure', 'PhoneService', 'PaperlessBilling',
                  'MonthlyCharges', 'TotalCharges']]
vif = calculate_vif(ds_vif)
We can notice that the features "Monthly Charges" and "Total Charges" have a high VIF value. Let's use a scatter graph and plot "Monthly Charges" and "Total Charges" values to check how they correlate.
plt.figure(9)
ds_vif[['MonthlyCharges', 'TotalCharges']]\
    .plot.scatter(figsize=(15, 10),
                  x='MonthlyCharges',
                  y='TotalCharges',
                  color='#e20074')
plt.title('Monthly Charges vs. Total Charges collinearity \n')
From the scatter graph, we can see that the variables (features) "TotalCharges" and "MonthlyCharges" are collinear. Dropping one of those features will reduce the multicollinearity between the correlated features. The best approach is to drop the "TotalCharges" feature and keep the "MonthlyCharges" variable due to its positive correlation with Churn.
# we will drop 'TotalCharges' from VIF test dataset
ds_vif2 = ds_vif.drop(columns = "TotalCharges")
# check collinearity again
vif2 = calculate_vif(ds_vif2)
Dropping the "Total Charges" variable reduced the multicollinearity between correlated features in the test dataset (including "tenure").
Next, "Total Charges" must also be dropped from the main dataset used for Machine Learning algorithms in the final stage of our analysis.
# **** drop the "Total Charges" from main dataset ****
dataset = dataset.drop(columns = "TotalCharges")
print("\n\n-------------------------- dataset -------------------------")
print(dataset.dtypes)
We have all numeric values, and we are ready for the next step!
Machine Learning
There are two main types of machine learning problems:
SUPERVISED LEARNING
Supervised learning is done on a set of historical data points that we want to use to predict the future. It is further categorized into classification and regression problems:
The classification problem in machine learning is a predictive modeling problem. That refers to the type of problem where a class label is predicted for a provided sample of input data (e.g., recognize a handwritten character and classify it as a known character, check if the mail is spam).
On the other side, the regression problem refers to predicting a continuous quantity output for a given sample of data (e.g., predict the house price based on the size of the house, availability of schools in the area, and other essential factors, or predict sales revenue based on historical sales data).
The target variable for the churn has two states: yes or no / 1 or 0
This is a binary classification problem!
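A quick way to confirm this on our prepared dataset is to look at the distribution of the target variable (a small check using the dataset prepared above):
# distribution of the label-encoded target variable: 0 = no churn, 1 = churn
print(dataset['Churn'].value_counts())
print(dataset['Churn'].value_counts(normalize=True).round(3))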
UNSUPERVISED LEARNING
Used to resolve problems for which we have little or no idea what the results should be. The algorithm looks for hidden features of the data and clusters the data in the most sensible way based on what is available (Neural Networks - remember the chatbot LEX session).
Split the dataset into dependent and independent variables
Installing scikit-learn:
pip install -U scikit-learn
The next step is to split the dataset into X and y values.
y = the 'Churn' column (response)
X = the other, independent variables in the dataset.
# Split the dataset into dependent and independent variables
X = dataset.drop(columns = ['Churn'])
y = dataset['Churn'].values
from sklearn.model_selection import train_test_split
# "train_test_split" will split arrays or matrices into random train and test subsets.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30, random_state=42, stratify=y)
“train_test_split” parameters description:
You should provide either "train_size" or "test_size". The default share of the dataset that will be used for testing is 0.25 or 25%.
"random_state" sets the seed to the random generator (splits deterministic). If the seed is not set, it is different each time. Why 42? No reason. Read Douglas Adams' "The Hitchhiker's Guide to the Galaxy". 42 is the number from which all meaning ("life, the universe, and everything") could be derived.
"stratify" parameter is used to resolve the imbalance in the sample. Stratify parameter ensures that a split, in the proportion of values in the sample, is the same as the proportion of values provided to parameter stratify. For example, if variable y is a categorical variable with values 0 and 1 with 20% of zeros and 80% of ones, stratify=y will ensure that a random split has 20% of zeros and 80% of ones.
ML predictions through various models and algorithms
# Let’s introduce a DataFrame for comparison of ML algorithms.
model_comparison = pd.DataFrame(columns=['Model','Accuracy','Execution time'])
1) LOGISTIC REGRESSION
Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.
# LOGISTIC REGRESSION
from sklearn.linear_model import LogisticRegression
logistic_regression_model = LogisticRegression()
# measurement of execution time
import time
t0 = time.time()
logistic_regression_model.fit(X_train, y_train)
t1 = time.time()
Evaluation of model Accuracy:
accuracy_logistic_regression = logistic_regression_model.score(X_test,y_test)
print("\n\n-----------------------------------------------------------")
print("Accuracy of Logistic Regression: ", accuracy_logistic_regression)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")
-----------------------------------------------------------
Accuracy of Logistic Regression:  0.8004739336492891
Execution time: 0.12232900 seconds
-----------------------------------------------------------
Additionally, we can introduce classification_report:
from sklearn.metrics import confusion_matrix, classification_report
logistic_regression_prediction = logistic_regression_model.predict(X_test)
logistic_regression_report = classification_report(y_test, logistic_regression_prediction)
print(logistic_regression_report)
Metrics definitions: precision is the share of predicted positives that are truly positive, recall is the share of actual positives that are correctly identified, the F1-score is the harmonic mean of precision and recall, and support is the number of test samples in each class.
Confusion matrix
A confusion matrix is a table used to describe the performance of a classification model on a test data set for which the true values are known.
plt.figure(10)
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, logistic_regression_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)
plt.title("Logistic Regression Confusion Matrix", fontsize=16)
plt.show()
As a final step of using the Logistic Regression algorithm, the results are entered into the DataFrame for model comparison.
# DataFrame.append was removed in pandas 2.0, so we add the row with pd.concat
model_comparison = pd.concat(
    [model_comparison,
     pd.DataFrame([{'Model': 'Logistic Regression',
                    'Accuracy': accuracy_logistic_regression,
                    'Execution time': '%0.8f seconds' % (t1 - t0)}])],
    ignore_index=True)
2) DECISION TREE
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.
A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees. It is a fast technique but often lacks accuracy.
# DECISION TREE
from sklearn.tree import DecisionTreeClassifier
decision_tree_model = DecisionTreeClassifier()
t0 = time.time()
decision_tree_model.fit(X_train,y_train)
t1 = time.time()
accuracy_decision_tree = decision_tree_model.score(X_test, y_test)
print("\n\n-----------------------------------------------------------")
print("Accuracy of Decision Tree: ", accuracy_decision_tree)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")
# the Decision Tree Classifier gives a noticeably lower accuracy score than Logistic Regression
decision_tree_prediction = decision_tree_model.predict(X_test)
plt.figure(11)
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, decision_tree_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)
plt.title("Decision Tree Classifier Confusion Matrix", fontsize=16)
plt.show()
model_comparison = pd.concat(
    [model_comparison,
     pd.DataFrame([{'Model': 'Decision Tree Classifier',
                    'Accuracy': accuracy_decision_tree,
                    'Execution time': '%0.8f seconds' % (t1 - t0)}])],
    ignore_index=True)
3) RANDOM FOREST
The Random Forest classifier builds multiple decision trees on various subsets of the given dataset. Predictive accuracy is improved by taking the prediction from every tree: the votes of all random trees in the forest are processed, and the majority vote gives the ultimate output.
# RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(n_estimators=500,
                                             oob_score=True, n_jobs=-1,
                                             random_state=42, max_features="sqrt",
                                             max_leaf_nodes=30)
# note: max_features="sqrt" is used because the old "auto" setting (equivalent to "sqrt"
# for classifiers) has been removed from recent scikit-learn versions
t0 = time.time()
random_forest_model.fit(X_train, y_train)
t1 = time.time()
accuracy_random_forest = random_forest_model.score(X_test, y_test)
print("\n\n-----------------------------------------------------------")
print("Accuracy of Random Forest: ", accuracy_random_forest)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")
random_forest_prediction = random_forest_model.predict(X_test)
plt.figure(12)
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, random_forest_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)
plt.title("Random Forest Classifier Confusion Matrix", fontsize=16)
plt.show()
model_comparison = pd.concat(
    [model_comparison,
     pd.DataFrame([{'Model': 'Random Forest Classifier',
                    'Accuracy': accuracy_random_forest,
                    'Execution time': '%0.8f seconds' % (t1 - t0)}])],
    ignore_index=True)
4) SUPPORT VECTOR MACHINE (SVC)
Support Vector Machine (SVM) is a popular supervised learning algorithm used for classification and regression problems; in machine learning it is primarily used for classification. The SVM algorithm creates the best-fit line or decision boundary that segregates the n-dimensional space into classes, so that new data points can quickly be put into the correct category in the future. This best decision boundary is called a hyperplane.
# SUPPORT VECTOR MACHINE
from sklearn.svm import SVC
svc_model = SVC(random_state = 42)
t0 = time.time()
svc_model.fit(X_train,y_train)
t1 = time.time()
accuracy_svc = svc_model.score(X_test,y_test)
print("\n\n-----------------------------------------------------------")
print("Accuracy of Support Vector Machine: ", accuracy_svc)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")
svc_prediction = svc_model.predict(X_test)
plt.figure(13)
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, svc_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)
plt.title("Support Vector Machine Confusion Matrix", fontsize=16)
plt.show()
model_comparison = pd.concat(
    [model_comparison,
     pd.DataFrame([{'Model': 'Support Vector Machine',
                    'Accuracy': accuracy_svc,
                    'Execution time': '%0.8f seconds' % (t1 - t0)}])],
    ignore_index=True)
5) K-NEAREST NEIGHBOR (KNN)
K-Nearest Neighbor (KNN) is a classification algorithm used for assigning a class to a new data point. K is an integer value specified by the user, and the classifier determines the class of a data point by the majority voting principle.
For example, for K=4, the algorithm checks the classes of the 4 closest points, and the majority class determines the prediction.
# K-NEAREST NEIGHBOR (KNN)
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors = 10)
t0 = time.time()
knn_model.fit(X_train,y_train)
t1 = time.time()
accuracy_knn = knn_model.score(X_test,y_test)
print("\n\n-----------------------------------------------------------")
print("Accuracy of K-Nearest Neighbor: ", accuracy_knn)
print("Execution time: %0.8f seconds" % (t1 - t0))
print("-----------------------------------------------------------")
knn_prediction = knn_model.predict(X_test)
plt.figure(14)
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, knn_prediction),
            annot=True, fmt="d", linecolor="k", linewidths=3)
plt.title("K-Nearest Neighbor Confusion Matrix", fontsize=16)
plt.show()
model_comparison = pd.concat(
    [model_comparison,
     pd.DataFrame([{'Model': 'K-Nearest Neighbor',
                    'Accuracy': accuracy_knn,
                    'Execution time': '%0.8f seconds' % (t1 - t0)}])],
    ignore_index=True)
Model comparison
Our model comparison was made on only two simple measures: "Accuracy" and "Execution time". More complex metrics can be developed, and model performance can improve significantly with hyperparameter tuning of the ML models.
Even when we look at these simple results, it is clear that there is a trade-off between accuracy and execution time. In some situations we need high accuracy and time is not a factor; in other cases fast decision-making is required and "false positives" are not a big problem (e.g., search engines).
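To wrap up, you can print the collected comparison table sorted by accuracy (a small closing sketch):
# show all collected results, best accuracy first
print("\n\n--------------------- Model comparison ----------------------")
print(model_comparison.sort_values(by='Accuracy', ascending=False).to_string(index=False))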
This brings us to the end of this tutorial!
Now you have firsthand experience with some basics of machine learning, and you learned some cool Python tricks.
Start exploring the data science universe!
Neven Dujmović, April 2022
#python #dataanalytics #ai #artificialintelligence #ml #machinelearning #datascience #customerchurn #telco #pandas #sklearn #dataanalysis #anaconda