RANDOM FOREST MODEL(RFM)
Random Forest?Model:
The random forest model is also a classification model with the combination of the decision tree. The random forest algorithm is a supervised classification algorithm. As the name suggests, this algorithm creates a forest with several trees. … In the same way in the random tree classifier, the higher the number of trees in the forest gives the high the accuracy results. If you know the Random forest algorithm is a supervised classification algorithm.
The random forest model follows an ensemble technique. It involves constructing multi decision trees at training time. Its prediction based on mode for classification and mean for regression tree. It helps to reduce the overfitting of the individual decision tree. There are many possibilities for the occurrence of overfitting.
Working of Random Forest Algorithm
We can understand the working of the Random Forest algorithm with the help of following steps ?
Sampling without replacement stats that the training data split into several small samples and then the result we get is a combination of all the data set. If we have 1000 features in a data set the splitting will happen with 10 features each in a small training data and all split training data contains equal no of features. The result is based on which training data has the highest value.
Step 4?? At last, select the most voted prediction result as the final prediction result.
What is the Out of Bag score in Random Forests?
Out of bag (OOB) score is a way of validating the Random forest model. Below is a simple intuition of how is it calculated followed by a description of how it is different from the validation score and where it is advantageous.
For the description of the OOB score calculation, let’s assume there are five DTs in the random forest ensemble labeled from 1 to 5. For simplicity, suppose we have a simple original training data set as below.
OOB Error Rate Computation Steps
Variable importance of RF:?
It stats about the feature that is most useful for the random forest model by which we can get the high accuracy of the model with less error.
Finding the optimal values using grid-search cv:
It stats the optimal values of the splitting decision tree that how many trees to be split?within the model.
Measuring RF model performance by Confusion Matrix:
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. It tells about how many true values are true.
Random Forest with python:?
Importing the important libraries–
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import svm
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
Read the data from csv
dummy_df = pd.read_csv("bank.csv", na_values =['NA'])
temp = dummy_df.columns.values[0]
temp
print(dummy_df)
Data Pre-Processing:
columns_name = temp.split(';')
data = dummy_df.values
print(data)
print(data.shape)
contacts = list()
for element in data:
contact = element[0].split(';')
contacts.append(contact)
contact_df = pd.DataFrame(contacts,columns = columns_name)
print(contact_df)
def preprocessor(df):
res_df = df.copy()
le = preprocessing.LabelEncoder()
?encoded_df = preprocessor(contact_df)
#encoded_df = preprocessor(contacts)
x = encoded_df.drop(['"y"'],axis =1).values
y = encoded_df['"y"'].values
Split the data into Train-Test?
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size =0.5)
Build the Decision Tree Model
# Decision tree with depth = 2
model_dt_2 = DecisionTreeClassifier(random_state=1, max_depth=2)
model_dt_2.fit(x_train, y_train)
model_dt_2_score_train = model_dt_2.score(x_train, y_train)
print("Training score: ",model_dt_2_score_train)
model_dt_2_score_test = model_dt_2.score(x_test, y_test)
print("Testing score: ",model_dt_2_score_test)
#y_pred_dt = model_dt_2.predict_proba(x_test)[:, 1]
#Decision tree
model_dt = DecisionTreeClassifier(max_depth = 8, criterion ="entropy")
model_dt.fit(x_train, y_train)
y_pred_dt = model_dt.predict_proba(x_test)[:, 1]
Graphical Representation of Tree
plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Performance Metrics
fpr_dt, tpr_dt, _ = roc_curve(y_test, y_pred_dt)
roc_auc_dt = auc(fpr_dt, tpr_dt)
predictions = model_dt.predict(x_test)
# Model Accuracy
print (model_dt.score(x_test, y_test))
y_actual_result = y_test[0]
领英推荐
for i in range(len(predictions)):
if(predictions[i] == 1):
y_actual_result = np.vstack((y_actual_result, y_test[i]))
Recall
#Recall
y_actual_result = y_actual_result.flatten()
count = 0
for result in y_actual_result:
if(result == 1):
count=count+1
print ("true yes|predicted yes:")
print (count/float(len(y_actual_result)))
Area Under the Curve
plt.figure(1)
lw = 2
plt.plot(fpr_dt, tpr_dt, color='green',
lw=lw, label='Decision Tree(AUC = %0.2f)' % roc_auc_dt)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Area Under Curve')
plt.legend(loc="lower right")
plt.show()
Confusion Matrix
print (confusion_matrix(y_test, predictions))
accuracy_score(y_test, predictions)
import itertools
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(model, normalize=False): # This function prints and plots the confusion matrix.
cm = confusion_matrix(y_test, model, labels=[0, 1])
classes=["Success", "Default"]
cmap = plt.cm.Blues
title = "Confusion Matrix"
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm = np.around(cm, decimals=3)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.figure(figsize=(6,6))
plot_confusion_matrix(predictions, normalize=False)
plt.show()
?
Pruning of the tree?
from sklearn.tree._tree import TREE_LEAF
def prune_index(inner_tree, index, threshold):
if inner_tree.value[index].min() < threshold:
# turn node into a leaf by "unlinking" its children
inner_tree.children_left[index] = TREE_LEAF
inner_tree.children_right[index] = TREE_LEAF
# if there are shildren, visit them as well
if inner_tree.children_left[index] != TREE_LEAF:
prune_index(inner_tree, inner_tree.children_left[index], threshold)
prune_index(inner_tree, inner_tree.children_right[index], threshold)
print(sum(model_dt.tree_.children_left < 0))
# start pruning from the root
prune_index(model_dt.tree_, 0, 5)
sum(model_dt.tree_.children_left < 0)
#It means that the code has created 17 new leaf nodes
#(by practically removing links to their ancestors). The tree, which has looked before like
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
plt.figure(figsize=(6,6))
dot_data = StringIO()
export_graphviz(model_dt, out_file=dot_data,
filled=True, rounded=True,
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Learnbay ?provides industry accredited data science courses in Bangalore. We understand the conjugation of technology in the field of Data science hence we offer significant courses like Machine learning, Tensor Flow, IBM Watson, Google Cloud platform, Tableau, Hadoop, time series, R and Python. With authentic real-time industry projects. Students will be efficient by being certified by IBM. Around hundreds of students are placed in promising companies for data science roles. By choosing Learnbay you will reach the most aspiring job of present and future.
Learnbay data science course covers Data Science with Python, Artificial Intelligence with Python, Deep Learning using Tensor-Flow. These topics are covered and co-developed with IBM.
?