Day 03 - Decision Trees

  • Concept: Tree-based model for classification/regression
  • Implementation: Recursive splitting
  • Evaluation: Accuracy, confusion matrix, classification report

CONCEPT

Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. They model decisions and their possible consequences in a tree-like structure where internal nodes represent tests on features, branches represent the outcome of the test, and leaf nodes represent the final prediction (class label or value).
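
To make the structure described above concrete, here is a minimal sketch of a hand-built tree and the traversal that produces a prediction. The Node class, the predict function, and the thresholds are hypothetical names and values chosen for illustration; they are not learned from data and are not how scikit-learn stores its trees internally.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Node:
    """One node of a hand-built binary decision tree (hypothetical structure)."""
    feature: Optional[str] = None       # feature tested at an internal node
    threshold: Optional[float] = None   # go left if value <= threshold
    left: Optional["Node"] = None       # branch taken when the test passes
    right: Optional["Node"] = None      # branch taken when the test fails
    prediction: Union[str, float, None] = None  # set only on leaf nodes


def predict(node: Node, sample: dict):
    """Follow branches from the root until a leaf node is reached."""
    while node.prediction is None:
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# A tiny illustrative tree: the thresholds are made up, not learned from data.
toy_tree = Node(
    feature="Student", threshold=0.5,
    left=Node(feature="Age", threshold=30.0,
              left=Node(prediction="Yes"),
              right=Node(prediction="No")),
    right=Node(prediction="Yes"),
)

print(predict(toy_tree, {"Age": 45, "Student": 0}))  # prints: No
```

Each internal node holds one test, each branch is one outcome of that test, and the leaf that the sample ends up in supplies the prediction.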

For classification, decision trees use measures like Gini impurity or entropy to split the data:

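  • Gini Impurity: Measures how often a randomly chosen element would be misclassified if it were labeled at random according to the class distribution in the node. A pure node has a Gini impurity of 0.
  • Entropy (Information Gain): Entropy measures the amount of uncertainty or impurity in the data. Information gain is the reduction in entropy achieved by a split, and the tree prefers the split with the highest gain.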
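
The two measures above can be sketched in a few lines of NumPy. The function names gini, entropy, and information_gain are our own helpers, not scikit-learn functions. The labels are the Buys_Computer column from the example dataset used later in this article.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


# Target column from the example dataset: 6 'Yes' and 4 'No'
y = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
print(f'Gini impurity: {gini(y):.3f}')   # 1 - (0.6**2 + 0.4**2) = 0.480
print(f'Entropy: {entropy(y):.3f}')      # 0.971

# A split that separates the classes perfectly recovers all of the entropy
print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))  # 1.0
```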

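For regression, decision trees choose the splits that minimize the variance of the target values within each resulting node, which is equivalent to minimizing the mean squared error around the node mean.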
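
As a rough illustration of the regression case, the sketch below searches a single numeric feature for the threshold that minimizes the size-weighted variance of the two child nodes. The best_split helper and the small x and y arrays are made up for this example. A real implementation such as scikit-learn's DecisionTreeRegressor repeats this search over every feature and applies it recursively.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, score) minimizing the weighted variance of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_threshold, best_score = None, np.inf
    values = np.unique(x)
    # Candidate thresholds are midpoints between consecutive distinct values
    for threshold in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= threshold], y[x > threshold]
        # Size-weighted average of the child variances (MSE around each mean)
        score = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up data with two clearly separated groups of target values
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 5.5, 5.2, 20.0, 21.0, 19.5]

threshold, score = best_split(x, y)
print(f'Best threshold: {threshold}')  # 6.5, the midpoint between 3 and 10
print(f'Weighted variance after split: {score:.3f}')
print(f'Variance before split: {np.var(y):.3f}')
```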

IMPLEMENTATION EXAMPLE

Suppose we have a dataset with features like age, income, and student status to predict whether a person buys a computer.

# Import necessary libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

# Example data

data = {
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': ['High', 'High', 'High', 'Medium', 'Low', 'Low', 'Low', 'Medium', 'Low', 'Medium'],
    'Student': ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
    'Buys_Computer': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
}

df = pd.DataFrame(data)
print(df)

# Converting categorical features to numeric

df['Income'] = df['Income'].map({'Low': 1, 'Medium': 2, 'High': 3})
df['Student'] = df['Student'].map({'No': 0, 'Yes': 1})
df['Buys_Computer'] = df['Buys_Computer'].map({'No': 0, 'Yes': 1})

# Defining independent variables (features) and dependent variable (target)

X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']  # 1-D Series, the usual shape for a single target

# Splitting the data into training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Creating and training the decision tree model

model = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state = 42)
model.fit(X_train, y_train)

# Making predictions

y_pred = model.predict(X_test)

# Evaluating the model

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')

# Plotting the decision tree

plt.figure(figsize = (12, 8))
plot_tree(model, feature_names = ['Age', 'Income', 'Student'], class_names = ['No', 'Yes'], filled = True)
plt.title('Decision Tree')
plt.show()

EXPLANATION OF THE CODE

  1. Libraries: We import necessary libraries like numpy, pandas, sklearn, and matplotlib.
  2. Data Preparation: We create a DataFrame containing features and the target variable. Categorical features are converted to numeric values.
  3. Feature and Target: We separate the features (Age, Income, Student) and the target (Buys_Computer).
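  4. Train-Test Split: We split the data into training (80%) and testing (20%) sets. With only 10 rows, the test set contains just 2 samples, so the reported scores are illustrative rather than reliable.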
  5. Model Training: We create a DecisionTreeClassifier model, specifying the criterion (Gini impurity) and maximum depth of the tree, and train it using the training data.
  6. Predictions: We use the trained model to predict whether a person buys a computer for the test set.
  7. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
  8. Visualization: We plot the decision tree to visualize the decision-making process.
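
As a text-based complement to the plot in step 8, scikit-learn's export_text prints the learned rules as an indented list. The sketch below is self-contained: it rebuilds the example data with the categorical columns already mapped to numbers, and for simplicity it fits on all 10 rows rather than on the training split, so the rules may differ from the tree trained above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Example data with the same numeric encoding as above
# (Income: Low=1, Medium=2, High=3; Student and Buys_Computer: No=0, Yes=1)
df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': [3, 3, 3, 2, 1, 1, 1, 2, 1, 2],
    'Student': [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    'Buys_Computer': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
})

features = ['Age', 'Income', 'Student']
model = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
model.fit(df[features], df['Buys_Computer'])  # fitted on all rows for illustration

# Each indented line is a test on a feature; 'class:' lines are leaf predictions
rules = export_text(model, feature_names=features)
print(rules)
```

Reading the output from top to bottom follows the same path a sample takes through the tree.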

EVALUATION METRICS

  • Accuracy: The fraction of test samples that the model classifies correctly.
  • Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives.
  • Classification Report: Provides precision, recall, F1-score, and support for each class.
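
To show how these metrics relate to the confusion matrix, the sketch below derives accuracy, precision, recall, and F1-score from the four counts. The counts are hypothetical numbers chosen for easy arithmetic, not results from the model above, and precision_recall_f1 is our own helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical confusion-matrix counts (not output from the model above)
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f'Accuracy:  {accuracy:.3f}')   # 85 / 100 = 0.850
print(f'Precision: {precision:.3f}')  # 40 / 45  = 0.889
print(f'Recall:    {recall:.3f}')     # 40 / 50  = 0.800
print(f'F1-score:  {f1:.3f}')         # 80 / 95  = 0.842
```

Support, the remaining column in the classification report, is simply the number of true samples of each class in the test set.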

Download the Jupyter Notebook file for Day 03 here.
