Day 03 - Decision Trees

Ime Eti-mfon

Data Scientist | Machine Learning Engineer | Data Program Community Ambassador @ ALX

Published: January 22, 2025

Concept: Tree-based model for classification/regression
Implementation: Recursive splitting
Evaluation: Accuracy, Gini impurity

CONCEPT

Decision trees are a non-parametric supervised learning method used for both classification and regression tasks. They model decisions and their possible consequences in a tree-like structure where internal nodes represent tests on features, branches represent the outcome of the test, and leaf nodes represent the final prediction (class label or value).
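
To make the node/branch/leaf structure concrete, here is a minimal sketch of how a prediction walks down a tree. The tree below is hand-written for illustration (it is not learned from data), and the dictionary layout and the `predict` helper are assumptions of this sketch, not scikit-learn's internal representation.

```python
# A hypothetical, hand-written tree: internal nodes hold a feature test,
# leaves hold the final prediction.
tree = {
    'feature': 'Student', 'threshold': 0.5,
    'left': {'leaf': 'No'},                      # Student <= 0.5 (not a student)
    'right': {                                   # Student > 0.5 (student)
        'feature': 'Age', 'threshold': 30,
        'left': {'leaf': 'Yes'},                 # Age <= 30
        'right': {'leaf': 'No'},                 # Age > 30
    },
}

def predict(node, sample):
    """Follow the branches until a leaf is reached, then return its label."""
    if 'leaf' in node:
        return node['leaf']
    branch = 'left' if sample[node['feature']] <= node['threshold'] else 'right'
    return predict(node[branch], sample)

print(predict(tree, {'Student': 1, 'Age': 25}))
```

Each call to `predict` answers one test and moves one level deeper, so the number of tests per prediction is bounded by the depth of the tree.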

For classification, decision trees use measures like Gini impurity or entropy to split the data:

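Gini Impurity: Measures the probability that a randomly chosen element in a node would be misclassified if it were labeled at random according to the node's class distribution.
Entropy (Information Gain): Entropy measures the amount of uncertainty or impurity in the data; information gain is the reduction in entropy achieved by a split.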
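
As a rough illustration of the two measures, the sketch below computes them with the standard formulas (Gini = 1 - Σ p_k², entropy = -Σ p_k log2 p_k) on the Buys_Computer column from the example dataset used in this article. The function names are my own; scikit-learn computes these internally and does not expose them under these names.

```python
from collections import Counter
from math import log2

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

# Target column of the example dataset: 6 'Yes' and 4 'No'
buys = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']

print(f'Gini impurity: {gini_impurity(buys):.2f}')   # 1 - (0.6^2 + 0.4^2) = 0.48
print(f'Entropy: {entropy(buys):.3f} bits')
```

Both measures are 0 for a pure node (all labels identical) and largest when the classes are evenly mixed, which is why a split that lowers them is preferred.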

For regression, decision trees choose splits that minimize the variance of the target within each child node (equivalently, the mean squared error around each child's mean prediction).
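
The regression criterion can be sketched for a single numeric feature as follows. This is a simplified illustration under my own assumptions (one feature, candidate thresholds at midpoints between sorted values, made-up toy data), not the optimized procedure a library would use.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def best_split(xs, ys):
    """Return (threshold, score) minimizing the size-weighted child variance."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_threshold, best_score = None, float('inf')
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * variance(left) + len(right) * variance(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Hypothetical toy data with two obvious clusters
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, score = best_split(xs, ys)
print(threshold, round(score, 4))  # 6.5 0.6667
```

A full tree would apply the same search to every feature, keep the best one, and then repeat the process recursively on each child.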

IMPLEMENTATION EXAMPLE

Suppose we have a dataset with features like age, income, and student status to predict whether a person buys a computer.

# Import necessary libraries

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt

# Example data

data = {
    'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
    'Income': ['High', 'High', 'High', 'Medium', 'Low', 'Low', 'Low', 'Medium', 'Low', 'Medium'],
    'Student': ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
    'Buys_Computer': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes']
}

df = pd.DataFrame(data)
df

# Converting categorical features to numeric

df['Income'] = df['Income'].map({'Low': 1, 'Medium': 2, 'High': 3})
df['Student'] = df['Student'].map({'No': 0, 'Yes': 1})
df['Buys_Computer'] = df['Buys_Computer'].map({'No': 0, 'Yes': 1})

# Defining independent variables (features) and the dependent variable (target)

X = df[['Age', 'Income', 'Student']]
y = df['Buys_Computer']  # 1-D Series; a single-column DataFrame can trigger a shape warning in scikit-learn

# Splitting the data into training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Creating and training the decision tree model

model = DecisionTreeClassifier(criterion = 'gini', max_depth = 3, random_state = 42)
model.fit(X_train, y_train)

# Making predictions

y_pred = model.predict(X_test)

# Evaluating the model

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')

# Plotting the decision tree

plt.figure(figsize = (12, 8))
plot_tree(model, feature_names = ['Age', 'Income', 'Student'], class_names = ['No', 'Yes'], filled = True)
plt.title('Decision Tree')
plt.show()

EXPLANATION OF THE CODE

Libraries: We import necessary libraries like numpy, pandas, sklearn, and matplotlib.
Data Preparation: We create a DataFrame containing features and the target variable. Categorical features are converted to numeric values.
Feature and Target: We separate the features (Age, Income, Student) and the target (Buys_Computer).
Train-Test-Split: We split the data into training and testing sets.
Model Training: We create a DecisionTreeClassifier model, specifying the criterion (Gini impurity) and maximum depth of the tree, and train it using the training data.
Predictions: We use the trained model to predict whether a person buys a computer for the test set.
Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
Visualization: We plot the decision tree to visualize the decision-making process.

EVALUATION METRICS

Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives.
Classification Report: Provides precision, recall, F1-score, and support for each class.
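
To show how these numbers relate, here is a small sketch that derives accuracy, precision, recall, and F1-score for the positive class directly from the four confusion-matrix counts. The labels are made up for illustration (they are not the output of the model above), and `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
            'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Hypothetical true and predicted labels (1 = buys, 0 = does not buy)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted buyers really bought, while recall asks how many real buyers were found; F1 is their harmonic mean.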

Download the Jupyter Notebook file for Day 03 here.
