DECISION TREES AND TITANIC DATASET

#MachineLearning #DecisionTree #DataScience #Classification #RProgramming

Decision trees are machine learning algorithms that are widely used for both classification and regression. They work by repeatedly dividing a dataset into subsets based on a set of rules derived from the characteristics of the data.

Decision tree structure

A decision tree consists of:

  • Root node: The starting point of the tree, where the first split is made based on a variable.
  • Internal nodes: Each internal node represents a condition on a feature; several branches depart from these nodes.
  • Branches: Represent the possible values or outcomes of a feature's condition and lead to the next nodes.
  • Leaf nodes: The terminal nodes that are not split further. Each leaf represents a prediction or a final class.

Example of operation (Classification)

Let's imagine that we want to classify whether or not a person will accept a credit card offer, using characteristics such as age, income, and the number of credit cards they own. The decision tree follows a data-splitting process based on these characteristics. For example:

  1. At the root node, the algorithm might decide to split the data by age: If the age is greater than 30, a branch is followed. If it is less than or equal to 30 years, another branch is followed.
  2. Then, on each branch, further divisions can be made, for example, by considering income.

In the end, each path from the root to a leaf of the tree leads to a predicted class: in this case, whether the person will accept the credit card offer or not.
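
To make the idea concrete, the logic of such a tree could be written by hand as nested conditions. This is a purely hypothetical sketch (the thresholds and the income rule are invented for illustration, not learned from data):

# Hypothetical hand-written version of a small tree: every path ends in a predicted class
predict_offer <- function(age, income) {
  if (age > 30) {
    if (income > 50000) "accept" else "decline"   # internal node on income
  } else {
    "decline"                                     # leaf reached directly
  }
}

predict_offer(age = 45, income = 60000)   # "accept"

A decision tree algorithm learns exactly this kind of structure automatically by choosing the splits from the data.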

How the tree decides on splits (splitting criteria)

Decision trees use several criteria to decide how to split the data at each node. Some of these criteria include:

  • Gini index: A measure of the impurity of a node. A node is pure when all of its instances belong to the same class. The algorithm tries to minimize the Gini index at each split.
  • Entropy and information gain: Entropy measures the amount of uncertainty in a set of data and is zero when all the data belong to the same class. Information gain is the reduction in entropy obtained by splitting the data (both measures are sketched in code right after this list).
  • Variance reduction (for regression): Measures how much a split reduces the variance of the target values in the child nodes.
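
As a rough illustration of the first two criteria, the impurity measures can be computed by hand in R on a simple vector of class labels (this is just a sketch, not how rpart computes them internally):

# Minimal sketch: Gini impurity and entropy of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)   # proportion of each class
  1 - sum(p^2)
}

entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

y <- c("yes", "yes", "no", "no", "no")
gini(y)      # 1 - (0.4^2 + 0.6^2) = 0.48
entropy(y)   # -(0.4*log2(0.4) + 0.6*log2(0.6)) ≈ 0.971

The information gain of a candidate split is then the entropy of the parent node minus the weighted average entropy of the child nodes.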

Benefits of Decision Trees

  1. Simple to interpret: The logic behind a decision tree is intuitive and easy to understand.
  2. No need to normalize data: Decision trees don't require data to be at similar scales or normalized.
  3. They handle mixed data well: they can work with both categorical data and numerical data.

Drawbacks

  1. Overfitting: Decision trees tend to create very complex models that fit the training data well but may generalize poorly to new data. This happens when the tree grows too deep and starts capturing the noise in the data.
  2. Instability: Small changes in the data can lead to the creation of very different trees.

Improvements and variants

  • Pruning: To avoid overfitting, a pruning technique can be used to remove unnecessary branches from the tree.
  • Random Forest: An improvement on the single decision tree is the Random Forest algorithm, which builds many decision trees on different subsamples of the data and combines their results to obtain a more robust final prediction (a minimal sketch follows this list).
  • Gradient Boosting: Another advanced variant is Gradient Boosting, which sequentially builds decision trees, each trying to correct the errors of the previous one.
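
For comparison, a random forest on the same kind of data would look roughly like the sketch below. It assumes the randomForest package and a cleaned data frame train_data with a factor target Survived, as prepared later in this article; it is not part of the rpart example that follows:

# Minimal sketch of the Random Forest variant
if (!require(randomForest)) install.packages("randomForest")
library(randomForest)

train_rf <- train_data
train_rf$Sex      <- factor(train_rf$Sex)        # ensure categorical predictors are factors, not character columns
train_rf$Embarked <- factor(train_rf$Embarked)

rf_model <- randomForest(Survived ~ ., data = train_rf, ntree = 500)
print(rf_model)                                  # out-of-bag error estimate and confusion matrix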

Popular algorithms

  • ID3 (Iterative Dichotomiser 3): one of the best-known algorithms for creating decision trees using information gain.
  • CART (Classification And Regression Trees): Another popular algorithm that uses the Gini index for classification and variance for regression.

Here is an example of a decision tree in R using the famous "Titanic" dataset. This dataset contains information about the passengers of the Titanic and allows us to build a model to predict whether a passenger survived or not, based on variables such as age, gender, ticket class, etc. The rpart package used below implements the CART approach described above.

The Titanic dataset

The Titanic dataset is one of the most famous and commonly used datasets in data science and machine learning. It contains data on the passengers of the famous ocean liner Titanic, which sank in 1912 after hitting an iceberg. The dataset is often used in tutorials on binary classification (survivor / non-survivor).

Key features of the Titanic dataset:

  • Objective: To predict the survival of a passenger (target variable) based on other passenger characteristics.

Main variables:

  1. PassengerId: Unique identifier for each passenger.
  2. Survived: Target variable. Indicates whether the passenger survived (1) or not (0).
  3. Pclass: Passenger's class (1 = first class, 2 = second class, 3 = third class).
  4. Name: Name of the passenger.
  5. Sex: Gender of the passenger (male/female).
  6. Age: Age of the passenger.
  7. SibSp: Number of siblings or spouses on board.
  8. Parch: Number of parents or children on board.
  9. Ticket: Ticket number.
  10. Fare: Fare paid for the ticket.
  11. Cabin: Cabin number (very often missing).
  12. Embarked: Port of embarkation of the passenger (C = Cherbourg, Q = Queenstown, S = Southampton).

You can get the Titanic dataset directly from R's titanic package. In this example, we use the rpart package to build a decision tree and predict the survival of the Titanic's passengers.
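
For a quick first look at the raw data (assuming the titanic package is already installed), you can inspect it before modelling:

library(titanic)
data("titanic_train")
str(titanic_train)           # 891 obs. of 12 variables, including those listed above
summary(titanic_train$Age)   # shows how many ages are missing before cleaning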

R Code Sample

# Installation and loading of necessary packages
if (!require(titanic)) install.packages("titanic")
if (!require(rpart)) install.packages("rpart")
if (!require(rpart.plot)) install.packages("rpart.plot")

library(titanic)      # Titanic dataset
library(rpart)        # Algorithm for decision trees
library(rpart.plot)   # Improved tree visualization

# Load the Titanic dataset (the original dataset is slightly modified to simplify the example)
data("titanic_train")
data <- titanic_train

# Data cleaning
data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE)       # Replace missing ages with the median
data$Embarked[is.na(data$Embarked) | data$Embarked == ""] <- "S"  # Replace missing ports of embarkation (stored as empty strings) with 'S'
data$Survived <- factor(data$Survived)   # Convert the target variable into a factor (category)
data$Pclass   <- factor(data$Pclass)     # Convert the passenger class into a factor
data$Sex      <- factor(data$Sex)        # Convert gender into a factor
data$Embarked <- factor(data$Embarked)   # Convert the port of embarkation into a factor

# Select the columns that are useful for the analysis
data <- data[, c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")]

# Split the data into training (70%) and test (30%) sets
set.seed(123)
index <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[index, ]
test_data  <- data[-index, ]

# Build the decision tree
# rpart(formula, data, method)
model <- rpart(Survived ~ ., data = train_data, method = "class",
               control = rpart.control(cp = 0.01))   # cp is the complexity parameter that controls pruning

# Print and visualize the decision tree
print(model)
rpart.plot(model, type = 2, extra = 104, fallen.leaves = TRUE)
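
The cp = 0.01 used above is also rpart's default complexity parameter. If you want to examine pruning explicitly (as mentioned in the section on improvements), a rough sketch, not part of the original example, could be:

printcp(model)                               # cross-validated error for each value of cp
best_cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(model, cp = best_cp)   # keep only the splits worth their complexity
rpart.plot(pruned_model)
model$variable.importance                    # which variables contributed most to the splits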

# Align the factor levels of the test set with those of the training set
test_data$Embarked <- factor(test_data$Embarked, levels = levels(train_data$Embarked))

# Predictions on the test set
predictions <- predict(model, test_data, type = "class")   # Predicted class for each test passenger

# Confusion matrix to evaluate performance
confusion_matrix <- table(test_data$Survived, predictions)
print(confusion_matrix)

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Model accuracy:", accuracy, "\n")

With this R code we are able to build a decision tree model from the training dataset. Let's now look at the output of the code and walk through it together.

The print(model) command seen in the code above, applied to a model created with the rpart() function (which builds decision trees), returns a textual description of the decision tree. This output shows information about the nodes of the tree and the splits made on the variables.

Typical interpretation of print(model) output:

The output of print(model) for a decision tree might appear in a format similar to the following:
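
A rough sketch of this layout, using the root-node figures discussed below (the remaining nodes and their probabilities are abbreviated, and the exact numbers depend on the random training split):

n= 623

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 623 233 0 (0.62710753 0.37289247)
  2) Sex=male 404 109 0 (...)
  ...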


Detail of the main fields:

  1. n= 623: Indicates the total number of observations (rows) used to build the model. In this case, the tree was built on 623 observations (the passengers in the training portion of the Titanic dataset).
  2. node), split, n, loss, yval, (yprob): The header describing the fields printed for each node:
     • node: The number of the node within the decision tree.
     • split: The condition or rule used to split the data at that node. For example, Sex=male indicates that the node was split on the Sex variable.
     • n: The number of observations that fall into that particular node of the tree.
     • loss: The number of misclassified observations in that node.
     • yval: The predicted class for that node. In a binary classification problem such as Titanic, the classes are coded 0 (did not survive, "No") and 1 (survived, "Yes").
     • yprob: The probability associated with each class, shown as a pair of values in parentheses: the first value is the probability of belonging to class "No" and the second to class "Yes". For example, (0.62710753 0.37289247) indicates a 62.7% chance of belonging to class "No" and a 37.3% chance of belonging to class "Yes".
  3. * (asterisk): Denotes that the node is terminal, i.e. it is not split further. No further splitting of the data takes place in this node, which therefore represents a leaf of the tree.

How to read the tree:

  1. The root node is the starting node of the tree: it contains all 623 observations, of which 233 are misclassified with respect to the predicted value "No" (non-surviving). The yval value indicates that most observations belong to the "No" class, with a 62.7% probability of not surviving.
  2. Node 2 represents a split based on sex (male): the 404 male observations are split further, and 109 of them are misclassified by the "No" label (they actually survived). The probability of not surviving for males is about 73%.
  3. Other nodes continue to split the data based on variables such as Pclass (passenger class), Age and Fare (fare paid), until the terminal nodes marked with an asterisk are reached.

In summary:

The output shows the splitting rules created by the decision tree, indicating how the independent variables (e.g. Sex, Pclass, Age) are used to predict the dependent variable (Survived). Each node represents a split of the data, and the leaves (terminal nodes) indicate the predicted classes for the various subsets of the data.

The output of this example also includes the confusion matrix, which shows how many times the model assigned the correct class and how many times it chose the wrong one. The confusion matrix is generated by the code below:

# Confusion matrix to evaluate performance
confusion_matrix <- table(test_data$Survived, predictions)
print(confusion_matrix)

We can note that:

1) There are 146 + 68 = 214 predictions that match the real outcome.

2) There are 22 + 32 = 54 predictions that do not match the real outcome.

From these values, the accuracy of the model is (146 + 68) / (146 + 68 + 22 + 32) = 214 / 268 ≈ 79.85%.
