DECISION TREES AND TITANIC DATASET
Giancarlo Ronci
Senior Data & Analytics Manager, Data Engineer, Business Intelligence and Data Warehouse at Soldo Ltd
#MachineLearning #DecisionTree #DataScience #Classification #RProgramming
Decision trees are machine learning algorithms that are widely used for both classification and regression. They work by repeatedly dividing a dataset into subsets based on a set of rules derived from the characteristics of the data.
Decision tree structure
A decision tree consists of:
- A root node: the starting point, containing the whole dataset.
- Internal (decision) nodes: points where the data is split according to a condition on one variable.
- Branches: the possible outcomes of each condition.
- Leaf (terminal) nodes: the end points, each carrying a predicted class (or value, for regression).
Example of operation (Classification)
Let's imagine that we want to classify whether or not a person will accept a credit card offer, using characteristics such as age, income, and the number of credit cards they own. The decision tree will follow a data-splitting process based on these characteristics. For example:
- First split: is the person's income below a certain threshold? If so, follow one branch; otherwise follow the other.
- Within each branch, further splits can test age or the number of cards already owned, until a terminal node is reached.
In the end, each path from the root to a leaf of the tree leads to a predicted class: in this case, whether the person will accept the credit card offer or not.
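To make the idea concrete, a single path through such a tree can be written as nested conditions. This is only an illustrative sketch: the function name predict_offer and all the thresholds (income 40000, age 30, three cards) are invented, not learned from data.

```r
# Hypothetical hand-written rules mimicking one small decision tree;
# every threshold here is made up purely for illustration.
predict_offer <- function(age, income, n_cards) {
  if (income < 40000) return("reject")            # first split: income
  if (age < 30 && n_cards >= 3) return("reject")  # second split: age, then cards owned
  "accept"                                        # leaf: all remaining cases
}

predict_offer(age = 25, income = 50000, n_cards = 1)  # "accept"
predict_offer(age = 40, income = 30000, n_cards = 0)  # "reject"
```

A real tree learns these thresholds automatically from the training data instead of having them hard-coded.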
How the tree decides on divisions (subdivision criteria)
Decision trees use several criteria to decide how to split the data at each node. Some of these criteria include:
- Gini impurity: measures how mixed the classes are in a node; a pure node has impurity 0.
- Entropy / information gain: chooses the split that most reduces the uncertainty about the class.
- Variance reduction: used for regression trees, where the target is numerical.
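As an illustration of one of these criteria, here is a minimal sketch of Gini impurity in R. The function name gini_impurity is our own helper, not part of any package:

```r
# Gini impurity of a vector of class labels: 1 - sum over classes of p_k^2,
# where p_k is the proportion of observations in class k.
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  1 - sum(p^2)
}

gini_impurity(c("yes", "yes", "yes", "yes"))  # 0   (pure node)
gini_impurity(c("yes", "yes", "no", "no"))    # 0.5 (maximally mixed binary node)
```

When growing the tree, a split is chosen so that the weighted average impurity of the resulting child nodes is as low as possible.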
Benefits of Decision Trees
- Easy to interpret and visualize: the rules can be read directly off the tree.
- Require little data preparation: no feature scaling or normalization is needed.
- Handle both numerical and categorical variables.
Drawbacks
- Prone to overfitting if the tree is grown too deep.
- Unstable: small changes in the data can produce a very different tree.
- Greedy splitting does not guarantee a globally optimal tree.
Improvements and variants
- Pruning: cutting back branches that do not improve generalization (controlled in rpart by the complexity parameter cp).
- Ensembles: random forests (bagging many trees) and gradient boosting combine many trees to reduce variance and improve accuracy.
Popular algorithms
- ID3 and C4.5: classic entropy-based algorithms.
- CART (Classification and Regression Trees): the algorithm implemented by R's rpart package.
Here is an example of a decision tree in R using a dataset, the famous "Titanic" dataset. This dataset contains information about the passengers of the Titanic and allows us to build a model to predict whether a passenger survived or not based on variables such as age, gender, ticket class, etc.
The Titanic Dataset
The Titanic dataset is one of the most famous and commonly used in the field of data science and machine learning. It is based on passenger data from the famous ocean liner Titanic, which sank in 1912 after hitting an iceberg. The dataset is often used for binary classification (survivor/non-survivor) tutorials.
Key features of the Titanic dataset:
- The training split (titanic_train) contains 891 passengers.
- The target is binary: Survived (0 = did not survive, 1 = survived).
- It mixes numerical and categorical variables and contains missing values (notably in Age and Embarked), so it requires some cleaning.
Main variables:
- Survived: survival indicator (0/1), the target variable.
- Pclass: ticket class (1, 2, 3).
- Sex: passenger's sex.
- Age: age in years.
- SibSp: number of siblings/spouses aboard.
- Parch: number of parents/children aboard.
- Fare: ticket fare.
- Embarked: port of embarkation (C, Q, S).
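Assuming the titanic package is installed, you can take a quick look at the raw data before modelling:

```r
library(titanic)        # provides the titanic_train / titanic_test data frames
data("titanic_train")

dim(titanic_train)              # 891 rows, 12 columns
table(titanic_train$Survived)   # counts of non-survivors (0) and survivors (1)
colSums(is.na(titanic_train))   # Age contains missing values
```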
You can get the Titanic dataset directly with R's titanic library. In this example, we use the rpart package to create a decision tree and predict the probability of survival of the Titanic's passengers.
R Code Sample
# Installation and loading of necessary packages
if(!require(titanic)) install.packages("titanic")
if(!require(rpart)) install.packages("rpart")
if(!require(rpart.plot)) install.packages("rpart.plot")
library(titanic) # Titanic dataset
library(rpart)       # Algorithm for decision trees
library(rpart.plot)  # Improved tree visualization
# Load the Titanic dataset (we will clean and simplify a copy of the training split)
data("titanic_train")
data <- titanic_train
# Data cleaning
data$Age[is.na(data$Age)] <- median(data$Age, na.rm = TRUE)  # Replace missing ages with the median
data$Embarked[data$Embarked == ""] <- "S"  # Missing ports are empty strings, not NA; replace with 'S' (the most common)
data$Survived <- factor(data$Survived)     # Convert the target variable into a factor (category)
data$Pclass   <- factor(data$Pclass)       # Convert the passenger class into a factor
data$Sex      <- factor(data$Sex)          # Convert sex into a factor
data$Embarked <- factor(data$Embarked)     # Convert the port into a factor (needed later to align levels)
# Select some useful columns for analysis
data <- data[, c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")]
# Split the data into training (70%) and test (30%)
set.seed(123)
index <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[index, ]
test_data <- data[-index, ]
# Build the decision tree
# rpart(formula, data, method)
model <- rpart(Survived ~ ., data = train_data, method = "class",
               control = rpart.control(cp = 0.01))  # Complexity parameter used for pruning
# Visualize the decision tree
print(model)
rpart.plot(model, type = 2, extra = 104, fallen.leaves = TRUE)
# Align the levels of the test set with those of the training set
test_data$Embarked <- factor(test_data$Embarked, levels = levels(train_data$Embarked))
# Predictions on the test set
predictions <- predict(model, test_data, type = "class")  # Make predictions on the test set
# Confusion matrix to evaluate performance
confusion_matrix <- table(test_data$Survived, predictions)
print(confusion_matrix)
Output: the confusion matrix printed here is analysed below.
# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Model accuracy:", accuracy, "\n")
Output: the printed accuracy is about 0.7985 (79.85%).
So, thanks to this R code, we are able to build a decision tree model on the training dataset. Now let's look at the output of the code and walk through it together.
The print(model) command that you have seen in the code presented above, applied to a model created with the rpart() function (which constructs decision trees), returns a textual description of the decision tree model. This output shows the nodes of the tree and the splits made on each variable.
Typical interpretation of the print(model) output:
The output begins with a legend line, node), split, n, loss, yval, (yprob), followed by one line per node. The main fields are:
- node): the node number (node 1 is the root; the children of node i are nodes 2i and 2i+1).
- split: the splitting rule applied at that node (e.g. Sex = male).
- n: the number of training observations that reach the node.
- loss: the number of observations misclassified at the node.
- yval: the class predicted at the node.
- (yprob): the class probabilities at the node.
How to read the tree:
Start at the root and, at each node, follow the branch whose condition is satisfied. Lines marked with an asterisk (*) are terminal nodes (leaves), where the final prediction is made.
In summary:
The output shows the splitting rules created by the decision tree, indicating how the independent variables (e.g. Sex, Pclass, Age) are used to predict the dependent variable (Survived). Each node represents a split of the data, and the leaves (terminal nodes) indicate the predicted classes for the various subsets of data.
The output of this example also includes the confusion matrix, which shows how many times the model classified correctly and how many times it predicted the wrong class.
This confusion matrix was generated by the code below:
# Confusion matrix to evaluate performance
confusion_matrix <- table(test_data$Survived, predictions)
print(confusion_matrix)
We can note that:
1) There are 146 + 68 = 214 predictions that match reality.
2) There are 22 + 32 = 54 predictions that are wrong.
From these values we can say that the accuracy of the model is 214 / 268 ≈ 79.85%.
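As a quick check, the accuracy can be recomputed by hand from the four counts above:

```r
correct <- 146 + 68   # diagonal of the confusion matrix (correct predictions)
wrong   <- 22 + 32    # off-diagonal (wrong predictions)
accuracy <- correct / (correct + wrong)
round(accuracy * 100, 2)  # 79.85
```

This matches the value printed by the cat() call in the code above.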