Learning Data Science with Kaggle's Titanic: Machine Learning from Disaster

Table of Contents:

  1. Setup Environment / Extract Data / Loading Data
  2. Data Wrangling
     2.1 Missing Data Imputation (1): Missing Age
     2.2 Feature Engineering (1): Title
     2.3 Missing Data Imputation (2): Embarked
     2.4 Missing Data Imputation (3): Fare
     2.5 Missing Data Imputation (4): Cabin
  3. Exploratory Analysis (ggplot2)
     3.1 Age vs Survived vs Sex
     3.2 Pclass vs Survived
     3.3 Fare vs Survived
  4. Exploratory Modeling (RandomForest)

1. Setup Environment / Extract Data / Loading Data

Setup Environment: In order to start our data analysis, we must first set up our environment. I'll be using the R programming language along with RStudio, a free, open-source IDE.

Packages/Libraries: To help with the analysis and visualization, we'll use the tidyverse package, a collection of commonly used packages. For this walkthrough, I'll mainly use dplyr for data manipulation and ggplot2 for graphics.

#Install the packages that will assist in the analysis and visualization
install.packages('tidyverse')
library('tidyverse')

Next, we'll extract and load our data into R data frames. The data comes as two CSV files: one for training and one for testing.

#Extract and Load our data into two data frames called test and train
test <- read_csv("~/Kaggle/Models/test.csv")
train <- read_csv("~/Kaggle/Models/train.csv")

2. Data Wrangling

Now that the data sets are loaded, we can begin by viewing our data. It's important to understand what values your data set contains and to check whether any strange or missing values are hidden in the table. We'll use colSums(is.na(...)), which gives the number of missing values in each column of a data set.

#Find out how many missing values are in our dataset
colSums(is.na(train))
colSums(is.na(test))

As you can see, we have missing values in the columns Age, Embarked, and Cabin within our Titanic training set and we have missing values in the columns Age, Cabin, and Fare in our Titanic test set. Also, we can see that our test set does not have the Survived column.

To fix this problem, let's combine these two tables into one so we can clean up these columns as a whole. To do that, we'll first have to create a new data frame from the test set and add a "Survived" variable, because you cannot row-bind two data sets with different columns.

#Create a new data frame of the test set and add in "Survived" variable
test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,])

#Combine data sets
combined <- rbind(train, test.survived)
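
As an aside, dplyr's bind_rows() can combine data frames whose columns don't match, filling the missing column (here, Survived in the test set) with NA. A minimal sketch of that alternative (the name combined_alt is mine; this is not the approach used in the rest of the walkthrough):

#Alternative: bind_rows() pads the missing Survived column in test with NA
combined_alt <- bind_rows(train, test)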

2.1 Missing Data Imputation (1): Missing Age

Let's examine our first set of missing values, in the Age column. We can get a better picture by summarizing the known ages grouped by Sex and Pclass.

#Not including the missing values, give me a summary of the mean, median,
#and standard deviation of Age, grouped by Pclass and Sex
na.omit(combined) %>%
  group_by(Pclass, Sex) %>%
  summarise(Mean = mean(Age), Median = median(Age), SD = sd(Age), Total=n())
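
It also helps to see where the missing ages actually sit. As a quick added check (not part of the original walkthrough), we can count the rows with a missing Age by Pclass and Sex:

#Count how many missing Age values fall into each Pclass/Sex group
combined %>%
  filter(is.na(Age)) %>%
  count(Pclass, Sex)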

From this table, it looks like there's a high standard deviation in the ages within each group. Using a simple group mean or median as a replacement could therefore give us poor estimates. Let's look at the data in more detail to see if there are further clues.

2.2 Feature Engineering (1): Title

A proxy variable is an easily measurable variable used in place of one that cannot be measured or is difficult to measure. Remember, if you don't include the intended variable in any form, your results can be badly biased. Including an imperfect proxy of a hard-to-measure variable is often better than leaving an important variable out entirely. So, if you can't include the intended variable, look for a proxy! In this case, we create a Title variable and use it as a proxy for Age and Gender.

#Create a new Title column by extracting the first word in Name that ends with "."
combined <- combined %>% mutate(Title = str_extract(Name, "[a-zA-Z]+\\."))
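
Before going further, a quick sanity check (an addition of mine, not in the original) is to tabulate the extracted titles and make sure the regular expression didn't miss anything obvious:

#List each extracted title and how often it appears
table(combined$Title)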

#Not including the missing values, give me a summary of the mean, median, and
#standard deviation of Age, grouped by Pclass, Sex, and Title, shown separately for males and females
na.omit(combined) %>%
  group_by(Pclass, Sex, Title) %>%
  summarise(Mean = mean(Age), Median = median(Age), SD = sd(Age), Total=n()) %>%
  filter(Sex == 'male')

na.omit(combined) %>%
  group_by(Pclass, Sex, Title) %>%
  summarise(Mean = mean(Age), Median = median(Age), SD = sd(Age), Total=n()) %>%
  filter(Sex == 'female')

As we can see here, passengers with the title "Master" can be treated as children or young boys, and the "Miss" title tends to represent a younger group of females. We can therefore simplify the title values down to four types: Mr, Master, Mrs, and Miss.

#Create a new column called "Title2" which simplifies the title values to four types
combined <- mutate(combined, Title2 = 
  ifelse((Sex=='male'), ifelse(Title != 'Master.', 'Mr', 'Master'), 
    ifelse(Title != 'Miss.', 'Mrs', 'Miss')))
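
If the nested ifelse() feels hard to read, the same mapping can be written with dplyr's case_when(), which reads top to bottom. This is just an equivalent sketch, not a change to the logic above:

#Equivalent Title2 mapping written with case_when() for readability
combined <- combined %>%
  mutate(Title2 = case_when(
    Sex == 'male' & Title == 'Master.' ~ 'Master',
    Sex == 'male'                      ~ 'Mr',
    Sex == 'female' & Title == 'Miss.' ~ 'Miss',
    TRUE                               ~ 'Mrs'
  ))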

Now that we have a feature that generalizes each passenger into a title, we can use it to impute the missing Age values by Title2.

#Create a temporary data frame that contains the mean values of each Title
missingAge <- combined %>%
  group_by(Title2) %>%
  summarize(meanAge = mean(na.omit(Age)))

View(missingAge)
#Using our missingAge data frame, we can join it into our combined data set
#and fill in the missing Age values with meanAge
combined <- combined %>%
  left_join(missingAge, by = c("Title2")) %>%
  mutate(Age = ifelse(is.na(Age), meanAge, Age)) %>%
  select(-meanAge)

2.3 Missing Data Imputation (2): Embarked

Looking back at our combined data set, we have two missing values in the Embarked column. Let's take a look and see how we can impute them.

#Let's see a table summary of the Embarked column
table(combined$Embarked)

#Now let's find out where these missing values are located in our dataset
which(is.na(combined$Embarked), arr.ind=TRUE)

#Replace those missing values with 'S', the most common port of embarkation
combined$Embarked[c(62,830)] <- 'S'

2.4 Missing Data Imputation (3): Fare

We're almost done with our imputation! If you scroll back up to the list of missing values hidden in our combined data set, you can see that we still have Fare and Cabin left. Since Fare only has one missing value, we can replace it with the median.

#Find the median Fare (excluding missing rows) and use it as a replacement for the missing value
na.omit(combined) %>% summarize(Median = median(Fare))

#Find where the missing value is located
which(is.na(combined$Fare), arr.ind=TRUE)

#The median turned out to be 57, so we'll use that value to replace it
combined$Fare[c(1044)] <- 57
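
A slightly more defensive version of the same step (a sketch, not the original code) avoids hard-coding the row number and the value 57. Note that it imputes the median of all non-missing fares, which can differ from the value above because na.omit() drops every row with any missing column (mostly rows missing Cabin):

#Same idea without hard-coding the row index or the median value
combined$Fare[is.na(combined$Fare)] <- median(combined$Fare, na.rm = TRUE)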

Let's take a final look at our combined data set and see if we still have any missing values to impute!

#List all the columns and the number of missing values remaining in each
colSums(is.na(combined))

2.5 Missing Data Imputation (4): Cabin

Notice that our Cabin column has 1014 missing values! That means roughly three-quarters of the Cabin values are missing. With such a large share missing, we can simply remove the column and leave it out of our prediction.
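
Before dropping the column, a quick check (an addition, not in the original) confirms that roughly three-quarters figure:

#Proportion of rows with a missing Cabin value (1014 of 1309, about 0.77)
mean(is.na(combined$Cabin))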

#Remove the Cabin column from our combined data set
combined <- combined %>% select(-Cabin)

3. Exploratory Analysis (ggplot2)

Now that our combined data set is cleaned and we've engineered our first feature, Title, we can now do some exploratory analysis! It's important to understand how our variables relate to one another, and we might find patterns in how they affect whether a passenger survived.

3.1 Age vs Survived vs Sex

Let's take a look at our two variables (Sex and Age) and see how they relate to our Survived variable.

#Age vs Survived vs Sex
ggplot(combined[1:891,], aes(Age, fill = factor(Survived))) + 
  geom_histogram(bins=30) + 
  xlab("Age") +
  ylab("Count") +
  facet_grid(.~Sex)+
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Age vs Sex vs Survived")

We can see that a much greater proportion of females survived compared to males, likely reflecting the practice of prioritizing women during the evacuation. Males between roughly 18 and 60 had a very high death rate, while males below 18 had a considerably better survival rate, probably because they were children.

3.2 Pclass vs Survived

Let's take a look at how the passengers in different classes affected their survival rate.

#Pclass vs Survived
ggplot(combined[1:891,], aes(Pclass, fill = factor(Survived))) +
  geom_bar(stat = "count") +
  xlab("Pclass") +
  ylab("Count") +
  scale_fill_discrete(name = "Survived") + 
  ggtitle("Pclass vs Survived")

From this graph, we can see that passengers in 1st class had well above a 50% survival rate, 2nd class had roughly a 50% survival rate, and 3rd class had well below a 50% survival rate. In general, wealthier passengers survived at higher rates.
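
To put rough numbers on those visual impressions, a row-wise proportion table over the training rows (an added check, not part of the original write-up) shows the survival rate within each class:

#Share of non-survivors/survivors within each class; each row sums to 1
prop.table(table(combined$Pclass[1:891], combined$Survived[1:891]), margin = 1)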

3.3 Fare vs Survived

Interestingly, we saw that passengers in the higher classes had a higher rate of survival. Let's take this a step further by graphing how passengers' fare prices relate to their survival rate.

#Fare vs Survived
ggplot(combined[1:891,], aes(Fare, fill = factor(Survived))) + 
  geom_histogram() + 
  xlab("Fare") +
  ylab("Count") +
  ggtitle("Fare vs Survived")

From this graph, there appears to be a strong relationship between wealth and survival. Passengers who paid less than about $50 for their fare had less than a 50% chance of survival, and as you move toward the right side of the graph the survival rate tends to rise, which further supports the idea that survival favored the wealthy.
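
One way to check that rough $50 cutoff (an added sketch; the fare bands are arbitrary choices of mine) is to bucket Fare and compute the survival rate within each band:

#Survival rate within a few arbitrary fare bands (training rows only)
combined[1:891,] %>%
  mutate(FareBand = cut(Fare, breaks = c(0, 50, 100, Inf), include.lowest = TRUE)) %>%
  group_by(FareBand) %>%
  summarise(SurvivalRate = mean(Survived == "1"), Passengers = n())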

4. Exploratory Modeling (RandomForest)

Now that we've finished our exploratory analysis, we can make the general observation that survival favored passengers who were:
(1) Female (2) Children (3) Wealthy

We can finally do our predictive modeling using Random Forest. But first, we'll have to convert the variables we want to use in our prediction into factors (categorical variables). We'll use Pclass, Sex, Fare, Embarked, Title2, and Age as our predictors.

#Make Sex, Embarked, Title2, and Pclass into factors (categorical variables)
combined$Sex  <- as.factor(combined$Sex)
combined$Embarked  <- as.factor(combined$Embarked)
combined$Title2  <- as.factor(combined$Title2)
combined$Pclass  <- as.factor(combined$Pclass)

#Split our combined data set back into the training set and test set
train <- combined[1:891,]
test <- combined[892:1309,]
#Load the randomForest package (install it first with install.packages('randomForest') if needed)
library('randomForest')

#Set a random seed so the results are reproducible
set.seed(1234)

#Use Pclass, Sex, Fare, Embarked, Title2, and Age to predict Survived
rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title2 + Age, data = train)

#Print out the model summary, including the estimated error rate
print(rf_model)

#Plot our variables in order of importance 
varImpPlot(rf_model, main = "RF_MODEL")

The estimated out-of-bag (OOB) error rate turned out to be 16.5%, meaning our model achieved about 83.5% accuracy on the out-of-bag samples drawn from the training data. The variable importance plot also shows that Title2, Fare, and Sex were the top three predictors. Let's see how our model performs on Kaggle's test set!

#Generate predictions for the test set with our trained model
prediction <- predict(rf_model, test)

# Save the solution to a data frame with two columns: PassengerId and Survived (prediction)
solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)

# Write the solution to file
write.csv(solution, file = 'rf_mod_Solution.csv', row.names = FALSE)

After submitting, our accuracy on Kaggle dropped by about 5%, meaning our model was overfitting the training set a little. I took out the Age variable, submitted the solution again, and the accuracy improved!
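
For reference, the refit without Age looks like the following (a sketch of the step just described; the file name is illustrative, and the exact leaderboard score depends on Kaggle's hidden labels):

#Refit the model without Age and write a new submission file
set.seed(1234)
rf_model2 <- randomForest(factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title2, data = train)
prediction2 <- predict(rf_model2, test)
solution2 <- data.frame(PassengerID = test$PassengerId, Survived = prediction2)
write.csv(solution2, file = 'rf_mod_Solution2.csv', row.names = FALSE)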

In conclusion, this was my first attempt at machine learning. I used some resources from Kaggle's discussion boards and YouTube videos to help me throughout the process. The biggest lesson I learned was the importance of not overfitting: be wary if your accuracy looks too high on your training set, because your training set might not be a true reflection of the real world!





