Machine Learning - Supervised Learning - Classification (I)
In this article we will use classification algorithms to predict the species a flower belongs to based on its petal and sepal widths and lengths.
We will use the Iris dataset, a multivariate dataset introduced by R.A. Fisher in 1936 that contains 50 data points for each of three different Iris species: Setosa, Versicolor and Virginica.
From left to right, Iris setosa (by Radomil, CC BY-SA 3.0), Iris versicolor (by Dlanglois, CC BY-SA 3.0), and Iris virginica (by Frank Mayfield, CC BY-SA 2.0).
Let's start off by describing the Iris dataset.
Describe Iris Dataset
# Install and load the required packages
install.packages("ggplot2");library("ggplot2");
install.packages("ggExtra");library("ggExtra");
install.packages("gclus");library("gclus");
install.packages("car");library("car");
install.packages("hexbin");library("hexbin");
install.packages("latticeExtra");library("latticeExtra");
install.packages("rgl");library("rgl");
install.packages("e1071");library("e1071");
install.packages("caret");library("caret");
install.packages("LiblineaR");library("LiblineaR");
install.packages("naivebayes");library("naivebayes");
install.packages("randomForest");library("randomForest");
# Display data frame summary
summary(iris)
# Display box plots
par(mfrow=c(1,4))
for(i in 1:4) {
boxplot(iris[,i], main=names(iris)[i], col=rainbow(4)[i])
}
# Display scatter plots and correlation between features
colors<-c("red","green3","blue")
panel.pearson <- function(x, y, ...) {
horizontal <- (par("usr")[1] + par("usr")[2]) / 2;
vertical <- (par("usr")[3] + par("usr")[4]) / 2;
text(horizontal, vertical, format(abs(cor(x,y)), digits=2))
}
pairs(iris[1:4], main = "Iris Data",
pch=21, bg=colors[unclass(iris$Species)],
upper.panel=panel.pearson, oma=c(4,4,6,12))
par(xpd=TRUE)
legend(0.85,0.6, as.vector(unique(iris$Species)),
fill=colors,cex = 0.75)
# Display overlayed density plots
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "density",
scales = list(x = list(relation="free"),
y = list(relation="free")),
pch = "|",
auto.key = list(columns = 3)
)
Pre-process Data
# Look for near-zero variance features
nearZeroVar(iris, saveMetrics= TRUE)
# Check whether our data contains empty values or not
> anyNA(iris)
[1] FALSE
# If there were missing values, KNN imputation could be used to fill them in. For instance:
# predict(preProcess(iris, method = c("knnImpute")),iris)
# Look for highly correlated features
> findCorrelation(cor(iris[,1:4]), cutoff = .9)
[1] 3
# Remove highly correlated features
iris2 <- iris[,-3]
# Look for features that can be expressed as a linear combination of others
> findLinearCombos(iris2[,1:3])
$linearCombos
list()
$remove
NULL
# Normalize quantitative features to the range [0, 1]
iris_norm<-predict(preProcess(iris,method=c("range")),iris)
See the preProcess function documentation for further options. For instance, the "nzv" and "corr" methods can be used to remove near-zero variance and highly correlated features in a single step, along with "range", "knnImpute", and other methods.
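A combined call might look like the following sketch (the pp and iris_pp names are illustrative, not part of the workflow above):
# Drop near-zero variance and highly correlated features, then rescale to [0, 1]
pp <- preProcess(iris, method = c("nzv", "corr", "range"), cutoff = .9)
iris_pp <- predict(pp, iris)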
Peek At Available Algorithms
# List available algorithms
names(getModelInfo())
Additional information on the available algorithms can be found in the caret documentation.
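If we want to check which tuning parameters a given method exposes, modelLookup can be queried; the "knn" method below is just an example:
# Show the tuning parameter(s) of a method, e.g. k for KNN
modelLookup("knn")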
Create Training And Validation Datasets
# create a list of 80% of the rows in the original dataset
val_index <- createDataPartition(iris_norm$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- iris_norm[-val_index,]
# use the remaining 80% of the data for training and testing the models
train <- iris_norm[val_index,]
Instantiate Algorithms And Train
# Validation parameters
ctrl <- trainControl(method="cv", number=10)
me <- "Accuracy"
# Classification parameters
fx <- Species~.
# Instantiate algorithms: Fisher's linear discriminant analysis (LDA)
f_lda<-train(fx,data=train,method="lda",metric=me,trControl=ctrl)
# Regularized Logistic Regression
f_lrl<-train(fx,data=train,method="regLogistic",metric=me,trControl=ctrl)
# Naive Bayes (NB)
f_nb<-train(fx,data=train,method="naive_bayes",metric=me,trControl=ctrl)
# Quadratic Discriminant Analysis
f_qda<-train(fx,data=train,method="qda",metric=me,trControl=ctrl)
# Support Vector Machines (SVM)
f_svm1<-train(fx,data=train,method="svmLinear",metric=me,trControl=ctrl)
f_svm2<-train(fx,data=train,method="svmRadial",metric=me,trControl=ctrl)
f_svm3<-train(fx,data=train,method="svmPoly",metric=me,trControl=ctrl)
# Learning vector quantization (LVQ)
f_lvq<-train(fx,data=train,method="lvq",metric=me,trControl=ctrl)
# Classification and regression trees (CART)
f_cart<-train(fx,data=train,method="rpart",metric=me,trControl=ctrl)
# K-nearest neighbours (KNN)
f_knn<-train(fx,data=train,method="knn",metric=me,trControl=ctrl,
tuneGrid = expand.grid(k=c(5,7,10,15,19,21))
)
# Random forest
f_rf<-train(fx,data=train,method="rf",metric=me,trControl=ctrl)
Show Accuracy Of The Algorithms
# Summarize accuracy
results <- resamples(list(
lda=f_lda, lrl=f_lrl, nb=f_nb, qda=f_qda,
svm1=f_svm1, svm2=f_svm2, svm3=f_svm3, lvq=f_lvq,
cart=f_cart, knn=f_knn, rf=f_rf))
summary(results)
# Plot accuracy
bwplot(results, metric = "Accuracy")
Display Model Information
# Print out model information
print(f_lda)
Predict Labels Of Validation Dataset
# Predict labels of validation dataset
predictions <- predict(f_lda, validation)
Generate Confusion Matrix
confusionMatrix(predictions, validation$Species)
The confusion matrix tells us the LDA model correctly labeled 29 out of the 30 validation data points, that is, 96.67% of the validation cases. The only misclassified data point was a virginica flower that was labeled as a versicolor.
And voilà! We have run several classification algorithms and selected the best one for our problem. Now we can use it to classify new data points, as in the sketch below.
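For example, a minimal sketch (the measurements and the new_flower name are made up for illustration) that re-applies the same range pre-processing before predicting with the LDA model:
# Normalize a new, unseen flower with the same range pre-processing and classify it
pp <- preProcess(iris, method = c("range"))
new_flower <- data.frame(Sepal.Length = 5.9, Sepal.Width = 3.0,
                         Petal.Length = 4.2, Petal.Width = 1.5)
predict(f_lda, predict(pp, new_flower))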
Try it, share your experience and provide feedback!