Machine Learning - Supervised Learning - Classification (I)
In this article we will use classification algorithms to predict the species a flower belongs to based on its petal and sepal widths and lengths.
We will use the Iris dataset, a multivariate dataset introduced by R.A. Fisher in 1936 that contains 50 data points for each of three different Iris species: Setosa, Versicolor and Virginica.
From left to right, Iris setosa (by Radomil, CC BY-SA 3.0), Iris versicolor (by Dlanglois, CC BY-SA 3.0), and Iris virginica (by Frank Mayfield, CC BY-SA 2.0).
Let's start off by describing the Iris dataset.
Describe Iris Dataset
# Install and load the required packages
install.packages("ggplot2");library("ggplot2");
install.packages("ggExtra");library("ggExtra");
install.packages("gclus");library("gclus");
install.packages("car");library("car");
install.packages("hexbin");library("hexbin");
install.packages("latticeExtra");library("latticeExtra");
install.packages("rgl");library("rgl");
install.packages("e1071");library("e1071");
install.packages("caret");library("caret");
install.packages("LiblineaR");library("LiblineaR");
install.packages("naivebayes");library("naivebayes");
install.packages("randomForest");library("randomForest");
# Display data frame summary
summary(iris)
# Display box plots
par(mfrow=c(1,4))
for(i in 1:4) {
boxplot(iris[,i], main=names(iris)[i], col=rainbow(4)[i])
}
# Display scatter plots and correlation between features
colors<-c("red","green3","blue")
panel.pearson <- function(x, y, ...) {
horizontal <- (par("usr")[1] + par("usr")[2]) / 2;
vertical <- (par("usr")[3] + par("usr")[4]) / 2;
text(horizontal, vertical, format(abs(cor(x,y)), digits=2))
}
pairs(iris[1:4], main = "Iris Data",
pch=21, bg=colors[unclass(iris$Species)],
upper.panel=panel.pearson, oma=c(4,4,6,12))
par(xpd=TRUE)
legend(0.85,0.6, as.vector(unique(iris$Species)),
fill=colors,cex = 0.75)
# Display overlayed density plots
featurePlot(x = iris[, 1:4],
y = iris$Species,
plot = "density",
scales = list(x = list(relation="free"),
y = list(relation="free")),
pch = "|",
auto.key = list(columns = 3)
)
Pre-process Data
# Look for near-zero variance features
nearZeroVar(iris, saveMetrics= TRUE)
# Check whether our data contains empty values or not
> anyNA(iris)
[1] FALSE
# If there were missing values, KNN imputation could be used to fill them in. For instance:
# predict(preProcess(iris, method = c("knnImpute")),iris)
# Look for highly correlated features
> findCorrelation(cor(iris[,1:4]), cutoff = .9)
[1] 3
# Remove highly correlated features
iris2 <- iris[,-3]
# Look for features that can be expressed as a linear combination of others
> findLinearCombos(iris2[,1:3])
$linearCombos
list()
$remove
NULL
# Normalize quantitative features to the range [0, 1]
iris_norm<-predict(preProcess(iris,method=c("range")),iris)
See the preProcess function documentation for further options. For instance, the "nzv" and "corr" methods can be used to remove near-zero variance and highly correlated features in a single step, along with "range", "knnImpute", and other methods.
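A combined call might look like the following sketch (the pp and iris_pp names are illustrative, not part of the workflow above):
# Drop near-zero variance and highly correlated features, then rescale to [0, 1]
pp <- preProcess(iris, method = c("nzv", "corr", "range"), cutoff = .9)
iris_pp <- predict(pp, iris)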
Peek At Available Algorithms
# List available algorithms
names(getModelInfo())
Additional information on the available algorithms can be found in the caret documentation.
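If we want to check which tuning parameters a given method exposes, modelLookup can be queried; the "knn" method below is just an example:
# Show the tuning parameter(s) of a method, e.g. k for KNN
modelLookup("knn")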
Create Training And Validation Datasets
# create a list of 80% of the rows in the original dataset
val_index <- createDataPartition(iris_norm$Species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- iris_norm[-val_index,]
# use the remaining 80% of the data for training and testing the models
train <- iris_norm[val_index,]
Instantiate Algorithms And Train
# Validation parameters
ctrl <- trainControl(method="cv", number=10)
me <- "Accuracy"
# Classification parameters
fx <- Species~.
# Instantiate algorithms: Fisher's linear discriminant analysis (LDA)
f_lda<-train(fx,data=train,method="lda",metric=me,trControl=ctrl)
# Regularized Logistic Regression
f_lrl<-train(fx,data=train,method="regLogistic",metric=me,trControl=ctrl)
# Naive Bayes (NB)
f_nb<-train(fx,data=train,method="naive_bayes",metric=me,trControl=ctrl)
# Quadratic Discriminant Analysis
f_qda<-train(fx,data=train,method="qda",metric=me,trControl=ctrl)
# Support Vector Machines (SVM)
f_svm1<-train(fx,data=train,method="svmLinear",metric=me,trControl=ctrl)
f_svm2<-train(fx,data=train,method="svmRadial",metric=me,trControl=ctrl)
f_svm3<-train(fx,data=train,method="svmPoly",metric=me,trControl=ctrl)
# Learning vector quantization (LVQ)
f_lvq<-train(fx,data=train,method="lvq",metric=me,trControl=ctrl)
# Classification and regression trees (CART)
f_cart<-train(fx,data=train,method="rpart",metric=me,trControl=ctrl)
# K-nearest neighbours (KNN)
f_knn<-train(fx,data=train,method="knn",metric=me,trControl=ctrl,
tuneGrid = expand.grid(k=c(5,7,10,15,19,21))
)
# Random forest
f_rf<-train(fx,data=train,method="rf",metric=me,trControl=ctrl)
Show Accuracy Of The Algorithms
# Summarize accuracy
results <- resamples(list(
lda=f_lda, lrl=f_lrl, nb=f_nb, qda=f_qda,
svm1=f_svm1, svm2=f_svm2, svm3=f_svm3, lvq=f_lvq,
cart=f_cart, knn=f_knn, rf=f_rf))
summary(results)
# Plot accuracy
bwplot(results, metric = "Accuracy")
Display Model Information
# Print out model information
print(f_lda)
Predict Labels Of Validation Dataset
# Predict labels of validation dataset
predictions <- predict(f_lda, validation)
Generate Confusion Matrix
confusionMatrix(predictions, validation$Species)
The confusion matrix tells us the LDA model correctly labeled 29 out of the 30 validation data points, that is, 96.67% of the validation cases. The only misclassified data point was a virginica flower that was labeled as a versicolor.
And voilà! We have run several classification algorithms and selected the best one for our problem. Now we can use it to classify new data points, as in the sketch below.
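For example, a minimal sketch (the measurements and the new_flower name are made up for illustration) that re-applies the same range pre-processing before predicting with the LDA model:
# Normalize a new, unseen flower with the same range pre-processing and classify it
pp <- preProcess(iris, method = c("range"))
new_flower <- data.frame(Sepal.Length = 5.9, Sepal.Width = 3.0,
                         Petal.Length = 4.2, Petal.Width = 1.5)
predict(f_lda, predict(pp, new_flower))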
Try it, share your experience and provide feedback!