Classification of Iris Dataset using Logistic Regression
Iris Data Introduction
The Iris dataset is a renowned and pivotal dataset that was first introduced by the statistician and biologist Sir R.A Fisher in 1936 which became a widely using dataset in machine learning and statistics. This dataset got its name because of an iris flower and segregate three species of Iris flowers based on their variation. This specific dataset consists with 150 samples of data based on 3 different types of iris flower. The data consists of 50 samples each from the 3 different types mentioned. ?
1.????? Iris Setosa
2.????? Iris Virginica
3.????? Iris Versicolor, are the mentioned three different type of iris flower.
With related to the each and every different variant of iris flower, we categorize them to three classes and there are 4 primary featured measured for each flower.
·???????? Sepal Length: The length of the flower's sepal.
·???????? Sepal Width: The width of the sepal.
·???????? Petal Length: The length of the flower's petal.
·???????? Petal Width: The width of the petal.
As per the dataset's structure, each entry is an observation made using the above four features measured in centimeters and a target or class variable. target variable takes on the values 0, 1, and 2 that represent the three different types of iris flower which made it’s a perfect choice for educational and illustrative purposes. In the context of machine learning and data mining, the Iris dataset is used to illustrate clustering, classification, and pattern recognition. Which helps k-means clustering to segregate the data into clusters and Classification algorithms like logistic regression, decision tree and Support vector machines also used it extensively.
?
Methods
When come to the classification of iris dataset, there are many algorithms can be seen, KNN or K-Nearest Neighbors, Decision trees and Logistic regression are some of them. In this classification Logistic Regression is used as the method. Logistic Regression is a statistical model primarily used for binary classification problems. It uses the properties of the logistic function to model the likelihood of a category of the variable as a function of the predictor variables.
Logistic Regression is primarily used for binary classification problems, but it can be extended to multiclass classification problems as well. The model will estimate probabilities by using a logistic function, and then used for class predictions. When come to the steps of the logistic regression model coefficient is estimated based on the training data using a method called maximum likelihood estimation which check for the more likely sets. While the prediction process is happening, for each and every instance it calculates the weighted sum of the input features and output the logistic result instead of linear regression which outputs the result directly. The output value also ranges in-between 0 and 1 which helps to determine the class.
When come to the advantages of logistic regression it helps to output the probabilities rather than class prediction. This helps to improve the confidence level made by the dataset, which proves that Logistic regression is a simple and powerful model which can be used to train on datasets with a large number of features.
In the application process of the Logistic regression to the iris dataset, it considers four variables measured from each sample, the lengths and the widths of the sepals and petals, to predict the class of iris species. Then the results output a number from 0 and 2, with each result representing one of the three classes of Iris species. Then it is converted into a class prediction, if greater than 1.5, we can classify the instance as the third class, if not its classified under first or second class.
领英推荐
?
Code
For the coding, python language is used.
For further reference I have uploaded the dataset along with the code: - https://github.com/minidu97/Iris-Dataset-Classification-Using-Logistic-Regression.git
?
Result Analysis
When come to the classification of the dataset of iris by logistic regression, it producing quite a high accuracy on both the training and testing datasets. The training data accuracy is approximately 79.51%, which means that the model correctly predicted the classes almost 80% of the time after assuming to the closest number. The test data accuracy is also higher, which is a positive result. It's around 89.29%, implying that the model was able to generalize well from the training data to unseen data, correctly predicting nearly 90% of test cases. This happens because the test data set has not been used or visible when the training process is going on.
The Decision Function Scores give insight look into the prediction it makes for each class. Each and every row is computed as the dot product of the input values and the corresponding coefficients so that the data point is classified based on which row or the class has the highest value. Based on all the results, can get a glimpse of an idea on the classification which helps to indicate the confident level of the class.
?
?