Supervised Machine Learning With Python: Classification. Gaussian Naïve Bayes
Shivek Maharaj
Data Analyst | Automation Architect | Business success doesn’t follow a blueprint, It follows me | AI Engineer
Implementing supervised learning for classification problems will be the main topic of the next few posts.
A classifier seeks to draw an inference from observed values. In classification problems the output is categorical, such as "Black" or "White", or "Teaching" or "Non-Teaching". To develop a classification model, we need a training dataset containing data points and their corresponding labels. For instance, if we want to determine whether an image contains a car, we would build a training dataset with the two classes "car" and "no car", and then train the model on that data. Classification methods are primarily used in applications such as spam detection and facial recognition.
For convenience purposes, the examples that we work with will utilize SciKit Learn’s built-in datasets.
The General Framework for Building a Classification Model in Python
The phases below make up the general framework that a Data Scientist or Machine Learning Engineer follows when building a classification model in Python with the SciKit Learn package.
PHASE 1: IMPORT THE NECESSARY PACKAGES
This would be the very first step in creating a Python classifier. One of the most popular machine learning modules for Python is the powerhouse package, Scikit-learn. We may import the package using the following Python command:
import sklearn
PHASE 2: SELECT AND LOAD THE DATASET INTO SYSTEM MEMORY
We may start utilizing the dataset for our machine learning model in this stage. For our first classifier, we will utilize the Breast Cancer Wisconsin Diagnostic Database, available to us via the sklearn.datasets package. The collection contains numerous details about breast cancer tumors, together with labels designating whether they are malignant or benign. The dataset contains information on 30 variables, or features, such as the radius of the tumor, its texture, smoothness, and area, and comprises 569 instances, i.e. records on 569 tumors. We may import and load the Scikit-learn breast cancer dataset by using the following Python commands:

from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
Now, before we move forward, I would like to explain a few concepts about our dataset variable. First, check the type of this object in Python:
print(type(dataset))
The output shows that the object is of type Bunch, e.g. <class 'sklearn.utils.Bunch'> (the exact module path may differ between scikit-learn versions).

A Bunch is essentially a dictionary in Python: it is scikit-learn's sklearn.utils.Bunch class, a dictionary subclass whose keys can also be accessed as attributes. It is unrelated to the third-party bunch package on PyPI, so nothing extra needs to be installed.
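To see this dictionary-like behaviour for yourself, here is a small sketch showing that dictionary-style and attribute-style access return the same data:

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# A Bunch supports both dictionary-style and attribute-style access.
print(dataset["target_names"])   # dictionary-style
print(dataset.target_names)      # attribute-style, same result

# The available keys mirror the attributes we will use below.
print(sorted(dataset.keys()))
```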
The dataset variable which is storing the contents of our Breast Cancer dataset is a very interesting object. This is because it is not a dataframe, nor is it an array. It is a Bunch, as we have mentioned before. However, this Bunch is essentially a dictionary and is storing very interesting information for us. Next, I would like you to see the information that is being stored by the dataset variable. Head to your Python IDE and print the contents of the dataset variable to the screen.
print(dataset)
The output is a lengthy, dictionary-like printout of the entire dataset, truncated here for brevity. As we can see, this object stores all our data in a dictionary-like structure.
The Bunch object, dataset, exposes a few important attributes. These attributes will enable us to effectively analyze and understand the dataset that we are working with. For example, to obtain insight into the target values of our dataset, you may access the .target_names attribute on the Bunch object (in our case, dataset). For example:
print(dataset.target_names)
The output shows the two class labels: ['malignant' 'benign'].
If you would like to view the target vector, you may access the .target attribute on the Bunch object as follows:
print(dataset.target)
If you would like to view the header row of the features matrix, you may access the .feature_names attribute on the Bunch object as follows:
print(dataset.feature_names)
The output is a NumPy array of the 30 feature names, beginning with 'mean radius', 'mean texture', and 'mean perimeter' (truncated in the printout).
If you would like to make the column names more readable, you may iterate through the list using a for loop:
for i in dataset.feature_names:
    print(i)
The output prints each of the 30 feature names on its own line.
Finally, if you would like to view the data itself, i.e., all the values contained in the features matrix, you may access the .data attribute on the Bunch object as follows:
print(dataset.data)
The output is a 569 × 30 NumPy array of feature values; note that the printed output is truncated.
Now that we understand how the data is structured inside the Bunch object, we can assign each significant collection of data inside the Bunch to its own variable. In other words, the following lines of code can be used to organize the data:
target_labels = dataset["target_names"]
feature_names = dataset["feature_names"]
target_vector = dataset["target"]
features_matrix = dataset["data"]
Thereafter, we may print the contents of each of the above variables:
print(target_labels)
print(feature_names)
print(target_vector)
print(features_matrix)
The output prints the contents of each of the four variables in turn.
Thus, based on the insights we have obtained by structuring our dataset, we can see that the first observation in the dataset is a record belonging to a malignant tumor, and its mean radius is 1.799e+01 (i.e. 17.99).
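We can confirm this reading programmatically. A small sketch, reusing the variable names defined above:

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
target_labels = dataset["target_names"]
target_vector = dataset["target"]
features_matrix = dataset["data"]

# Map the first observation's encoded label (0 or 1) back to its class name.
first_label = target_labels[target_vector[0]]
first_mean_radius = features_matrix[0][0]  # column 0 is 'mean radius'

print(first_label)         # the class of the first tumor record
print(first_mean_radius)   # its mean radius
```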
PHASE 3: ORGANIZING THE DATA INTO TESTING AND TRAINING SETS
A training set and a testing set will be created from our data in this stage. We must test our model on unseen data, hence it is crucial to divide the data into these sets. Sklearn provides a function named train_test_split() that divides the data into these sets. We begin with the following import:
from sklearn.model_selection import train_test_split
The command above imports the train_test_split() function from the sklearn.model_selection module. The line below will then divide the data into training and test data, using 40% of the data for testing and the remainder for training. Note that with random_state=None a different random split is produced on every run; pass a fixed integer (e.g. random_state=42) if you need reproducible results.

train, test, train_labels, test_labels = train_test_split(features_matrix, target_vector, test_size = 0.40, random_state=None)
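As a quick sanity check, here is a sketch that prints the sizes of the resulting splits (with a fixed random_state so the numbers are reproducible):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
features_matrix = dataset["data"]
target_vector = dataset["target"]

# 40% of the 569 records go to the test set, the rest to the training set.
train, test, train_labels, test_labels = train_test_split(
    features_matrix, target_vector, test_size=0.40, random_state=42
)

print(train.shape)  # (341, 30)
print(test.shape)   # (228, 30)
```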
PHASE 4: BUILDING THE CLASSIFICATION MODEL
We will construct our model in this stage. The Naïve Bayes technique will be used to generate the model. We begin by importing the Gaussian Naïve Bayes class with the line of code below:
from sklearn.naive_bayes import GaussianNB
Thereafter, we are required to instantiate an object of the GaussianNB class as follows:
algorithm = GaussianNB()
Next, we will proceed to train the model as follows:
model = algorithm.fit(train, train_labels)
And with that, our Gaussian Naïve Bayes classification model has been trained and now resides in our system memory. (Note that .fit() returns the estimator itself, so model and algorithm refer to the same fitted object.)
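Once fitted, the Gaussian Naïve Bayes estimator exposes the per-class statistics it learned; a small sketch (theta_ and class_count_ are fitted attributes of scikit-learn's GaussianNB):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

dataset = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    dataset["data"], dataset["target"], test_size=0.40, random_state=42
)

model = GaussianNB().fit(train, train_labels)

# theta_ holds the per-class mean of each feature: one row per class,
# one column per feature (2 x 30 for this dataset).
print(model.theta_.shape)

# class_count_ shows how many training samples fell into each class.
print(model.class_count_)
```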
PHASE 5: EVALUATING MODEL PERFORMANCE
By making predictions on our test data, we will assess the model in this phase, and then ascertain its accuracy. We'll utilize the .predict() method to make predictions. In Python this is simple and would be done as follows:
predictions = model.predict(test)
print(predictions)
The output is an array of 0s and 1s: the predicted class for each record in the test set.
To evaluate the performance of our model, we will utilize the accuracy_score() function available to us from the sklearn.metrics module. The series of 0s and 1s above represent the predicted values for the malignant (0) and benign (1) tumor classes, respectively. We can now determine the accuracy of our model by comparing the two arrays, test_labels and predictions.
from sklearn.metrics import accuracy_score
Finally, we proceed to calculate the accuracy of our Gaussian Na?ve Bayes model as follows:
print(accuracy_score(test_labels, predictions))
The output of the above line of code shows our model's accuracy, which in this run lies at approximately 92.105%. Note that because we did not fix random_state in train_test_split(), the exact figure will vary from run to run.
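Accuracy is only one view of performance. As an optional extension to the walkthrough above, scikit-learn's confusion_matrix and classification_report give a per-class breakdown:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, classification_report

dataset = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    dataset["data"], dataset["target"], test_size=0.40, random_state=42
)

model = GaussianNB().fit(train, train_labels)
predictions = model.predict(test)

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(test_labels, predictions))

# Per-class precision, recall, and F1, labelled with the class names.
print(classification_report(test_labels, predictions,
                            target_names=dataset["target_names"]))
```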
Thus, we have learned the general framework for developing a machine learning classification model in the Python programming language.

This concludes the tutorial, and I hope you have taken away something new about machine learning classification with Python.
I thank you for your time.