A Walkthrough of a Machine Learning Task with Scikit-Learn

In this article, we'll walk through a common task for data scientists: training a machine learning model to classify handwritten digits. We'll be using the digits dataset, a collection of 8x8 images of digits that comes pre-loaded with the Scikit-learn library, and the Logistic Regression model for our classification task.

Loading and Understanding the Data

The first step in any machine learning task is to load and understand the data. We'll use Scikit-learn's load_digits function to load the digits dataset.

from sklearn.datasets import load_digits
digits = load_digits()

The digits dataset consists of 1,797 8x8 images. Each image is represented as a flat array of 64 pixel values, and each pixel value is a grayscale intensity between 0 and 16. The target vector (i.e., the labels) consists of the digit each image represents (from 0 to 9).
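The description above can be checked directly on the loaded object. This is a minimal sketch that only inspects attributes of the dataset returned by load_digits; nothing in it is specific to this article beyond the variable name digits.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Flattened feature matrix: one row per image, one column per pixel.
print(digits.data.shape)    # (1797, 64)

# The same pixels kept as 8x8 grids, handy for plotting a single digit.
print(digits.images.shape)  # (1797, 8, 8)

# Range of the grayscale intensities.
print(digits.data.min(), digits.data.max())

# The distinct labels in the target vector.
print(sorted(set(digits.target.tolist())))
```

Row i of digits.data is digits.images[i] flattened, so either view can be used depending on whether a model or a plot needs the pixels.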

Preprocessing the Data

Once we've loaded the data, the next step is to preprocess it. For this dataset, the only preprocessing step we need is to standardize the feature matrix so that it has a mean of 0 and a standard deviation of 1. Standardizing the data can help improve the performance of many machine learning models.

from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(digits.data)
y = digits.target
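Standardizing replaces each pixel value x with (x - mean) / std, computed per column. The sketch below checks the result numerically. It assumes that some pixel columns may be constant across all images (for example, blank border pixels); StandardScaler leaves such columns at zero instead of dividing by zero, so they are excluded from the unit-variance check.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)

col_means = X.mean(axis=0)
col_stds = X.std(axis=0)

# Every column should now be centred on zero.
print(np.allclose(col_means, 0.0))

# Columns that never vary stay at zero; all others should have std 1.
constant = np.isclose(col_stds, 0.0)
print(np.allclose(col_stds[~constant], 1.0))
print(int(constant.sum()), "constant pixel column(s)")
```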

We'll also split the data into a training set and a test set. The training set will be used to train the model, and the test set will be used to evaluate its performance.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
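The code above scales the full dataset before splitting it. A common alternative, shown here only as a sketch and not as the method used in this article, is to split the raw pixels first and fit the scaler on the training rows alone, so that no statistics from the test set influence preprocessing. The names X_train_scaled and X_test_scaled are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()

# Split the raw pixel values first.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (1437, 64) (360, 64)

# Learn the mean and standard deviation from the training rows only...
scaler = StandardScaler().fit(X_train)

# ...and apply the same transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The training set is exactly centred; the test set is only approximately so.
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))
```

Because the accuracy reported in this article comes from scaling before splitting, numbers obtained with this variant may differ slightly.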

Building and Training the Model

Now that our data is prepared, we can build and train our model. We'll use a Logistic Regression model, which is a simple yet effective model for classification tasks.

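from sklearn.linear_model import LogisticRegression
# With the 'saga' solver and more than two classes, the model is multinomial by default,
# so the multi_class argument (deprecated in recent scikit-learn releases) is omitted.
model = LogisticRegression(solver='saga', max_iter=1000)
model.fit(X_train, y_train)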
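Once fitted, the model can be inspected to see what it learned. The sketch below repeats the setup so that it runs on its own, and it assumes a recent scikit-learn release in which the multinomial formulation is the default for multiclass targets. It looks at the shape of the weight matrix and at the class probabilities for a few test images.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.2, random_state=42
)

model = LogisticRegression(solver="saga", max_iter=1000)
model.fit(X_train, y_train)

# One row of 64 pixel weights per digit class.
print(model.classes_)
print(model.coef_.shape)  # (10, 64)

# Class probabilities for the first three test images; each row sums to 1.
probabilities = model.predict_proba(X_test[:3])
print(probabilities.sum(axis=1))

# The predicted label is the class with the highest probability.
print(probabilities.argmax(axis=1), model.predict(X_test[:3]))
```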

Evaluating the Model

After the model has been trained, we need to evaluate its performance. We'll use the test set for this purpose. The model's predictions on the test set can be compared to the actual labels to compute metrics such as accuracy.

from sklearn.metrics import accuracy_score, confusion_matrix
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
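Accuracy is simply the fraction of predictions that equal the true label. The toy example below uses made-up label arrays (y_true and y_hat are hypothetical, not taken from the digits experiment) to show that accuracy_score matches the manual calculation.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for five images; one of them is misclassified.
y_true = np.array([3, 8, 8, 1, 5])
y_hat = np.array([3, 8, 3, 1, 5])

# Fraction of positions where prediction and truth agree: 4 out of 5.
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)

print(manual_accuracy)   # 0.8
print(library_accuracy)  # 0.8
```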

In this case, our model achieved an accuracy of approximately 97.22%, which corresponds to 350 of the 360 test images being classified correctly.

To get more insight into the model's performance, we can also look at the confusion matrix, which shows the number of times each class was predicted for each actual class.

confusion_mat = confusion_matrix(y_test, y_pred)

The confusion matrix for our model shows that it has performed quite well: only 10 of the 360 test images fall outside the diagonal.

Visualizing the Results

Finally, we can visualize the results of our analysis. One useful visualization is a plot of the confusion matrix, which can help us better understand the performance of our model.

import matplotlib.pyplot as plt
plt.imshow(confusion_mat, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix for Logistic Regression on Digits dataset')
plt.xticks(range(10), rotation=45)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

This heatmap shows the number of times each class was predicted for each actual class. The color intensity in the plot is indicative of the count, with darker shades representing higher counts. This basic version does not print the counts inside the cells; the annotated version further down does.

This walkthrough has shown a simple example of a machine learning task using Scikit-learn. Real-world projects can be much more complex and may involve additional steps such as feature engineering, hyperparameter tuning, model selection, and more. However, the basic steps - loading the data, preprocessing the data, building and training the model, evaluating the model, and visualizing the results - remain the same.

# Complete script combining the steps above
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load digits dataset
digits = load_digits()
# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)
# The target vector
y = digits.target
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the model (multinomial is the default for multiclass targets with 'saga')
model = LogisticRegression(solver='saga', max_iter=1000)
# Train the model
model.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = model.predict(X_test)
# Compute the accuracy and the confusion matrix
accuracy = accuracy_score(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
accuracy, confusion_mat


(0.9722222222222222,
 array([[33,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 0, 28,  0,  0,  0,  0,  0,  0,  0,  0],
        [ 0,  0, 33,  0,  0,  0,  0,  0,  0,  0],
        [ 0,  0,  0, 33,  0,  1,  0,  0,  0,  0],
        [ 0,  1,  0,  0, 45,  0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0, 44,  1,  0,  0,  2],
        [ 0,  0,  0,  0,  0,  1, 34,  0,  0,  0],
        [ 0,  0,  0,  0,  0,  0,  0, 33,  0,  1],
        [ 0,  0,  0,  0,  0,  1,  0,  0, 29,  0],
        [ 0,  0,  0,  1,  0,  0,  0,  0,  1, 38]]))

We successfully trained a Logistic Regression model on the digits dataset and evaluated its performance on the test set. The model achieved an accuracy of approximately 97.22%. The confusion matrix is a useful tool for understanding the performance of the model across different classes.

Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class. The diagonal elements represent the number of points for which the predicted label is equal to the true label, while anything off the diagonal indicates misclassified instances. As we can see, our model has performed quite well, with only a few misclassifications. Let's visualize the confusion matrix to make it more interpretable.
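Before plotting, the row and column reading described above can be checked numerically. The sketch below hard-codes the confusion matrix reported in this article and derives the overall accuracy plus per-class recall and precision from it; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix as reported above (rows: true digit, columns: predicted digit).
confusion_mat = np.array([
    [33,  0,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0, 28,  0,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0, 33,  0,  0,  0,  0,  0,  0,  0],
    [ 0,  0,  0, 33,  0,  1,  0,  0,  0,  0],
    [ 0,  1,  0,  0, 45,  0,  0,  0,  0,  0],
    [ 0,  0,  0,  0,  0, 44,  1,  0,  0,  2],
    [ 0,  0,  0,  0,  0,  1, 34,  0,  0,  0],
    [ 0,  0,  0,  0,  0,  0,  0, 33,  0,  1],
    [ 0,  0,  0,  0,  0,  1,  0,  0, 29,  0],
    [ 0,  0,  0,  1,  0,  0,  0,  0,  1, 38],
])

correct = np.trace(confusion_mat)  # diagonal: predicted label equals true label
total = confusion_mat.sum()        # every test image appears exactly once
accuracy = correct / total
print(correct, total, round(accuracy, 4))  # 350 360 0.9722

# Recall: share of each true class (row) that was predicted correctly.
recall = np.diag(confusion_mat) / confusion_mat.sum(axis=1)
# Precision: share of each predicted class (column) that was correct.
precision = np.diag(confusion_mat) / confusion_mat.sum(axis=0)

print("digit with lowest recall:", int(recall.argmin()))  # 5
```

The digit 5 has the lowest recall here: 44 of its 47 test images sit on the diagonal, with the remaining three predicted as 6 or 9.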

# Plotting the confusion matrix
import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 7))
plt.imshow(confusion_mat, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix for Logistic Regression on Digits dataset')
tick_marks = np.arange(10)
plt.xticks(tick_marks, range(10), rotation=45)
plt.yticks(tick_marks, range(10))
# Loop over data dimensions and create text annotations (row i = true label, column j = predicted label).
for i, j in np.ndindex(confusion_mat.shape):
    plt.text(j, i, confusion_mat[i, j], ha="center", va="center", color="red")
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

Here's the visual representation of the confusion matrix. Each cell counts how many test images of the actual class on the y-axis were assigned to the predicted class on the x-axis. For example, the cell at coordinates (3, 3) shows the number of images that were actually of class 3 and were also predicted as class 3 by the model.


The color intensity in the plot is indicative of the count, with darker shades representing higher counts. The numbers in the cells are the exact counts.

As we can see, the model has performed quite well: the darkest cells lie on the diagonal, which means that most predictions match the true labels.

This is one form of visualizing the performance of a machine learning model. The choice of visualizations will depend on the nature of the data and the specific needs of the analysis. Other forms of visualization could include ROC curves, precision-recall curves, etc.
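Before reaching for precision-recall curves, the per-class precision and recall can be tabulated as plain numbers. The sketch below uses classification_report on the same split and model settings as this walkthrough; it is one possible summary, not a replacement for the plots mentioned above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.2, random_state=42
)
model = LogisticRegression(solver="saga", max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Text table with precision, recall, F1 and support for each digit.
print(classification_report(y_test, y_pred, digits=3))

# The same numbers as a nested dict, convenient for further processing.
report = classification_report(y_test, y_pred, output_dict=True)
print(report["5"]["recall"])
```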

If this were a real-world project, the next steps would involve further refining the model (e.g., tuning hyperparameters, trying different models) based on the results of this initial analysis. The goal would be to maximize the performance of the model, subject to the constraints of the project (e.g., computational resources, time).
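As one illustration of hyperparameter tuning, the sketch below searches over the regularization strength C with cross-validation. The grid values, the number of folds, and the use of the default solver are assumptions chosen to keep the example quick; they are not settings from this article.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline so each fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.4f}")
print(f"held-out accuracy: {search.score(X_test, y_test):.4f}")
```

The test set is touched only once, after the search has picked a value for C, so the final number remains an honest estimate of performance on unseen images.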



