Day 7: k-Nearest Neighbors (k-NN)

K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for both classification and regression tasks. The main idea is to predict the value or class of a new sample based on the K closest samples (neighbors) in the training dataset.

For classification, the predicted class is the most common class among the K nearest neighbors. For regression, the predicted value is the average (or weighted average) of the values of the K nearest neighbors.
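
To make both prediction rules concrete, here is a minimal NumPy sketch of K-NN from scratch; the function names knn_classify and knn_regress and the toy arrays are invented purely for illustration and are not part of any library.

# A minimal, illustrative K-NN for classification (majority vote) and regression (average)
import numpy as np
from collections import Counter

def knn_classify(X_known, labels, x_new, k=3):
    distances = np.linalg.norm(X_known - x_new, axis=1)   # Euclidean distance to every known sample
    nearest = np.argsort(distances)[:k]                   # indices of the k closest samples
    return Counter(labels[nearest]).most_common(1)[0][0]  # most common class among them

def knn_regress(X_known, targets, x_new, k=3):
    distances = np.linalg.norm(X_known - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return targets[nearest].mean()                        # average target of the k closest samples

# Tiny made-up dataset: four 2-D points with class labels and numeric targets
X_demo = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_cls = np.array([0, 0, 1, 1])
y_reg = np.array([1.2, 1.4, 5.1, 5.6])

print(knn_classify(X_demo, y_cls, np.array([1.2, 1.5])))  # majority vote among the 3 nearest -> 0
print(knn_regress(X_demo, y_reg, np.array([1.2, 1.5])))   # average of the 3 nearest targets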

Key points:

  • Distance Metric: Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance (compared in the short sketch after this list).
  • Choosing K: The value of K is a crucial hyperparameter that needs to be chosen carefully. Smaller K values can lead to noise sensitivity, while larger K values can smooth out the decision boundary.
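
As a quick illustration of how these metrics differ, the sketch below computes each distance between two made-up feature vectors using scipy.spatial.distance (SciPy is installed alongside scikit-learn); the points a and b are arbitrary.

# Comparing common distance metrics on two hypothetical feature vectors
import numpy as np
from scipy.spatial import distance

a = np.array([5.1, 3.5])
b = np.array([6.2, 2.9])

print("Euclidean:", distance.euclidean(a, b))             # straight-line distance
print("Manhattan:", distance.cityblock(a, b))             # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(a, b, p=3))  # generalizes both (p=1 Manhattan, p=2 Euclidean)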

Implementation Example

Suppose we have a dataset that records features like sepal length and sepal width to classify the species of iris flowers.

# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Example data (Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2]  # Using sepal length and sepal width as features
y = iris.target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Creating and training the KNN model with k=5
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")

# Plotting the decision boundary
def plot_decision_boundary(X, y, model):
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)

    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
    plt.xlabel('Sepal Length')
    plt.ylabel('Sepal Width')
    plt.title('KNN Decision Boundary')
    plt.show()

plot_decision_boundary(X_test, y_test, model)

Explanation of the Code

  1. Libraries: We import the necessary libraries: numpy for numerical arrays, sklearn for the dataset, model, and evaluation metrics, and matplotlib and seaborn for plotting.
  2. Data Preparation: We use the Iris dataset and select sepal length and sepal width as features.
  3. Train-Test Split: We split the data into training and testing sets.
  4. Model Training: We create a KNeighborsClassifier model with k=5 and train it using the training data.
  5. Predictions: We use the trained model to predict the species of iris flowers for the test set.
  6. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
  7. Visualization: We plot the decision boundary to visualize how the KNN classifier separates the classes.

Evaluation Metrics

  • Accuracy: The proportion of correctly classified instances among the total instances.
  • Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives.
  • Classification Report: Provides precision, recall, F1 score, and support for each class; all of these can be derived from the confusion matrix, as sketched below.
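
To see how these metrics relate, the sketch below derives per-class precision, per-class recall, and overall accuracy directly from the confusion matrix; it assumes the conf_matrix variable (and the numpy import) from the example above is still in scope, and the derived variable names are my own.

# Deriving the headline metrics from the confusion matrix
# conf_matrix[i, j] = number of samples of true class i predicted as class j
tp = np.diag(conf_matrix)                           # correct predictions (true positives) per class
precision_per_class = tp / conf_matrix.sum(axis=0)  # column sums = all predictions of each class
recall_per_class = tp / conf_matrix.sum(axis=1)     # row sums = all true samples of each class
overall_accuracy = tp.sum() / conf_matrix.sum()     # correct predictions over all samples

print("Precision per class:", precision_per_class)
print("Recall per class:   ", recall_per_class)
print("Accuracy:           ", overall_accuracy)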

Decision Boundary

The decision boundary plot helps to visualize how the KNN classifier separates the different classes in the feature space. KNN decision boundaries can be quite complex, closely following the local structure of the training data rather than any simple linear separation.
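
One way to see this complexity, and how it ties back to the choice of K, is to reuse the plot_decision_boundary helper from the example above with different K values; the values 1 and 15 below are arbitrary illustrative choices, and the sketch assumes the earlier train/test split is still in scope.

# Smaller K -> more jagged, noise-sensitive boundary; larger K -> smoother boundary
for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    plot_decision_boundary(X_test, y_test, knn)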

KNN is intuitive and simple but can be computationally expensive, especially with large datasets, since it requires storing and searching through all training instances during prediction. The choice of K and the distance metric is critical to the model's performance.
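
Because the distance calculation is sensitive to feature scales and both K and the metric need tuning, a common pattern is to standardize the features and search over these hyperparameters with cross-validation. The sketch below shows one way to do this in scikit-learn; the parameter grid is purely illustrative, and algorithm="kd_tree" requests a tree-based neighbor search (instead of brute force) to speed up prediction on larger datasets. It assumes X_train and y_train from the example above are still in scope.

# Scale features, then tune K and the distance metric with cross-validation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", StandardScaler()),                        # put all features on a comparable scale
    ("knn", KNeighborsClassifier(algorithm="kd_tree")),  # tree-based neighbor search
])

param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters: ", search.best_params_)
print("Best CV accuracy:", search.best_score_)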
