Day 7: k-Nearest Neighbors (k-NN)

K-Nearest Neighbors (KNN) is a simple, instance-based learning algorithm used for both classification and regression tasks. The main idea is to predict the value or class of a new sample based on the K closest samples (neighbors) in the training dataset.

For classification, the predicted class is the most common class among the K nearest neighbors. For regression, the predicted value is the average (or weighted average) of the values of the K nearest neighbors.
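
To make both prediction rules concrete, here is a minimal NumPy sketch of K-NN from scratch; the function names knn_classify and knn_regress and the toy arrays are invented purely for illustration and are not part of any library.

# A minimal, illustrative K-NN for classification (majority vote) and regression (average)
import numpy as np
from collections import Counter

def knn_classify(X_known, labels, x_new, k=3):
    distances = np.linalg.norm(X_known - x_new, axis=1)   # Euclidean distance to every known sample
    nearest = np.argsort(distances)[:k]                   # indices of the k closest samples
    return Counter(labels[nearest]).most_common(1)[0][0]  # most common class among them

def knn_regress(X_known, targets, x_new, k=3):
    distances = np.linalg.norm(X_known - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return targets[nearest].mean()                        # average target of the k closest samples

# Tiny made-up dataset: four 2-D points with class labels and numeric targets
X_demo = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_cls = np.array([0, 0, 1, 1])
y_reg = np.array([1.2, 1.4, 5.1, 5.6])

print(knn_classify(X_demo, y_cls, np.array([1.2, 1.5])))  # majority vote among the 3 nearest -> 0
print(knn_regress(X_demo, y_reg, np.array([1.2, 1.5])))   # average of the 3 nearest targets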

Key points:

  • Distance Metric: Common distance metrics include Euclidean distance, Manhattan distance, and Minkowski distance (compared in the short sketch after this list).
  • Choosing K: The value of K is a crucial hyperparameter that needs to be chosen carefully. Smaller K values can lead to noise sensitivity, while larger K values can smooth out the decision boundary.
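
As a quick illustration of how these metrics differ, the sketch below computes each distance between two made-up feature vectors using scipy.spatial.distance (SciPy is installed alongside scikit-learn); the points a and b are arbitrary.

# Comparing common distance metrics on two hypothetical feature vectors
import numpy as np
from scipy.spatial import distance

a = np.array([5.1, 3.5])
b = np.array([6.2, 2.9])

print("Euclidean:", distance.euclidean(a, b))             # straight-line distance
print("Manhattan:", distance.cityblock(a, b))             # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(a, b, p=3))  # generalizes both (p=1 Manhattan, p=2 Euclidean)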

Implementation Example

Suppose we have a dataset that records features like sepal length and sepal width to classify the species of iris flowers.

# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# Example data (Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2]  # Using sepal length and sepal width as features
y = iris.target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Creating and training the KNN model with k=5
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")

# Plotting the decision boundary
def plot_decision_boundary(X, y, model):
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)

    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
    plt.xlabel('Sepal Length')
    plt.ylabel('Sepal Width')
    plt.title('KNN Decision Boundary')
    plt.show()

plot_decision_boundary(X_test, y_test, model)

Explanation of the Code

  1. Libraries: We import the necessary libraries: numpy for numerical arrays, sklearn for the dataset, model, and evaluation metrics, and matplotlib and seaborn for plotting.
  2. Data Preparation: We use the Iris dataset and select sepal length and sepal width as features.
  3. Train-Test Split: We split the data into training and testing sets.
  4. Model Training: We create a KNeighborsClassifier model with k=5 and train it using the training data.
  5. Predictions: We use the trained model to predict the species of iris flowers for the test set.
  6. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
  7. Visualization: We plot the decision boundary to visualize how the KNN classifier separates the classes.

Evaluation Metrics

  • Accuracy: The proportion of correctly classified instances among the total instances.
  • Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives.
  • Classification Report: Provides precision, recall, F1 score, and support for each class; all of these can be derived from the confusion matrix, as sketched below.
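
To see how these metrics relate, the sketch below derives per-class precision, per-class recall, and overall accuracy directly from the confusion matrix; it assumes the conf_matrix variable (and the numpy import) from the example above is still in scope, and the derived variable names are my own.

# Deriving the headline metrics from the confusion matrix
# conf_matrix[i, j] = number of samples of true class i predicted as class j
tp = np.diag(conf_matrix)                           # correct predictions (true positives) per class
precision_per_class = tp / conf_matrix.sum(axis=0)  # column sums = all predictions of each class
recall_per_class = tp / conf_matrix.sum(axis=1)     # row sums = all true samples of each class
overall_accuracy = tp.sum() / conf_matrix.sum()     # correct predictions over all samples

print("Precision per class:", precision_per_class)
print("Recall per class:   ", recall_per_class)
print("Accuracy:           ", overall_accuracy)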

Decision Boundary

The decision boundary plot helps to visualize how the KNN classifier separates the different classes in the feature space. KNN decision boundaries can be quite complex, closely following the local structure of the training data rather than any simple linear separation.
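
One way to see this complexity, and how it ties back to the choice of K, is to reuse the plot_decision_boundary helper from the example above with different K values; the values 1 and 15 below are arbitrary illustrative choices, and the sketch assumes the earlier train/test split is still in scope.

# Smaller K -> more jagged, noise-sensitive boundary; larger K -> smoother boundary
for k in (1, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    plot_decision_boundary(X_test, y_test, knn)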

KNN is intuitive and simple but can be computationally expensive, especially with large datasets, since it requires storing and searching through all training instances during prediction. The choice of K and the distance metric is critical to the model's performance.
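
Because the distance calculation is sensitive to feature scales and both K and the metric need tuning, a common pattern is to standardize the features and search over these hyperparameters with cross-validation. The sketch below shows one way to do this in scikit-learn; the parameter grid is purely illustrative, and algorithm="kd_tree" requests a tree-based neighbor search (instead of brute force) to speed up prediction on larger datasets. It assumes X_train and y_train from the example above are still in scope.

# Scale features, then tune K and the distance metric with cross-validation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", StandardScaler()),                        # put all features on a comparable scale
    ("knn", KNeighborsClassifier(algorithm="kd_tree")),  # tree-based neighbor search
])

param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters: ", search.best_params_)
print("Best CV accuracy:", search.best_score_)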
