AI_Part_5_K-NN
K-Nearest Neighbour (K-NN)

K-NN stands for K-Nearest Neighbour.

Let us imagine we have a scenario where we have two categories already present in our dataset.

One is Category A (Green scatter points), and another is Category B (Yellow scatter points).

We take two feature columns in our dataset, x1 and x2. Now we add a new data point to our dataset. The question is: should it fall in the green category or in the yellow category?

This is where we take the help of K-NN.

A few points about K-NN:

  1. In the K-NN algorithm, we need to specify the number of neighbors.
  2. K-NN is not a linear classifier.
  3. The K-NN prediction boundary does not look like a smooth curve.
  4. In Python (scikit-learn), the class used to create a K-NN classifier is KNeighborsClassifier.
  5. The default value for the number of neighbors is k = 5 (n_neighbors = 5); see the sketch after this list.
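
To illustrate points 4 and 5, here is a minimal sketch; the toy coordinates and labels are invented for illustration:

#A minimal K-NN classifier with the default of five neighbors

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_toy = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])   #two small clusters
y_toy = np.array([0, 0, 0, 1, 1, 1])                                 #0 = Category A, 1 = Category B

knn = KNeighborsClassifier()          #n_neighbors defaults to 5
knn.fit(X_toy, y_toy)
print(knn.predict([[5, 5]]))          #prints [1]: most of the 5 nearest toy points are Category B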


Step-by-step rule guide to K-NN:

Step 1: Choose the number K of neighbors, i.e. decide whether k should be 1, 2, 3, 5, or some other number. One of the most common default values of k is 5.
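
If the default does not suit the dataset, k can also be tuned empirically. Below is a minimal sketch of choosing k by cross-validation; GridSearchCV and the synthetic make_classification data are my additions here, not part of the pipeline later in this article:

#Choosing K by cross-validation (synthetic data; the candidate values are arbitrary)

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples = 200, n_features = 2,
                                     n_informative = 2, n_redundant = 0,
                                     random_state = 0)
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]},
                      cv = 5)
search.fit(X_demo, y_demo)
print(search.best_params_)            #the k with the best cross-validated accuracy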

Step 2: Take the K nearest neighbors of the new data point, according to the Euclidean distance.

Note: We can also use Manhattan distance in place of Euclidean distance.

Euclidean Distance: https://byjus.com/maths/euclidean-distance/

Manhattan Distance: https://www.geeksforgeeks.org/maximum-manhattan-distance-between-a-distinct-pair-from-n-coordinates/
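
For two points p and q, the Euclidean distance is sqrt(sum((p_i - q_i)^2)) and the Manhattan distance is sum(|p_i - q_i|). A minimal NumPy sketch; the two points are arbitrary Age/Salary pairs:

#Euclidean vs. Manhattan distance between two arbitrary points

import numpy as np

p = np.array([30, 87000])
q = np.array([35, 90000])

euclidean = np.sqrt(np.sum((p - q) ** 2))   #straight-line distance, about 3000.0
manhattan = np.sum(np.abs(p - q))           #sum of absolute differences, 3005
print(euclidean, manhattan)

Note how the salary axis dominates both distances; this is exactly why the pipeline below applies feature scaling before fitting the classifier.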

Step 3: Among the K neighbors, count the number of data points in each category.

Note: If we have more than two categories in our dataset, we need to calculate how many fall into each category.

Step 4: Assign the new data point to the category in which you counted the most neighbors. This majority vote among the K nearest neighbors is what gives the algorithm its name.

Step 5: The model is ready.
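
Steps 2 to 4 can be condensed into a short from-scratch sketch; the data points are invented for illustration, and the scikit-learn pipeline that follows does the same work in an optimised way:

#K-NN by hand: distances, K nearest, majority vote

import numpy as np
from collections import Counter

X_known = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6]])
y_known = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
new_point = np.array([5, 5])
k = 5                                                            #Step 1: choose K

distances = np.sqrt(np.sum((X_known - new_point) ** 2, axis=1))  #Step 2: Euclidean distance to every known point
nearest = np.argsort(distances)[:k]                              #Step 2: indices of the K nearest neighbors
votes = Counter(y_known[nearest])                                #Step 3: count neighbors per category
print(votes.most_common(1)[0][0])                                #Step 4: assign the majority category ('B' here)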

#K-Nearest Neighbors (K-NN)        

#Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from matplotlib.colors import ListedColormap

#Importing the dataset

dataset = pd.read_csv('ENTER_THE_NAME_OF_YOUR_DATASET_HERE.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values        

#Splitting the dataset into the Training set and Test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_train)
print(y_train)
print(X_test)
print(y_test)        

#Feature Scaling

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train)
print(X_test)        

#Training the K-NN model on the Training set

classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)   #Minkowski with p = 2 is the Euclidean distance; p = 1 would give Manhattan
classifier.fit(X_train, y_train)

#Predicting a new result (Age = 30, Estimated Salary = 87000)

print(classifier.predict(sc.transform([[30,87000]])))   #the new observation must be scaled with the same scaler as the training data

#Predicting the Test set results

y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))        

#Making the Confusion Matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))   #print the score so it shows up when run as a script
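
The accuracy printed above can also be read straight off the confusion matrix, since the correct predictions sit on its diagonal. A one-line sanity check:

print(np.trace(cm) / cm.sum())   #diagonal (correct predictions) over total; should match accuracy_score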

#Visualising the Training set results

X_set, y_set = sc.inverse_transform(X_train), y_train   #plot in the original (unscaled) units
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 1),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 1))   #a coarser step (e.g. 100 for the salary axis) renders much faster
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('green', 'yellow')))   #colour each grid point by its predicted category
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('green', 'yellow'))(i), label = j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()        

#Visualising the Test set results

X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 1),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 1))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('green', 'yellow')))   #same colours as the training plot for consistency
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('green', 'yellow'))(i), label = j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()        
