k-Nearest Neighbors in Machine Learning (k-NN)

k-Nearest Neighbors in Machine Learning (k-NN)

K-Nearest Neighbors is one of the most basic classification algorithms in Machine Learning. Link to Actual Post - Knn

Code and Data-set Link : Github - YadsmiC | Kaggle - StuffByYC

Code and Data-set also available in Downloads section of website - BlogbyYC

In this Article we will understand how K-Nearest Neighbors algorithm works. How to Implement it with Python.

KNN algorithm uses "feature similarity" to predict the values of new data-points based on distance.

To see type of distance used in distance based model go to: Type of Distances used in Machine Learning algorithm

Understanding K-Nearest Neighbor

What is K ?

k is the number of neighbours which will help us to decide the class of the new data-point.

Algorithm Steps:

Step 1: Choose the values of "K" that is. the number of neighbor which will be used to predict the resulting class.

Ex. Lets suppose we choose value of k as 5.

There are 2 classes ["Cat" , "Dog" ]. We have to predict if the new data point belongs to class "Cat" or "Dog".

Step 2: Calculate the distance from the new data point to remaining data points and take k number of the shortest distant neighboring data point. you can choose which distance formula to use. Type of Distances used in Machine Learning algorithm

Ex. Lets suppose the 5 nearest data-point to the New data point "N( x,y ) " are "A", "B", "C", "D", "E" that we calculated using euclidean distance method.

Step 3: After getting the "K" nearest data-points, count the categories of the data point ( That is count how many of those k neighbors belong to which categories).

Ex. The class of data points are

A - "Cat"

B - "Dog"

C - "Cat"

D - "Dog"

E - "Dog"

Now count the categories.

"Cats" - 2 and "Dog" - 3

Step 4: Assign the new data point "N" with the class having maximum categories count

Ex. Here category or class "Dog" has the maximum number of category count. "3" out of "5" of the nearest data-point belong to category "dog"

So the new data point "N" will be assigned to class "Dog"

Let me know in comments if you have any difficulty understanding this.

Implementing K-Nearest Neighbor

Below we will Implement KNN Algorithm using Sklearn Library.

The task is to Predict if the Customer will purchase the product or not

Code and Data-set Link : Github - StuffByYC | Kaggle StuffByYC

Code and Data-set also available in Downloads section of this website - BlogbyYC

You need to install the Sklearn library

Open command prompt

pip install scikit-learn

Step 1: We will Import the Libraries

Step 2: Importing the data ( You can find the Sales.csv file in the Github Link above )

Step 3: Splitting Data into training and testing data

Here Test_size = 0.25. It means the out of the dataset that data Allocated for testing is 25%

Step 4: Feature Scaling

What is Feature Scaling ?

As we have learned above we are using distance to predict a particular class. Now in this particular data-set we have salary with value ranging from thousand to 100 thousand and the value of age is within 100.

As salary has a wide range from 0 to 150,000+. Where as, Age has a range from 0 - 60 the distance calculated will be dominated by salary column which will result in erroneous model so we need feature scaling

Feature Scaling is a technique to standardize/normalize the independent features (data columns) present in the data in a fixed range.

Below is the sample transition of data from

X(data) -> X (After train_test_split) -> X (After feature scaling)

Step 5: Fitting the model

Here we will be using euclidean distance so we have set p = 2

Click on : Type of Distance to know more.

K value is set to 5

Step 6: Predicting the test result using the Classifier / Model

Step 7: Calculating Performance Metrics to Evaluate model

Click on Performance metrics to know more

Confusion matrix

True Positive (TP): Result data is positive, and is predicted to be positive.

False Negative (FN): Result data is positive, but is predicted negative.

False Positive (FP): Result data is negative, but is predicted positive.

True Negative (TN): Result data is negative, and is predicted to be negative.

Click on confusion matrix to know more

Recall - 85.29%

Accuracy - 90.53%

Precision - 87.88%

F1 Score

Have successfully implemented K-Nearest Neighbors in python with the help of scikit-learn Library. check out Advantages and Disadvantages of K-Nearest Neighbors

#yadsmic #blog #knn #MachineLearning #ML #yadnesh #YC #YadsmiC #blogbyyc

要查看或添加评论,请登录

Yadnesh Choudhary的更多文章

  • Microsoft Azure Certifications for Free

    Microsoft Azure Certifications for Free

    Discover Microsoft's free, in-depth, virtual training covering a range of technical topics for Azure, Microsoft 365…

社区洞察

其他会员也浏览了