KNN - K Nearest Neighbour
What's the need:
KNN is a Supervised Learning algorithm. It is useful when the boundary between the classes is non-linear, so a linear classifier such as Logistic Regression does not fit well. The need arises because in some cases (as shown in the diagram) the classes cannot be separated by a single straight line.
How does the algorithm work:
KNN (K Nearest Neighbours) is one of the simplest Supervised Machine Learning algorithms and is mainly used for classification. It classifies a data point based on how its neighbours are classified, where "nearest" is decided by the distance between points.
KNN is based on feature similarity. The 'K' refers to the number of nearest neighbours that take part in the majority vote.
In the diagram, if we choose K = 3, the '?' is labelled Square, since squares are in the majority inside the K = 3 circle. If we choose K = 7, the '?' is labelled Triangle, since triangles are in the majority inside the K = 7 circle.
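The majority vote can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name knn_predict, the coordinates and the labels are all made up so that the result mirrors the diagram (Square for K = 3, Triangle for K = 7).

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep only the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most frequent label among those neighbours wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: two squares lie closest to the query,
# but triangles dominate once the neighbourhood grows.
points = [
    (1.0, 0.0), (0.0, 1.1), (-1.5, 0.0), (2.0, 0.0),
    (0.0, -2.2), (-2.4, 0.0), (0.0, 2.6), (3.0, 0.0),
]
labels = [
    "Square", "Square", "Triangle", "Triangle",
    "Triangle", "Triangle", "Square", "Square",
]
query = (0.0, 0.0)

print(knn_predict(points, labels, query, k=3))  # Square
print(knn_predict(points, labels, query, k=7))  # Triangle
```

With K = 3 the vote is two squares against one triangle; with K = 7 it is four triangles against three squares, so the label flips.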
The distance between two points is usually calculated with Euclidean Distance for continuous variables. For categorical variables, Hamming Distance (the number of features whose values differ) is used.
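Both distance measures are short enough to write out by hand. The sketch below is for illustration only; the function names and the sample values are assumptions, not part of any package.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with continuous features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming_distance(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(hamming_distance(("red", "small", "yes"), ("red", "large", "no")))  # 2
```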
To choose K (for a starting point):
- Take the square root of n, where n is the total number of data points
- Take an odd value of K, so that a vote between two classes cannot end in a tie
When choosing the value of K, keep in mind that if K is too small, the neighbourhood is sensitive to noise points, and if K is too large, the neighbourhood may include points from other classes.
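The two rules of thumb above can be combined into a small helper. This is only a starting point for tuning, and the name starting_k and the choice to round up to the next odd number are assumptions made for this sketch.

```python
import math

def starting_k(n_samples):
    """Heuristic first guess for K: about sqrt(n), nudged to an odd value."""
    k = max(1, round(math.sqrt(n_samples)))
    # An even K can tie in a two-class vote, so move to the next odd number.
    if k % 2 == 0:
        k += 1
    return k

print(starting_k(100))  # 11 (sqrt is 10, which is even)
print(starting_k(50))   # 7  (sqrt is about 7.07)
```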
Feature Scaling
Feature Scaling is of prime importance to ensure that one feature doesn't overshadow the others. For any algorithm where distance plays a vital role in prediction or classification, the features have to be scaled. KNN is no exception to this.
For example: suppose there are three variables - Age (10-100 years), Weight (10-120 kg), Salary (3,00,000 - 30,00,000 INR). In this case, the distance will be driven almost entirely by the feature with the largest range, i.e. Salary, and Age and Weight will barely influence which neighbours are chosen. To avoid the resulting misclassification, we should normalize the feature variables.
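The effect is easy to see on a toy example. The sketch below uses min-max normalization, which is one common choice among several; the helper name min_max_scale and the three made-up records are assumptions for illustration.

```python
import math

def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]

# Hypothetical records: (age in years, weight in kg, salary in INR)
people = [
    (25, 60, 500_000),   # person 0
    (60, 110, 510_000),  # person 1: very different age and weight, similar salary
    (26, 62, 900_000),   # person 2: similar age and weight, different salary
]

# Unscaled: salary dominates, so person 1 looks closest to person 0.
raw_01 = math.dist(people[0], people[1])
raw_02 = math.dist(people[0], people[2])
print(raw_01 < raw_02)  # True

# Scaled: all three features contribute, and person 2 becomes the closer one.
scaled = min_max_scale(people)
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])
print(scaled_02 < scaled_01)  # True
```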
Algorithm in R:
For KNN, install the package "class" in R. Its knn() function takes the training set and the test set together, so training and prediction happen in a single call.
Steps to be followed:
1. Split the data into Test and Train
2. Feature Scale
3. Fit KNN to the training dataset and predict the test set
4. Model Evaluation - Choosing the Right K (Parameter Tuning)
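The four steps can be walked through end to end in a short sketch. It is written in plain Python rather than R, with a hand-written classifier standing in for the knn() call from the R package; the synthetic data, the 80/20 split, the seed and the candidate K values are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def scale_with(rows, lows, highs):
    """Min-max scale `rows` using previously computed column bounds."""
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Synthetic two-class data (made up): (age, salary) with labels "A" and "B".
data = [((rng.gauss(30, 5), rng.gauss(400_000, 50_000)), "A") for _ in range(50)]
data += [((rng.gauss(50, 5), rng.gauss(800_000, 50_000)), "B") for _ in range(50)]

# 1. Split the data into Train (80%) and Test (20%).
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]
train_x, train_y = [x for x, _ in train], [y for _, y in train]
test_x, test_y = [x for x, _ in test], [y for _, y in test]

# 2. Feature scale, taking the bounds from the training set only.
columns = list(zip(*train_x))
lows = [min(col) for col in columns]
highs = [max(col) for col in columns]
train_x = scale_with(train_x, lows, highs)
test_x = scale_with(test_x, lows, highs)

# 3. Fit and predict, and 4. evaluate the accuracy for several odd values of K.
accuracy_by_k = {}
for k in (1, 3, 5, 7, 9):
    predictions = [knn_predict(train_x, train_y, q, k) for q in test_x]
    correct = sum(p == t for p, t in zip(predictions, test_y))
    accuracy_by_k[k] = correct / len(test_y)

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k)
print("best K:", best_k)
```

Picking K by its score on the test set keeps the sketch short; a separate validation set or cross-validation is a common alternative when the test score has to remain an unbiased estimate.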
Challenges with KNN
- Scaling issue, which can be overcome by doing feature scaling
- Choosing the right K (an odd number is usually taken for the value of K)