KNN Points to Keep
All the points you need for kNN, in one place.
In the next post I will break down the mathematical side of kNN. Enjoy!
Points To Keep:
--- [ k-NN is a non-parametric, instance-based, lazy, supervised algorithm ] ---
1. The kNN algorithm is widely used to benchmark more complex algorithms like SVMs, ANNs, CNNs, and deep networks
2. It is used in a variety of applications, such as:
* Data Compression
* Genetics - gene function prediction
* Aviation - air traffic flow prediction
* Security - detecting distress in a person's mood
------------------------------------------------
* Handwriting Detection (like OCR)
* Image Recognition
* Video Recognition
* Recommendation Engines - using clickstream data from websites, the kNN algorithm has been used to provide automatic recommendations to users on additional content (not a good fit if the data is big)
* Finance - on credit data, kNN can help banks assess the risk of a loan to an organization or individual; it can be used to determine the credit-worthiness of a loan applicant
-
1. kNN takes more computation at test time than at train time because:
* In the training phase, kNN only stores the feature vectors and class labels of the training samples
* In the testing phase, a test point is classified by assigning the label that is most frequent among the training samples nearest to that query point, so this step requires the bulk of the computation
---------------------------------------
In the worst case, the computational complexity of classifying a new sample grows linearly with the number of samples in the training dataset (see the sketch below)
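A minimal sketch of this train/test asymmetry, assuming scikit-learn and NumPy are available (the dataset sizes and random data are purely illustrative):

```python
# fit() is cheap for kNN -- it only stores the training data --
# while predict() does the real work of computing distances
# from each query point to every training sample.
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 20))    # feature vectors
y_train = rng.integers(0, 2, size=50_000)  # class labels
X_test = rng.normal(size=(1_000, 20))

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X_train, y_train)        # just stores the training set
print(f"fit:     {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
knn.predict(X_test)              # distance computation happens here
print(f"predict: {time.perf_counter() - t0:.3f}s")
```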
-
2. There are many distance metrics which can be used in kNN (sketched in code below), like:
* Euclidean Distance
* Manhattan/L1 Distance -- consider chess: the distance between squares on a chessboard for rooks is measured in Manhattan distance
* Hamming Distance
* Minkowski Distance
* Jaccard Distance
* Chebyshev Distance -- consider chess again: kings and queens use Chebyshev distance
----------------------------------
* Manhattan Distance is designed for calculating distance between real-valued features
* Euclidean Distance treats each feature as equally important
----------------------------------
When to use which distance metric:
* In case of "Continuous Variables" ==> Euclidean & Manhattan distances are used
* In case of "Categorical Variables" ==> Hamming distance is used
-
3. The kNN algorithm can be used for both Classification and Regression problems
In the regression case, the prediction can be based on the mean/median of the k most similar instances
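A minimal sketch of kNN regression, assuming NumPy: the prediction is the mean of the targets of the k most similar training instances (the median works the same way).

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every sample
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    return y_train[nearest].mean()                     # mean of their targets

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.1, 1.9, 3.2, 10.5])
print(knn_regress(X, y, np.array([2.5]), k=3))  # ~ (1.1 + 1.9 + 3.2) / 3
```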
-
4. k-NN performs much better if:
* the data is all on the same scale
* the number of inputs is small; it struggles when the number of inputs becomes large/very large
-
* k-NN makes no assumptions about the functional form of the problem being solved, i.e. Y = f(x)
* One can use the k-NN algorithm to impute missing values for both Continuous and Categorical variables (sketched below)
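A sketch of kNN-based imputation for continuous variables, assuming scikit-learn's KNNImputer is available: each missing value is filled with the mean of that feature over the k nearest complete neighbours. (For categorical variables one would instead take the mode/majority value of the neighbours.)

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # missing value to impute
              [3.0, 6.0],
              [4.0, 8.0]])

# fills the NaN with the mean of the 2 nearest rows' values
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```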
-
A simple kNN model (i.e., a large k) will lead to Low Variance and High Bias
-
kNN is very likely to overfit due to the Curse of Dimensionality, so one can use either Dimensionality Reduction or Feature Selection to overcome it
---------------------------------------------
Overfitting means:
the model performs well on training data but is not generalized enough to give the same results on new/unseen data
-
Points to keep in mind while selecting the "value of k" in kNN:
* if the given value of k is "Very Large" -- (Bias will Increase) -- the model can include points of other classes in the neighborhood
* if the given value of k is "Very Small" -- (Variance will Increase) -- the model can become very sensitive to noise
------------------------------------------
The decision boundary becomes smoother as the value of k increases
-
If not restricted to a single sample, one can draw many samples from the original dataset; a simple variance reduction method is to "sample many times", and then
* simply take a majority vote of the kNN models fit to each of these samples to classify each test data point
* This *variance reduction method* is called bagging
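A sketch of that bagging idea, assuming scikit-learn: fit many kNN models on bootstrap samples of the training data and take a majority vote over their predictions (the dataset here is synthetic, just for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagged_knn = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=3),  # base model fit on each bootstrap sample
    n_estimators=20,                      # number of resampled datasets
    random_state=0,
)
bagged_knn.fit(X, y)
print(bagged_knn.predict(X[:5]))          # majority vote across the 20 models
```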
-
The training time for any value of k in the kNN algorithm is the same, so the relation given below is false:
1-NN < 2-NN < 3-NN
-
k represents the number of nearest neighbours one wants to use to predict the class of a given item that arrives as unseen data for the model
(Figure: effect of changing k)
-
To find the optimal value of k in kNN one has to experiment with different values of k.
* the optimal k value can be obtained by
HYPERPARAMETER TUNING
The optimal k value mostly depends upon the dataset itself; for different data and cases the optimal k will change -- there is no one final method here
One can try:
* SQUARE ROOT METHOD: take the square root of the number of training samples and assign it to the k value
* CROSS-VALIDATION METHOD: start with the minimum value of k, say k=1, and run cross-validation until you get the best accuracy (see the sketch after this list)
In general, as the value of k increases the error goes down at each step and then flattens out; pick the k at that point -- this is known as the [Elbow Method]
* Domain Knowledge: you vs you
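A sketch of the cross-validation method, assuming scikit-learn (the iris dataset is just a stand-in): try increasing values of k and keep the one with the best mean CV accuracy; plotting error against k gives the "elbow".

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()  # 5-fold CV accuracy

best_k = max(scores, key=scores.get)
print(f"best k = {best_k} with CV accuracy {scores[best_k]:.3f}")
```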
-
"Odd Values of k" should be preferred over Even Values in order to ensure that there are no ties in voting.
* k shoud be square root of n(number of datapoints in trainig data)
* If Square Root of a number of data points is even, then Add or Subtract one(1) with it to make it odd
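That rule of thumb in a tiny sketch (the helper name is just illustrative):

```python
import math

def suggest_k(n_train):
    k = round(math.sqrt(n_train))      # square root of the training set size
    return k if k % 2 == 1 else k + 1  # make it odd to avoid voting ties

print(suggest_k(100))  # sqrt(100) = 10 is even -> 11
```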
-
Steps for how the kNN algorithm makes predictions on an unseen dataset
-------------------------------------------------
The steps below are repeated during each iteration of the algorithm.
For each unseen data point, the kNN classifier will:
S1: calculate the distance of the test data point to all training data points and store them
S2: sort all stored distances in increasing order
S3: keep the "k nearest" points from the training dataset
S4: calculate the proportion of each class among them
S5: assign the class with the highest proportion
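A from-scratch sketch of steps S1-S5 for one unseen point, using plain NumPy (the toy data is illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # S1: distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # S2 + S3: sort in increasing order and keep the k nearest
    nearest = np.argsort(dists)[:k]
    # S4: count of each class among the k neighbours
    counts = Counter(y_train[nearest])
    # S5: assign the class with the highest proportion
    return counts.most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))  # -> "a"
```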
-
Feature Scaling is required to get better performance out of the kNN algorithm
----------------------------------------
Say:
* we have a dataset with n data points and N features
* one feature has values in the range 0 to 1
* another feature has values in the range -100 to 100
These different ranges will affect distance-based metrics like Euclidean distance, as higher weight will be given to variables with a higher magnitude/range.
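A sketch of the fix, assuming scikit-learn's MinMaxScaler: without it, the feature in [-100, 100] dominates the Euclidean distance over the feature in [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.2, -80.0],
              [0.9, -75.0],
              [0.5,  90.0]])

scaler = MinMaxScaler()            # rescales every feature to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)                    # both features now contribute comparably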
-
Time Complexity of the kNN Algorithm:
--------------------------------------------
The distance calculation step requires quadratic time (every test point against every training point), and sorting the calculated distances requires "O(N log N) time" per query point
The combined process is therefore roughly an O(N^2 log N) process, which is expensive
-
Space Complexity of the kNN Algorithm:
--------------------------------------------
As kNN keeps the entire training set in memory (and the computed distances at prediction time), memory is a problem; a machine can crash if the data is large
-
kNN is called a Lazy Supervised Algorithm / Lazy Learner because:
------------------------------------------
when this algorithm gets training data, it does not learn anything and does no modeling at that time
* it just stores the data
Instead of fitting a function to the training data, it follows "INSTANCE-Based Learning" and goes back to the training data whenever it is required to predict on unseen data
-
Disadvantages of the kNN Algorithm:
________________________________________________
1. Doesn't work well with large datasets / high dimensions
2. Requires Feature Scaling
3. Sensitive to Noise and Outliers
-
It is very much possible to use kNN for Image Processing
-----------------------------------------------
Convert a 3-D image into a single-dimensional vector and then give it as input to kNN, as in the sketch below
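A sketch of that flattening step, assuming scikit-learn and small random 8x8 RGB arrays as stand-in images:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

images = np.random.rand(100, 8, 8, 3)       # 100 small RGB images (hypothetical data)
labels = np.random.randint(0, 2, size=100)  # binary labels

X = images.reshape(len(images), -1)         # flatten each image to a 192-d vector
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(knn.predict(X[:1]))
```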
Happy Learning