K-Nearest Neighbors Algorithm: A technical application of the phrase "Birds of a feather flock together"
Dr. Sheetal Sippy
Program Strategy-Life Sciences and Health Care, Deloitte India (Offices of the US)
We often group individuals together because they share certain traits or live in the same vicinity. The KNN model works just like this common intuition. It is a supervised machine learning algorithm often used in classification problems: it classifies a data point based on how its neighboring data points are classified. Pause! Let us unpack that.
Breaking it down
A supervised learning algorithm depends on labeled input data to learn a function that produces an output when given new, unlabeled data. An unsupervised machine learning algorithm, by contrast, takes input data without any labels and relies on the underlying structure of the data to generate insights.
K Nearest Neighbors
The KNN algorithm classifies data points based on how the neighboring data is classified. It is simple to implement and in many cases serves as a predecessor to, or benchmark for, more complicated classifiers like Artificial Neural Networks (ANN) and Support Vector Machines (SVM). KNN is a lazy learning algorithm, i.e., it memorizes the training dataset and has no explicit training phase: predictions are drawn directly from the stored historical examples, which is why KNN is also known as a "case-based learning algorithm." KNN grew out of research conducted for the armed forces: Evelyn Fix and Joseph Hodges of the USAF School of Aviation Medicine introduced the algorithm in a 1951 technical report, proposing a nonparametric approach to classification that relies on the "distance" between points or distributions.
To understand how this works, let us take a trip to Karen's new pet store, Pawsome. Since it is a new store, Karen is trying to engage her customers through several promotional activities. Anticipating high footfall over the weekend, she lays out dog treats in the form of a puzzle.
The treats were kept on display in the following manner, with some of them hidden under a cloth. She is willing to offer a discounted price on the treats to anyone who can guess which shape category the hidden treats belong to. Take a minute and see if you get them all right. If you are unable to classify one, mark it as "Unsure."
Once you’ve predicted, let’s cross-check it with that of below:
1 & 2: Circular treats
3: Unsure if it is a circular treat or a bone-shaped treat
4 & 5: Bone-shaped treats
6 & 7: Unsure if they are bone-shaped treats or heart-shaped treats
8: Heart-shaped treat
If you have guessed it right, you have implemented KNN!
In the image, we can see that similar treats are arranged together. Treats 1 and 2 can easily be classified as circular treats: they are completely surrounded by circular treats, so there is a high probability that the hidden ones are circular too. The same logic applies to 4, 5, and 8. One can therefore conclude that a hidden treat will most likely be the same type as its neighbors. Classifying 3 is trickier, as its neighbors include both circular and bone-shaped treats; there is a similar dilemma for 6 and 7, which could each be either bone-shaped or heart-shaped. From this, it is safe to say that the KNN algorithm predicts the label of a new data point (the hidden treat) based on the labels of its neighbors (circular, bone-shaped, or heart-shaped treats).
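To make this intuition concrete, here is a minimal sketch of the idea in Python. The coordinates and labels below are invented purely to mimic the treat layout; they are not the actual display.

```python
from collections import Counter
import math

# Hypothetical (x, y) positions of the visible treats and their labels,
# loosely mirroring the store display described above.
treats = [
    ((1.0, 4.0), "circular"), ((1.5, 3.5), "circular"), ((2.0, 4.2), "circular"),
    ((4.0, 2.0), "bone"),     ((4.5, 2.5), "bone"),     ((5.0, 1.8), "bone"),
    ((7.0, 4.0), "heart"),    ((7.5, 3.5), "heart"),    ((8.0, 4.2), "heart"),
]

def predict(hidden_point, k=3):
    """Label a hidden treat by majority vote among its k nearest visible treats."""
    # Sort the labeled treats by Euclidean distance from the hidden point.
    by_distance = sorted(treats, key=lambda t: math.dist(hidden_point, t[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

print(predict((1.2, 3.8)))  # surrounded by circular treats -> "circular"
print(predict((5.8, 3.0)))  # between the bone and heart clusters -> "bone", a closer call
```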
Understanding “K”
K is the number of nearest neighbors considered when predicting: their labels decide the label assigned to the point in question. For example, if K = 5, we consider the 5 nearest points and use the majority label among those 5 points as the predicted label.
Considering the above example, the aim now is to assign a label to the hidden point 7.
If we consider K = 3, the three nearest neighbors of point 7 are 2 heart-shaped treats and 1 bone-shaped treat, so the majority vote says "heart-shaped." If we consider K = 5, 3 out of the 5 nearest neighbors are bone-shaped treats, so the vote flips to "bone-shaped." But how do we decide which points count as "nearest" in the first place?
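In code, the flip looks like this; the neighbor labels below are hypothetical, ordered nearest-first to match the counts just described:

```python
from collections import Counter

# Point 7's neighbors, nearest first: 2 heart-shaped, then 3 bone-shaped treats.
neighbors = ["heart", "heart", "bone", "bone", "bone"]

for k in (3, 5):
    majority = Counter(neighbors[:k]).most_common(1)[0][0]
    print(f"K={k}: predicted label is {majority!r}")
# K=3 -> 'heart' (2 of 3 votes); K=5 -> 'bone' (3 of 5 votes)
```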
To classify a hidden data point and assign it a label, the distance between the hidden point and the neighboring points is calculated using a mathematical distance function such as Euclidean distance (the most common metric), Chebyshev distance, or Manhattan distance; the K points with the smallest distances are the "nearest neighbors."
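These three metrics differ only in how they combine the coordinate-wise differences between two points, as a plain-Python sketch makes clear:

```python
def euclidean(p, q):
    # Straight-line distance: square root of the sum of squared differences.
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    # Grid distance: sum of the absolute differences along each axis.
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    # The single largest difference along any one axis.
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, 6)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
print(chebyshev(p, q))  # 4
```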
It is now clear that the classification varies with the value of K, so choosing an appropriate K is an important step when working with this algorithm. This process is known as "parameter tuning." The right K depends on the individual dataset, and the best method of selecting it is to try different values of K and validate the outcomes. A value that is too small increases the probability of overfitting the model, while a large value makes the process computationally expensive.
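One common way to "try different values of K" is cross-validation. Below is a minimal sketch using scikit-learn, with the built-in Iris dataset standing in for whatever labeled data is actually at hand:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate K with 5-fold cross-validation and keep the best.
scores = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"Best K = {best_k} (cross-validated accuracy {scores[best_k]:.3f})")
```

Odd values of K are often preferred for binary problems, since they avoid tied votes.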
Some applications of the KNN Model
1. KNN is widely used by e-commerce and OTT platforms for recommending products, media to consume, and advertisements. For instance, if one purchases a smartphone from Amazon, recommendations for mobile accessories like covers and earphones start surfacing.
2. KNN techniques are often used for theft prevention in modern retail. KNN-based pattern recognition makes it easier to scan for and detect packages hidden at the bottom of a shopping cart at check-out. If a detected object matches an item in the existing database, the price of the spotted product is added to the customer's bill.
3. Another relevant use of this algorithm in the retail industry is identifying patterns in credit card usage. Most transaction-scrutinizing software uses KNN to detect unusual or suspicious activity.
4. More advanced applications of KNN include handwriting detection and voice and image recognition.
Applications of KNN in Healthcare
Medical data is extremely rich and contains multiple features, and these records are huge resource banks for medical research. Medical data holds patterns and relationships that can help enhance the accuracy of diagnostic processes, and research studies around the globe are classifying medical data using KNN-based algorithms. The algorithm can be used to classify and predict the diagnosis of diseases that present similar symptoms, and multiple hypotheses have been tested on variables such as allergies, age, blood pressure, diabetes, and cholesterol, based on historical data.
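Purely as an illustration of how such a study might frame the problem, the sketch below fits a KNN classifier to synthetic patient records; the features and the "diagnosis" label are fabricated for demonstration, not drawn from any real medical dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Fabricated patient features: [age, systolic BP, cholesterol, blood sugar].
rng = np.random.default_rng(0)
X = rng.normal(loc=[50, 130, 200, 100], scale=[12, 15, 30, 20], size=(200, 4))
y = (X[:, 3] + 0.5 * X[:, 1] > 170).astype(int)  # toy "diagnosis" label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardizing features matters for KNN: otherwise large-valued features
# (e.g., cholesterol) dominate the distance calculation.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```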
Advantages and Limitations of the algorithm
To summarize, the K-nearest neighbors algorithm is a simple classification technique with a wide array of applications. Despite its simplicity, it can produce competitive results and can also be used for regression. Classification is based on stored instances, so there is no need to build an abstract model from a training dataset. The classification process can, however, be computationally expensive, which leaves room for improvement and modification.
Thank you for reading this piece; feel free to drop a comment or reach out to me at [email protected] :)
References
1. Gupta, S. (2019, May 29). KNN Machine Learning Algorithm Explained. Retrieved October 19, 2020, from https://in.springboard.com/blog/knn-machine-learning-algorithm-explained/
2. Euclidean vs Chebyshev vs Manhattan Distance. (2012, May 22). Retrieved from https://lyfat.wordpress.com/2012/05/22/euclidean-vs-chebyshev-vs-manhattan-distance/
3. M’Haimdat, O. (2020, May 12). Understand the Fundamentals of the K-Nearest Neighbors (KNN) Algorithm. Retrieved October 19, 2020, from https://heartbeat.fritz.ai/understand-the-fundamentals-of-the-k-nearest-neighbors-knn-algorithm-533dc0c2f45a
4. Medical Health Big Data Classification Based on KNN Classification Algorithm. (n.d.). Retrieved October 19, 2020, from https://ieeexplore.ieee.org/document/8911389
5. Schott, M. (2020, February 27). K-Nearest Neighbors (KNN) Algorithm for Machine Learning. Retrieved October 19, 2020, from https://medium.com/capital-one-tech/k-nearest-neighbors-knn-algorithm-for-machine-learning-e883219c8f26