KNN Points to Keep
All the points you need for kNN, in one place.
In the next post I will break down the mathematical side of kNN. Enjoy!
Points To Keep:
--- [ k-NN is a non-parametric, instance-based, lazy, supervised algorithm ] ---
1. The kNN algorithm is widely used to benchmark more complex algorithms like SVMs, ANNs, CNNs, and deep networks
2. It is used in a variety of applications, such as:
* Data Compression
* Genetics - gene function prediction
* Aviation - air traffic flow prediction
* Security - detecting distress in a person's mood
------------------------------------------------
* Handwriting Detection (like OCR)
* Image Recognition
* Video Recognition
* Recommendation Engines - using clickstream data from websites, the kNN algorithm has been used to provide automatic recommendations to users on additional content (not a good fit if the data is big)
* Finance - on credit data, kNN can help banks assess the risk of a loan to an organization or individual; it can be used to determine the credit-worthiness of a loan applicant
-
1. kNN takes more computation at test time than at train time because:
* In the training phase, kNN only stores the feature vectors and class labels of the training samples
* In the testing phase, a test point is classified by assigning the label that is most frequent among the training samples nearest to that query point, so this step requires the bulk of the computation
---------------------------------------
In the worst case, the computational complexity of classifying a new sample grows linearly with the number of samples in the training dataset (see the sketch below)
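A minimal sketch of this train/test asymmetry, assuming scikit-learn and NumPy are available (the dataset sizes and random data are purely illustrative):

```python
# fit() is cheap for kNN -- it only stores the training data --
# while predict() does the real work of computing distances
# from each query point to every training sample.
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 20))    # feature vectors
y_train = rng.integers(0, 2, size=50_000)  # class labels
X_test = rng.normal(size=(1_000, 20))

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X_train, y_train)        # just stores the training set
print(f"fit:     {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
knn.predict(X_test)              # distance computation happens here
print(f"predict: {time.perf_counter() - t0:.3f}s")
```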
-
2. There are many distance metrics which can be used in kNN (sketched in code below), like:
* Euclidean Distance
* Manhattan/L1 Distance -- consider chess: the distance between squares on a chessboard for rooks is measured in Manhattan distance
* Hamming Distance
* Minkowski Distance
* Jaccard Distance
* Chebyshev Distance -- consider chess again: kings and queens use Chebyshev distance
----------------------------------
* Manhattan Distance is designed for calculating distance between real-valued features
* Euclidean Distance treats each feature as equally important
----------------------------------
When to use which distance metric:
* In case of "Continuous Variables" ==> Euclidean & Manhattan distances are used
* In case of "Categorical Variables" ==> Hamming distance is used
-
3. The kNN algorithm can be used for both Classification and Regression problems
In the regression case, the prediction can be based on the mean/median of the k most similar instances
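A minimal sketch of kNN regression, assuming NumPy: the prediction is the mean of the targets of the k most similar training instances (the median works the same way).

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every sample
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    return y_train[nearest].mean()                     # mean of their targets

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.1, 1.9, 3.2, 10.5])
print(knn_regress(X, y, np.array([2.5]), k=3))  # ~ (1.1 + 1.9 + 3.2) / 3
```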
-
4. k-NN performs much better if:
* the data is all on the same scale
* the number of inputs is small; it struggles when the number of inputs becomes large/very large
-
* k-NN makes no assumptions about the functional form of the problem being solved, i.e. Y = f(x)
* One can use the k-NN algorithm to impute missing values for both Continuous and Categorical variables (sketched below)
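A sketch of kNN-based imputation for continuous variables, assuming scikit-learn's KNNImputer is available: each missing value is filled with the mean of that feature over the k nearest complete neighbours. (For categorical variables one would instead take the mode/majority value of the neighbours.)

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # missing value to impute
              [3.0, 6.0],
              [4.0, 8.0]])

# fills the NaN with the mean of the 2 nearest rows' values
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```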
-
A simple kNN model (i.e., a large k) will lead to Low Variance and High Bias
-
kNN is very likely to overfit due to the Curse of Dimensionality, so one can use either Dimensionality Reduction or Feature Selection to overcome it
---------------------------------------------
Overfitting means:
the model performs well on training data but is not generalized enough to give the same results on new/unseen data
-
Points to keep in mind while selecting the "value of k" in kNN:
* if the given value of k is "Very Large" -- (Bias will Increase) -- the model can include points of other classes in the neighborhood
* if the given value of k is "Very Small" -- (Variance will Increase) -- the model can become very sensitive to noise
------------------------------------------
The decision boundary becomes smoother as the value of k increases
-
If not restricted to a single sample, one can draw many samples from the original dataset; a simple variance reduction method is to "sample many times", and then
* simply take a majority vote of the kNN models fit to each of these samples to classify each test data point
* This *variance reduction method* is called bagging
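A sketch of that bagging idea, assuming scikit-learn: fit many kNN models on bootstrap samples of the training data and take a majority vote over their predictions (the dataset here is synthetic, just for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagged_knn = BaggingClassifier(
    KNeighborsClassifier(n_neighbors=3),  # base model fit on each bootstrap sample
    n_estimators=20,                      # number of resampled datasets
    random_state=0,
)
bagged_knn.fit(X, y)
print(bagged_knn.predict(X[:5]))          # majority vote across the 20 models
```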
-
The training time for any value of k in the kNN algorithm is the same, so the relation given below is false:
1-NN < 2-NN < 3-NN
-
k represents the number of nearest neighbours one wants to use to predict the class of a given item that arrives as unseen data for the model
(Figure: effect of changing k)
-
To find the optimal value of k in kNN one has to experiment with different values of k.
* the optimal k value can be obtained by
HYPERPARAMETER TUNING
The optimal k value mostly depends upon the dataset itself; for different data and cases the optimal k will change -- there is no one final method here
One can try:
* SQUARE ROOT METHOD: take the square root of the number of training samples and assign it to the k value
* CROSS-VALIDATION METHOD: start with the minimum value of k, say k=1, and run cross-validation until you get the best accuracy (see the sketch after this list)
In general, as the value of k increases the error goes down at each step and then flattens out; pick the k at that point -- this is known as the [Elbow Method]
* Domain Knowledge: you vs you
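A sketch of the cross-validation method, assuming scikit-learn (the iris dataset is just a stand-in): try increasing values of k and keep the one with the best mean CV accuracy; plotting error against k gives the "elbow".

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()  # 5-fold CV accuracy

best_k = max(scores, key=scores.get)
print(f"best k = {best_k} with CV accuracy {scores[best_k]:.3f}")
```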
-
"Odd Values of k" should be preferred over Even Values in order to ensure that there are no ties in voting.
* k shoud be square root of n(number of datapoints in trainig data)
* If Square Root of a number of data points is even, then Add or Subtract one(1) with it to make it odd
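That rule of thumb in a tiny sketch (the helper name is just illustrative):

```python
import math

def suggest_k(n_train):
    k = round(math.sqrt(n_train))      # square root of the training set size
    return k if k % 2 == 1 else k + 1  # make it odd to avoid voting ties

print(suggest_k(100))  # sqrt(100) = 10 is even -> 11
```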
-
Steps for how the kNN algorithm makes predictions on an unseen dataset
-------------------------------------------------
The steps below are repeated during each iteration of the algorithm.
For each unseen data point, the kNN classifier will:
S1: calculate the distance of the test data point to all training data points and store them
S2: sort all stored distances in increasing order
S3: keep the "k nearest" points from the training dataset
S4: calculate the proportion of each class among them
S5: assign the class with the highest proportion
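A from-scratch sketch of steps S1-S5 for one unseen point, using plain NumPy (the toy data is illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # S1: distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # S2 + S3: sort in increasing order and keep the k nearest
    nearest = np.argsort(dists)[:k]
    # S4: count of each class among the k neighbours
    counts = Counter(y_train[nearest])
    # S5: assign the class with the highest proportion
    return counts.most_common(1)[0][0]

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_predict(X, y, np.array([0.5, 0.5]), k=3))  # -> "a"
```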
-
Feature Scaling is required to get better performance out of the kNN algorithm
----------------------------------------
Say:
* we have a dataset with n data points and N features
* one feature has values in the range 0 to 1
* another feature has values in the range -100 to 100
These different ranges will affect distance-based metrics like Euclidean distance, as higher weight will be given to variables with a higher magnitude/range.
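A sketch of the fix, assuming scikit-learn's MinMaxScaler: without it, the feature in [-100, 100] dominates the Euclidean distance over the feature in [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.2, -80.0],
              [0.9, -75.0],
              [0.5,  90.0]])

scaler = MinMaxScaler()            # rescales every feature to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled)                    # both features now contribute comparably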
-
Time Complexity of the kNN Algorithm:
--------------------------------------------
The distance calculation step requires quadratic time (every test point against every training point), and sorting the calculated distances requires "O(N log N) time" per query point
The combined process is therefore roughly an O(N^2 log N) process, which is expensive
-
Space Complexity of the kNN Algorithm:
--------------------------------------------
As kNN keeps the entire training set in memory (and the computed distances at prediction time), memory is a problem; a machine can crash if the data is large
-
kNN is called a Lazy Supervised Algorithm / Lazy Learner because:
------------------------------------------
when this algorithm gets training data, it does not learn anything and does no modeling at that time
* it just stores the data
Instead of fitting a function to the training data, it follows "INSTANCE-Based Learning" and goes back to the training data whenever it is required to predict on unseen data
-
Disadvantages of the kNN Algorithm:
________________________________________________
1. Doesn't work well with large datasets / high dimensions
2. Requires Feature Scaling
3. Sensitive to Noise and Outliers
-
It is very much possible to use kNN for Image Processing
-----------------------------------------------
Convert a 3-D image into a single-dimensional vector and then give it as input to kNN, as in the sketch below
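A sketch of that flattening step, assuming scikit-learn and small random 8x8 RGB arrays as stand-in images:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

images = np.random.rand(100, 8, 8, 3)       # 100 small RGB images (hypothetical data)
labels = np.random.randint(0, 2, size=100)  # binary labels

X = images.reshape(len(images), -1)         # flatten each image to a 192-d vector
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(knn.predict(X[:1]))
```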
Happy Learning