Birds Of A Feather. Or Do They? - K Nearest Neighbors Validation

To ensure that predictive analytical models deliver the benefits and value of their intended use, organizations must validate these models against the available data, context and assumptions. This should happen at the ideation and creation of the model, to confirm alignment with the business problem statement, and continue through ongoing monitoring so that this alignment is maintained.

[Image omitted] (SlideTeam, 2020)

Training and Testing Dataset

To achieve this, a cleansed dataset with pre-established classifications or outcomes can be divided into two subsets: one to train the model and a second to test and validate its predictions (Berecibar et al., 2016). The basic idea is to evaluate the trained model on data whose outcomes are already known, which gives confidence in its future predictions. Techniques such as holdout validation or cross-validation can be used; cross-validation repeatedly re-splits the dataset and, using voting or aggregation across the folds, establishes the validity of the model.

[Image omitted] (scikit-learn, 2019)
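
To make the split-and-validate idea concrete, here is a minimal sketch in Python using scikit-learn (cited above). The iris dataset is only a stand-in for a real cleansed, labelled dataset, and the 70/30 split and fold count are illustrative choices, not recommendations.

```python
# Minimal holdout and cross-validation sketch with scikit-learn.
# The iris dataset stands in for your own cleansed, labelled data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Holdout: reserve 30% of the records purely for testing/validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))

# Cross-validation: iteratively re-split the data and aggregate the scores.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("5-fold CV mean accuracy:", scores.mean())
```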

The Model

The K-Nearest Neighbors algorithm is one of the more popular methods for classifying new inputs based on an existing labelled dataset. It measures closeness using a distance or similarity function, such as Dice, Jaccard, Kulczynski or Russell-Rao similarity for nominal attributes and Euclidean, Canberra, Chebyshev or Manhattan distance for numerical attributes, and then combines the labels of the nearest neighbors through voting or aggregation, optionally weighted by distance, to produce the prediction (Holmes & Adams, 2002). Establishing the value of K is also very important: too small a K makes the model unstable and prone to overfitting noise in the training data, whereas too large a K over-smooths the decision boundary and is likely to underfit.

[Image omitted] (DATA SCIENCE AND ANALYTICS, 2020)
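
As a rough sketch of how these choices play out in practice, the snippet below (again using scikit-learn, with the iris data as a placeholder) compares several of the numerical distance metrics named above and sweeps a few values of K; the specific metrics, weighting scheme and K values are illustrative assumptions.

```python
# Sketch: comparing distance metrics and values of K for KNN.
# The iris dataset and the specific K values are placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Numerical distance metrics mentioned above, passed by name.
for metric in ("euclidean", "manhattan", "chebyshev", "canberra"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"{metric:>10}: {score:.3f}")

# Sweep K to see the small-K (unstable) vs large-K (over-smoothed) trade-off.
for k in (1, 5, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k:>2}: {score:.3f}")
```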

Checking the Model Performance

Once the model has been established, validating it requires pushing the testing subset through the model and measuring the correctness of the predicted labels under the specified hyperparameters (tweaking them where needed to achieve better accuracy and efficiency). A convenient tool for this is the confusion matrix, which tabulates predicted results against the actual classification categories (Beguería, 2006).

[Image omitted] (Ragan, 2018)

[Image omitted] (Beguería, 2006)
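
To make this concrete, the sketch below pushes the held-out test subset through a fitted KNN model and prints its confusion matrix; the dataset and split parameters are the same illustrative placeholders used in the earlier snippets.

```python
# Sketch: validating on the held-out test subset with a confusion matrix.
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows: actual classes; columns: predicted classes.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```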

The matrix lets us distinguish true positives and true negatives from false positives (Type I errors) and false negatives (Type II errors), and from these counts we can derive the accuracy of the model along with other statistics of interest such as efficiency, the misclassification rate, and the positive and negative predictive power.
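
For a two-class problem, these statistics fall directly out of the four cells of the matrix. The sketch below uses made-up labels purely to show the arithmetic; the formulas follow the standard definitions of accuracy, misclassification rate and predictive power.

```python
# Sketch: deriving summary statistics from a binary confusion matrix.
# The labels below are made-up placeholders for illustration only.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy          = (tp + tn) / (tp + tn + fp + fn)
misclassification = (fp + fn) / (tp + tn + fp + fn)
ppv = tp / (tp + fp)   # positive predictive power (precision)
npv = tn / (tn + fn)   # negative predictive power

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f}, misclassification={misclassification:.2f}")
print(f"PPV={ppv:.2f}, NPV={npv:.2f}")
```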

To ensure that the model produces accurate results, it must be validated and its outputs tested, both initially and on an ongoing basis. The method used to split the dataset, the number of records in each subset and the choice of the number of neighbors all influence the model's behaviour and can thus greatly affect the validity of future predictions. Because these outputs would typically feed decisions with real-life consequences, such as fraud detection and loan approvals, establishing the credibility of the model is imperative.

References

Beguería, S., 2006. Validation and Evaluation of Predictive Models in Hazard Assessment and Risk Management. Natural Hazards, 37(3), pp. 315-329.

Berecibar, M. et al., 2016. Online state of health estimation on NMC cells based on predictive analytics. Journal of Power Sources, Volume 320, pp. 239-250.

DATA SCIENCE AND ANALYTICS, 2020. Classification Series 5 – K-Nearest Neighbors (knn). [Online] Available at: https://dslytics.wordpress.com/2017/11/16/classification-series-5-k-nearest-neighbors-knn/ [Accessed 24 Jan 2020].

Holmes, C. C. & Adams, N. M., 2002. A Probabilistic Nearest Neighbour Method for Statistical Pattern Recognition. Journal of the Royal Statistical Society, 64(2), pp. 295-306.

Ragan, A., 2018. Taking the Confusion Out of Confusion Matrices. [Online] Available at: https://towardsdatascience.com/taking-the-confusion-out-of-confusion-matrices-c1ce054b3d3e [Accessed 24 Jan 2020].

scikit-learn, 2019. Cross-validation: evaluating estimator performance. [Online] Available at: https://scikit-learn.org/stable/modules/cross_validation.html [Accessed 24 Jan 2020].

SlideTeam, 2020. Predictive Modelling Powerpoint Presentation Slides. [Online] Available at: https://www.slideteam.net/predictive-modelling-powerpoint-presentation-slides.html [Accessed 24 Jan 2020].


Greg Coquillo

Product Leader @AWS | Startup Investor | 2X Linkedin Top Voice for AI, Data Science, Tech, and Innovation | Quantum Computing & Web 3.0 | I build software that scales AI/ML Network infrastructure | Content Creator

4 年

Great post Adriaan, I really enjoyed reading it and it was very instructive to a newbie like me!

Vikash Yadav

Building the Future of Real Estate | Founder at Propertyao | Driving Growth for Builders & Agents Through Tech & Marketing

4 年

Agreed.. Nice Post.. Thanks for sharing

Maik L.

Prove | Improve | Contribute | Network |

4 年

Out of the box, on spot. Yes, couldn't agree more! When you find a breakthrough, in a system that could bring effect in forthcoming results, shows your clarity on the work/process and efforts. You not only innovate, but also bring much needed brilliance. Thank you Adriaan Stander for amazing post.

Jacqueline Yeung

Mindset Rebuilding · Mental Health Advocate · Author · Thinker ? Striving to help and support others along my journey!

4 年

100% agreed! Thank you fir sharing!
