Birds Of A Feather. Or Do They? - K Nearest Neighbors Validation
To ensure that predictive analytical models deliver the benefits and value of their intended use, organizations must validate these models against the available data, context and assumptions. This should be done when the model is conceived and built, to ensure alignment with the business problem statement, and then monitored on an ongoing basis to ensure that this alignment is maintained.
(SlideTeam, 2020)
Training and Testing Dataset
To achieve this outcome, a cleansed dataset with pre-established classifications or outcomes can be divided into two subsets: one to train the model and a second to test and validate the predictions (Berecibar, et al., 2016). The basic idea is to test the generated model against data whose outcome is already known, so that its predictions can be checked before it is trusted with future inputs. Various techniques can be used, such as Holdout, which sets aside a single test subset, or Cross Validation, which iteratively trains and tests on different partitions of the dataset and, using voting or aggregation, establishes the validity of the model.
(scikit-learn, 2019)
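As a rough illustration of the two splitting approaches, the sketch below uses only the Python standard library. The function names `holdout_split` and `k_fold_splits` and the toy `data` list are assumptions made for this example and do not come from any of the cited sources; libraries such as scikit-learn offer their own, more complete utilities for the same job.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them once into (train, test) subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_splits(rows, k=5, seed=0):
    """Yield k (train, test) pairs so that every row is tested exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test

# Hypothetical labelled dataset: (feature_vector, label) pairs.
data = [((float(i), float(i % 3)), "a" if i < 10 else "b") for i in range(20)]

train, test = holdout_split(data, test_fraction=0.25)
folds = list(k_fold_splits(data, k=5))

print(len(train), len(test))              # sizes of the single holdout split
print([len(t) for _, t in folds])         # size of each cross-validation test fold
```

With Holdout the model is scored once on `test`. With Cross Validation it would be trained and scored once per fold and the scores aggregated, for example by averaging.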
The Model
The K-Nearest Neighbor algorithm classifies future inputs based on existing datasets and is one of the more popular classification methods. It can use various distance techniques, such as Dice, Jaccard, Kulczynski and Russell-Rao Similarity for Nominal types and Euclidean, Canberra, Chebychev and Manhattan Distance for Numerical types, together with voting, distance weighting and grouping to arrive at the resulting classification (Holmes & Adams, 2002). Establishing the value of K is also very important. Too small a value makes the model unstable and sensitive to noise, so it tends to overfit the training data, whereas too large a value smooths over local structure and tends to underfit.
(DATA SCIENCE AND ANALYTICS, 2020)
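To make the effect of K and of the distance measure concrete, here is a minimal sketch of a majority-vote nearest-neighbor classifier in plain Python. It is not the implementation used by any of the cited sources; the function names and the toy data, including the deliberately noisy point, are assumptions chosen purely for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences between two numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3, distance=euclidean):
    """Label `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda row: distance(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters plus one noisy "a" inside the "b" cluster.
train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((6.5, 7.0), "b"), ((7.0, 6.0), "b"),
    ((5.5, 5.5), "a"),  # noisy point
]
query = (5.6, 5.6)

for k in (1, 3, 7):
    print(k, knn_predict(train, query, k=k))
print("manhattan, k=3:", knn_predict(train, query, k=3, distance=manhattan))
```

In this toy set, K=1 follows the single noisy neighbor and returns "a", K=3 returns "b" for the surrounding cluster, and K=7 uses every point and simply returns the overall majority class "a". That is the small-K instability and large-K over-smoothing described above, on a very small scale.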
Checking the Model Performance
Once the model has been established, validating it requires running the testing subset of the data through it and measuring the correctness of the output labels for the hyperparameters specified (and tweaking them to achieve better accuracy and efficiency). To achieve this, we can use a confusion matrix, which tabulates the predicted classes against the actual classification categories and so shows how many results were predicted correctly (Beguería, 2006).
(Ragan, 2018)
(Beguería, 2006)
This matrix lets us distinguish True Positive and True Negative matches from False Positives (Type I errors) and False Negatives (Type II errors). It gives us some insight into the accuracy of the model, along with other statistics of interest such as the efficiency, the misclassification rate, and the Positive and Negative predictive power.
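The following sketch shows one way the four cells of a two-class confusion matrix and a few of the derived statistics could be computed. The function names and the label lists are hypothetical, and the formulas are the common textbook definitions rather than a reproduction of any specific table from the cited sources.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, tn, fn) for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # correctly flagged
        elif p == positive:
            fp += 1          # Type I error
        elif a == positive:
            fn += 1          # Type II error
        else:
            tn += 1          # correctly cleared
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Divide safely; None when the denominator is zero."""
    return numerator / denominator if denominator else None

def summary(tp, fp, tn, fn):
    """Derive common statistics from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    return {
        "accuracy": ratio(tp + tn, total),
        "misclassification": ratio(fp + fn, total),
        "positive_predictive_value": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
    }

# Hypothetical test-set labels (1 = positive class, 0 = negative class).
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(actual, predicted)
stats = summary(*counts)
print(counts)
print(stats)
```

Accuracy and misclassification rate always sum to one, while the two predictive values show how far a positive or a negative prediction can be trusted on its own.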
To ensure that the model produces accurate results, it is imperative that the model be validated and its outputs tested, both initially and on an ongoing basis. The method for splitting the dataset, the number of elements chosen and the number of neighbors used in the model all influence the outcome and can thus greatly affect the validity of future predictions. As these outputs would typically feed decisions that affect real-life scenarios such as fraud detection and loan approvals, it is imperative to establish the credibility of the model.
References
Beguería, S., 2006. Validation and Evaluation of Predictive Models in Hazard Assessment and Risk Management. Natural Hazards, 37(3), pp. 315-329.
Berecibar, M. et al., 2016. Online state of health estimation on NMC cells based on predictive analytics. Journal of Power Sources, Volume 320, pp. 239-250.
DATA SCIENCE AND ANALYTICS, 2020. Classification Series 5 – K-Nearest Neighbors (knn). [Online] Available at: https://dslytics.wordpress.com/2017/11/16/classification-series-5-k-nearest-neighbors-knn/ [Accessed 24 Jan 2020].
Holmes, C. C. & Adams, N. M., 2002. A Probabilistic Nearest Neighbour Method for Statistical Pattern Recognition. Journal of the Royal Statistical Society, 64(2), pp. 295-306.
Ragan, A., 2018. Taking the Confusion Out of Confusion Matrices. [Online] Available at: https://towardsdatascience.com/taking-the-confusion-out-of-confusion-matrices-c1ce054b3d3e [Accessed 24 Jan 2020].
scikit-learn, 2019. Cross-validation: evaluating estimator performance. [Online] Available at: https://scikit-learn.org/stable/modules/cross_validation.html [Accessed 24 Jan 2020].
SlideTeam, 2020. Predictive Modelling Powerpoint Presentation Slides. [Online] Available at: https://www.slideteam.net/predictive-modelling-powerpoint-presentation-slides.html [Accessed 24 Jan 2020].