Birds Of A Feather. Or Do They? - K Nearest Neighbors Validation

To ensure that predictive analytical models deliver the benefits and value of their intended use, organizations must validate these models against the available data, context and assumptions. This should happen at the ideation and creation of the model, to confirm alignment with the business problem statement, and continue through ongoing monitoring so that this alignment is maintained.

[Image omitted] (SlideTeam, 2020)

Training and Testing Dataset

To achieve this, a cleansed dataset with pre-established classifications or outcomes can be divided into two subsets: one to train the model and a second to test and validate its predictions (Berecibar et al., 2016). The basic idea is to evaluate the trained model on data whose outcomes are already known, which gives confidence in its future predictions. Techniques such as holdout validation or cross-validation can be used; cross-validation repeatedly re-splits the dataset and, using voting or aggregation across the folds, establishes the validity of the model.

[Image omitted] (scikit-learn, 2019)
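
To make the split-and-validate idea concrete, here is a minimal sketch in Python using scikit-learn (cited above). The iris dataset is only a stand-in for a real cleansed, labelled dataset, and the 70/30 split and fold count are illustrative choices, not recommendations.

```python
# Minimal holdout and cross-validation sketch with scikit-learn.
# The iris dataset stands in for your own cleansed, labelled data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Holdout: reserve 30% of the records purely for testing/validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))

# Cross-validation: iteratively re-split the data and aggregate the scores.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("5-fold CV mean accuracy:", scores.mean())
```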

The Model

The K-Nearest Neighbors algorithm is one of the more popular methods for classifying new inputs based on an existing labelled dataset. It measures closeness using a distance or similarity function, such as Dice, Jaccard, Kulczynski or Russell-Rao similarity for nominal attributes and Euclidean, Canberra, Chebyshev or Manhattan distance for numerical attributes, and then combines the labels of the nearest neighbors through voting or aggregation, optionally weighted by distance, to produce the prediction (Holmes & Adams, 2002). Establishing the value of K is also very important: too small a K makes the model unstable and prone to overfitting noise in the training data, whereas too large a K over-smooths the decision boundary and is likely to underfit.

[Image omitted] (DATA SCIENCE AND ANALYTICS, 2020)
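
As a rough sketch of how these choices play out in practice, the snippet below (again using scikit-learn, with the iris data as a placeholder) compares several of the numerical distance metrics named above and sweeps a few values of K; the specific metrics, weighting scheme and K values are illustrative assumptions.

```python
# Sketch: comparing distance metrics and values of K for KNN.
# The iris dataset and the specific K values are placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Numerical distance metrics mentioned above, passed by name.
for metric in ("euclidean", "manhattan", "chebyshev", "canberra"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"{metric:>10}: {score:.3f}")

# Sweep K to see the small-K (unstable) vs large-K (over-smoothed) trade-off.
for k in (1, 5, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k:>2}: {score:.3f}")
```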

Checking the Model Performance

Once the model has been established, validating it requires pushing the testing subset through the model and measuring the correctness of the predicted labels under the specified hyperparameters (tweaking them where needed to achieve better accuracy and efficiency). A convenient tool for this is the confusion matrix, which tabulates predicted results against the actual classification categories (Beguería, 2006).

[Image omitted] (Ragan, 2018)

[Image omitted] (Beguería, 2006)
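
To make this concrete, the sketch below pushes the held-out test subset through a fitted KNN model and prints its confusion matrix; the dataset and split parameters are the same illustrative placeholders used in the earlier snippets.

```python
# Sketch: validating on the held-out test subset with a confusion matrix.
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows: actual classes; columns: predicted classes.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```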

The matrix lets us distinguish true positives and true negatives from false positives (Type I errors) and false negatives (Type II errors), and from these counts we can derive the accuracy of the model along with other statistics of interest such as efficiency, the misclassification rate, and the positive and negative predictive power.
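
For a two-class problem, these statistics fall directly out of the four cells of the matrix. The sketch below uses made-up labels purely to show the arithmetic; the formulas follow the standard definitions of accuracy, misclassification rate and predictive power.

```python
# Sketch: deriving summary statistics from a binary confusion matrix.
# The labels below are made-up placeholders for illustration only.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy          = (tp + tn) / (tp + tn + fp + fn)
misclassification = (fp + fn) / (tp + tn + fp + fn)
ppv = tp / (tp + fp)   # positive predictive power (precision)
npv = tn / (tn + fn)   # negative predictive power

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f}, misclassification={misclassification:.2f}")
print(f"PPV={ppv:.2f}, NPV={npv:.2f}")
```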

To ensure that the model produces accurate results, it must be validated and its outputs tested, both initially and on an ongoing basis. The method used to split the dataset, the number of records in each subset and the choice of the number of neighbors all influence the model's behaviour and can thus greatly affect the validity of future predictions. Because these outputs would typically feed decisions with real-life consequences, such as fraud detection and loan approvals, establishing the credibility of the model is imperative.

References

Beguería, S., 2006. Validation and Evaluation of Predictive Models in Hazard Assessment and Risk Management. Natural Hazards, 37(3), pp. 315-329.

Berecibar, M. et al., 2016. Online state of health estimation on NMC cells based on predictive analytics. Journal of Power Sources, Volume 320, pp. 239-250.

DATA SCIENCE AND ANALYTICS, 2020. Classification Series 5 – K-Nearest Neighbors (knn). [Online] Available at: https://dslytics.wordpress.com/2017/11/16/classification-series-5-k-nearest-neighbors-knn/ [Accessed 24 Jan 2020].

Holmes, C. C. & Adams, N. M., 2002. A Probabilistic Nearest Neighbour Method for Statistical Pattern Recognition. Journal of the Royal Statistical Society, 64(2), pp. 295-306.

Ragan, A., 2018. Taking the Confusion Out of Confusion Matrices. [Online] Available at: https://towardsdatascience.com/taking-the-confusion-out-of-confusion-matrices-c1ce054b3d3e [Accessed 24 Jan 2020].

scikit-learn, 2019. Cross-validation: evaluating estimator performance. [Online] Available at: https://scikit-learn.org/stable/modules/cross_validation.html [Accessed 24 Jan 2020].

SlideTeam, 2020. Predictive Modelling Powerpoint Presentation Slides. [Online] Available at: https://www.slideteam.net/predictive-modelling-powerpoint-presentation-slides.html [Accessed 24 Jan 2020].


Greg Coquillo

Product Leader @AWS | Startup Investor | 2X Linkedin Top Voice for AI, Data Science, Tech, and Innovation | Quantum Computing & Web 3.0 | I build software that scales AI/ML Network infrastructure | Content Creator

4 年

Great post Adriaan, I really enjoyed reading it and it was very instructive to a newbie like me!

Vikash Yadav

Building the Future of Real Estate | Founder at Propertyao | Driving Growth for Builders & Agents Through Tech & Marketing

4 年

Agreed.. Nice Post.. Thanks for sharing

Maik L.

Prove | Improve | Contribute | Network |

4 年

Out of the box, on spot. Yes, couldn't agree more! When you find a breakthrough, in a system that could bring effect in forthcoming results, shows your clarity on the work/process and efforts. You not only innovate, but also bring much needed brilliance. Thank you Adriaan Stander for amazing post.

Jacqueline Yeung

Mindset Rebuilding · Mental Health Advocate · Author · Thinker ? Striving to help and support others along my journey!

4 年

100% agreed! Thank you fir sharing!
