Is Accuracy simply a similarity test?
Photo by Mika Baumeister on Unsplash

Accuracy has long been a cornerstone metric in machine learning. It plays a vital role in assessing a model’s performance because it offers a clear, simple, and interpretable measure of overall correctness. But how is accuracy seen in the world of data mining?

What is data mining?

  • According to Oracle, data mining is the process of discovering insights when dealing with large volumes of data. This data can come from many sources or a single database, and insights may be generated through manual discovery or automation.
  • The process involves deep analysis of data to discover patterns and underlying factors, in order to draw conclusions and make informed decisions.
  • One of the methods used in data mining to better understand data is the proximity test. So what is that about?

What is a proximity test?

  • In data mining, a proximity test measures either the similarity or the dissimilarity of two objects.
  • A similarity test is a method for quantifying how alike two pieces of data are. It essentially measures the closeness or relationship between data points. This measurement is typically expressed as a numerical score, often ranging from 0 (completely dissimilar) to 1 (identical).
  • There is also the dissimilarity test, a numerical measure of how different two data objects are. Its lower bound cannot be less than 0, while its upper bound varies from one measure to another.

Applications of proximity tests:

Proximity tests are used in different machine learning contexts, such as:

  • Clustering: Grouping similar items together requires some sort of similarity test between elements.
  • Anomaly detection: Identifying outliers, that is, items that deviate significantly from the majority of the data, requires a dissimilarity test between items. The more dissimilar an item is from the rest, the more likely it is to be an outlier.
  • Recommendation systems: Users with similar preferences can be given similar suggestions.
  • Duplicate detection: Proximity tests can also be used in preprocessing, where a similarity test can detect duplicate data.


Similarity measures:

We mentioned similarity tests earlier as one kind of proximity test. There are several measures to choose from, and the right one depends on the type of data we’re dealing with:

I - Numerical Measures:

These measures work on numerical data, that is, data represented by numbers that can be ordered and used in mathematical operations. This includes both continuous and discrete data such as age, height, temperature, etc.

  • Euclidean distance: the straight-line distance between two data points in a space of any dimension (one, two, or higher).
  • Manhattan distance: named after the grid layout of the streets in Manhattan, it is the sum of the absolute differences between the coordinates of two points.

The difference between the two measures can be seen in the following picture:

Difference between Euclidean & Manhattan distance
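The two distances described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names `euclidean_distance` and `manhattan_distance` and the sample points are chosen for this example only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance: square root of the sum of squared differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Grid distance: sum of the absolute differences along each dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical sample points in two dimensions.
p = (0, 0)
q = (3, 4)

print(euclidean_distance(p, q))  # 5.0 (the diagonal)
print(manhattan_distance(p, q))  # 7 (3 steps across plus 4 steps up)
```

For the same pair of points, the Manhattan distance is never smaller than the Euclidean distance, since it follows the grid instead of cutting across it.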

II - Binary Measures:

These measures work on data that can take only two possible values (0 or 1), such as a virus test result, the state of a light switch, etc.

  • Simple Matching Coefficient (SMC): It is the number of matching attributes over the total number of attributes, where a match means both objects have the same value (both 1 or both 0). It is used for symmetric binary attributes, where the two states are equally important.

Formula for Simple Matching Coefficient
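Written out, the general form of the SMC as it is commonly given is:

```latex
\mathrm{SMC} = \frac{\text{number of matching attributes}}{\text{total number of attributes}}
```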

  • Jaccard index: It is used for asymmetric binary attributes, where the two states are not equally important (for instance, when positive (1) and negative (0) are the outcomes of a disease test). The Jaccard index ignores matches on the default state (0).

Formula for Jaccard index
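In the usual notation for two binary objects A and B, where M11 counts the attributes equal to 1 in both, and M01 and M10 count the attributes equal to 1 in only one of them, the Jaccard index is commonly written as:

```latex
J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}
```

The attributes that are 0 in both objects (M00) appear in neither the numerator nor the denominator.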

At first glance the two measures look very similar. The difference lies in what they count as a match: the Jaccard index counts only the attributes that are present (1) in both objects, while the SMC counts the attributes that are present in both as well as those that are absent (0) from both.
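A small Python sketch can make the contrast concrete. The helper names (`match_counts`, `smc`, `jaccard`) and the two sample vectors are assumptions made for this illustration. Returning 0.0 when neither vector contains a 1 is a convention picked for the sketch, not part of the definition.

```python
def match_counts(a, b):
    """Count the four value combinations over two equal-length 0/1 vectors."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for x, y in zip(a, b):
        counts[(x, y)] += 1  # raises KeyError if a value is not 0 or 1
    return counts


def smc(a, b):
    """Matches on 1 and on 0, over all attributes."""
    c = match_counts(a, b)
    return (c[(1, 1)] + c[(0, 0)]) / len(a)


def jaccard(a, b):
    """Matches on 1 only, over attributes where at least one vector has a 1."""
    c = match_counts(a, b)
    denominator = c[(1, 1)] + c[(0, 1)] + c[(1, 0)]
    # Assumed convention for this sketch: 0.0 when no attribute is 1 anywhere.
    return c[(1, 1)] / denominator if denominator else 0.0


# Hypothetical sparse vectors: mostly zeros, one shared 1.
a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(smc(a, b))      # 0.8, because the seven shared zeros count as matches
print(jaccard(a, b))  # 1/3, about 0.33, because shared zeros are ignored
```

On sparse data like this, the shared zeros dominate the SMC, which is why the Jaccard index is usually preferred when the 0 state carries little information.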

Accuracy & SMC:

Accuracy is how close a given set of measurements (observations or readings) ŷ are to their true value y.

  • Let's consider accuracy for binary classification. It is calculated as follows:

Accuracy Formula
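In terms of the confusion-matrix counts (true positives TP, true negatives TN, false positives FP, false negatives FN), the standard formula for binary classification accuracy is:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```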

  • Notice that this is reminiscent of something we saw earlier. The true positives and true negatives together count the observations that matched our predictions, and they are divided by the total number of samples. In other words, it is the number of matching attributes over the total number of attributes, which is the SMC. In detail, the SMC is calculated as follows:

SMC formula in detail
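Using the usual notation, where Mxy is the number of attributes for which the first object has value x and the second has value y, the detailed form is commonly written as:

```latex
\mathrm{SMC} = \frac{M_{00} + M_{11}}{M_{00} + M_{01} + M_{10} + M_{11}}
```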

This is calculated using the SMC matrix, which is strikingly similar to the confusion matrix used for accuracy:

SMC Matrix

M00 is the number of attributes where A and B both have a value of 0, and M11 is the number where both have a value of 1. M01 and M10 count the attributes where A and B disagree. If A holds the observed labels and B the predicted ones, with 1 as the positive class, then M11 = TP and M00 = TN, and therefore Accuracy = SMC (be aware that this holds only in the case of binary classification).
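The equality can be checked numerically. The sketch below treats 1 as the positive class, so the 1-1 matches are the true positives and the 0-0 matches are the true negatives. The function names and the label vectors `y_true` and `y_pred` are made up for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for 0/1 labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """(TP + TN) over the total number of samples."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def smc(a, b):
    """Number of positions where the two vectors agree, over their length."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical observed and predicted labels for eight samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
print(accuracy(y_true, y_pred))          # 0.75
print(smc(y_true, y_pred))               # 0.75
```

Both functions count the same agreements (six out of eight here), just under different names.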

Conclusions:

  • There are many proximity measures, and they are useful in many areas of ML.
  • Proximity measures quantify the similarity or dissimilarity between elements.
  • In the case of binary classification, accuracy computes the similarity between our predicted and observed values (the SMC), which means it is a similarity measure.

References:

  1. Oracle - Data Mining Defined
  2. Rhodes University COMP 465 - Similarity Lecture Slides
  3. Difference between Euclidean & Manhattan distances
  4. Simple Matching Coefficient - Wiki
  5. Accuracy & Precision


  • You can find this same article on Medium



More articles by Taissir B.
