Is Accuracy simply a similarity test?

Taissir B.

Graduate Data Scientist @UH & Flutter Mobile Developer

Published: March 23, 2024

Accuracy has long been a cornerstone metric in machine learning. It plays a vital role in assessing a model's performance, offering a clear measure of overall correctness that is simple and easy to interpret. But how is accuracy seen in the world of data mining?

What is data mining?

According to Oracle, data mining is the process of discovering insights when dealing with large volumes of data. This data can come from many sources or a single database, and insights may be generated through manual discovery or automation.
The process involves deep analysis of data to discover patterns and underlying factors, in order to draw conclusions and make informed decisions.
One of the methods used in data mining to better understand data is the proximity test. So what is that about?

What is a proximity test?

In data mining, a proximity test measures either the similarity or the dissimilarity of two objects.
A similarity test is a method for quantifying how alike two pieces of data are. It essentially measures the closeness or relationship between data points. This measurement is typically expressed as a numerical score, often ranging from 0 (completely dissimilar) to 1 (identical).
There is also the dissimilarity test, a numerical measure of how different two data objects are. Its value cannot be less than 0, while its upper bound varies from one measure to another.
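
The relationship between the two kinds of score can be illustrated with a small sketch. The mapping s = 1 / (1 + d) used here is only one common convention for turning a dissimilarity into a similarity, not something prescribed by this article, and the function name is hypothetical.

```python
def similarity_from_dissimilarity(d: float) -> float:
    """Map a dissimilarity d >= 0 to a similarity in (0, 1].

    Assumption: uses the convention s = 1 / (1 + d); other mappings exist.
    """
    if d < 0:
        raise ValueError("a dissimilarity cannot be negative")
    return 1.0 / (1.0 + d)


# Identical objects (dissimilarity 0) get the maximum similarity.
print(similarity_from_dissimilarity(0.0))  # 1.0
# The larger the dissimilarity, the smaller the similarity.
print(similarity_from_dissimilarity(3.0))  # 0.25
```

With this mapping, an unbounded dissimilarity is squeezed into the 0 to 1 range that similarity scores often use.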

Applications of proximity tests:

Proximity tests are used in different contexts in machine learning, such as:

Clustering: Grouping similar items together requires some sort of similarity test between elements.
Anomaly detection: Identifying outliers, i.e., items that deviate significantly from the majority of the data, requires a dissimilarity test between items: the more dissimilar an item is, the more likely it is to be an outlier.
Recommendation systems: Users with similar preferences can receive similar suggestions.
Duplicate detection: Proximity tests can also be used in preprocessing, where a similarity test can detect duplicate data.

Similarity measures:

We mentioned similarity tests earlier as part of proximity tests. There are different measures for them, depending on the type of data we are dealing with:

I - Numerical Measures:

These measures work on numerical data, that is, data represented by numbers that can be ordered and used in mathematical operations (this includes continuous and discrete data), such as age, height, temperature, etc.

Euclidean distance: the straight-line distance between two data points in a space of any dimension (one, two, or higher).
Manhattan distance: named after the grid layout of the streets in Manhattan, it is the sum of the absolute differences between the coordinates of two points.

The difference between the two measures can be seen in the following picture:

[Figure: Difference between Euclidean & Manhattan distance]
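
As a rough illustration of the two definitions above, here is a minimal Python sketch. The function names and the sample points are hypothetical and chosen only for this example.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of the same dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))


a, b = (1, 2), (4, 6)
print(euclidean_distance(a, b))  # 5.0  (sqrt(3^2 + 4^2))
print(manhattan_distance(a, b))  # 7    (3 + 4)
```

For the same pair of points, the Manhattan distance is never smaller than the Euclidean distance, since it follows the grid rather than the diagonal.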

II - Binary Measures:

These measures work on data that can take only two possible values (0 or 1), such as a virus test result or the state of a light switch.

Simple Matching Coefficient (SMC): It measures how many attributes two objects agree on relative to the total number of attributes, i.e., the number of matching attributes divided by the total number of attributes. It is used for symmetric binary attributes.

Jaccard Index: It is used for asymmetric binary attributes, where the two states are not equally important (for instance, when positive (1) and negative (0) are the outcomes of a disease test). The default state (0) is ignored by the Jaccard index.

At first glance the two measures look very similar, but they differ in what they count: the Jaccard index considers only the attributes that are present (1) in at least one of the two objects and rewards only joint presences, while SMC counts both joint presences and joint absences as matches.
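
The difference between the two coefficients is easier to see on a concrete pair of binary vectors. The sketch below is only illustrative: the helper names are hypothetical, and returning 0.0 for the Jaccard index when neither vector contains a 1 is an assumed convention.

```python
def match_counts(a, b):
    """Return (m00, m01, m10, m11) for two equal-length binary sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    m00 = m01 = m10 = m11 = 0
    for x, y in zip(a, b):
        if x == 0 and y == 0:
            m00 += 1
        elif x == 0 and y == 1:
            m01 += 1
        elif x == 1 and y == 0:
            m10 += 1
        elif x == 1 and y == 1:
            m11 += 1
        else:
            raise ValueError("values must be 0 or 1")
    return m00, m01, m10, m11


def smc(a, b):
    """Simple Matching Coefficient: matches (0-0 and 1-1) over all attributes."""
    m00, m01, m10, m11 = match_counts(a, b)
    return (m00 + m11) / (m00 + m01 + m10 + m11)


def jaccard(a, b):
    """Jaccard index: 1-1 matches over attributes where at least one is 1."""
    m00, m01, m10, m11 = match_counts(a, b)
    denominator = m01 + m10 + m11
    # Assumed convention: similarity is 0.0 when neither vector has a 1.
    return m11 / denominator if denominator else 0.0


a = [1, 1, 0, 0, 0, 0, 0, 0]
b = [1, 0, 1, 0, 0, 0, 0, 0]
print(smc(a, b))      # 0.75 (5 joint absences + 1 joint presence out of 8)
print(jaccard(a, b))  # 1/3, joint absences are ignored
```

The many shared zeros inflate SMC but have no effect on the Jaccard index, which is why the latter suits asymmetric attributes.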

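Accuracy & SMC:

Accuracy is how close a given set of measurements (observations or readings) ŷ is to their true values y.

Let's consider accuracy for binary classification. It is calculated as follows, where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives:

Accuracy = (TP + TN) / (TP + TN + FP + FN)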

This should remind us of something we saw earlier. If you think about it, the true positives and true negatives are the observations that truly matched our predictions, and we divide their count by the total number of samples. So accuracy is the number of matching attributes over the total number of attributes, which is exactly SMC. In detail, SMC is calculated as follows:

SMC = (M00 + M11) / (M00 + M01 + M10 + M11)

This is calculated from the SMC matrix, a 2×2 table of the counts M00, M01, M10, and M11, which is strikingly similar to the confusion matrix used for accuracy.

M00 is the total number of attributes where A and B both have a value of 0, and M11 is the number where both have a value of 1. With 1 as the positive class, this means M11 = TP and M00 = TN, while M01 and M10 together count the mismatches (FP + FN). Therefore Accuracy = SMC (be aware that this holds only in the case of binary classification).
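
The equality can be checked numerically. The following is a small sketch, assuming the observed labels play the role of A and the predictions the role of B; the function names and the sample labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Binary accuracy computed from the four confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return (tp + tn) / (tp + tn + fp + fn)


def smc(a, b):
    """Simple Matching Coefficient: matching positions over all positions."""
    matches = sum(1 for x, y in zip(a, b) if x == y)  # M00 + M11
    return matches / len(a)


# Hypothetical observed labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))  # 0.75 (TP=3, TN=3, FP=1, FN=1)
print(smc(y_true, y_pred))       # 0.75 (6 matches out of 8)
```

Both functions count the same agreeing positions, so they return the same value for any pair of equal-length binary vectors.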

Conclusions:

We have seen that there are many proximity measures, and that they are used in many areas of ML.
Proximity measures help calculate the similarity and dissimilarity between elements.
In the case of binary classification, accuracy calculates the similarity between our predicted and observed values (SMC), meaning it is a similarity measure.

References:

You can find this same article on Medium.
