Cosine Similarity & Classification

Set similarity is a difficult problem to solve with traditional rule-based programming, and it only gets harder as the number of attributes in the dataset grows. Unsurprisingly, math comes to the rescue. Using the cosine function and the K-Nearest Neighbors (KNN) algorithm, we can measure how similar or different two sets of items are and use that measure to classify them. The cosine function computes the similarity (or, equivalently, the distance) between observations in high-dimensional space.

For example, imagine a list of fruits, each described by a set of attributes such as color, shape, and taste.

If two sets of attributes are identical, we'd like our similarity function to return one (1). If two sets of attributes are completely different, we'd like the function to return zero (0). If there is significant overlap between the two sets, we'd like the function to return a real number close to one (e.g., 0.95); conversely, the more dissimilar the two sets are, the closer the result should be to zero (e.g., 0.05). Fortunately, the positive values of the cosine function exhibit exactly this behavior:

Cosine(0)   = 1.0   # 0 degrees angle
Cosine(π/4) ≈ 0.7   # 45 degrees angle
Cosine(π/2) = 0.0   # 90 degrees angle

In physics we represent a point in an N-dimensional space with a vector. For example, a seat in a football stadium could be described by its position relative to the center of the stadium floor (three spatial coordinates). If we add time as a fourth dimension, we could represent the occupancy of that seat across different games in a season. Similarly, we can model our set-similarity problem in an N-dimensional space where each dimension represents an attribute in our dataset. Since N can be very large, we typically call this model a high-dimensional vector space. Every observation can be represented as a point (and therefore a vector) in this space.

If we want to compare how similar two items are, we first represent each object or entity as a vector in N-dimensional space, then calculate the cosine of the angle between those two vectors. Fortunately, this cosine value is easy to compute: it is the dot product of the two vectors divided by the product of their magnitudes.
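
In symbols, for vectors A and B with components Aᵢ and Bᵢ:

Cosine(θ) = (A · B) / (|A| |B|) = Σ AᵢBᵢ / ( √(Σ Aᵢ²) × √(Σ Bᵢ²) )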

In vector space models, if vector A is the same as vector B, the angle θ between them is zero, and thus Similarity(A, B) with B = A yields a value of one, i.e., the two items are identical (maximally similar).

Here's a snippet of Python code that computes the Cosine(θ) of two vectors:

import math
from collections import Counter


def cosine_similarity(v1, v2):
    # union of all keys (dimensions) present in either vector
    terms = set(v1).union(v2)
    # dot product: sum of element-wise products over all dimensions
    dotprod = sum(v1.get(k, 0) * v2.get(k, 0) for k in terms)
    # magnitudes (Euclidean norms) of the two vectors
    magA = math.sqrt(sum(v1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(v2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

# ref: Quora

Let's try this Cosine Similarity function to see how it behaves:

# create 5 lists of items, as Counters of their characters
A = Counter(list("abcd"))
B = Counter(list("ab"))
C = Counter(list("abcde"))    # very similar to A
D = Counter(list("abcd"))     # identical to A
E = Counter(list("xyz"))      # very different from A

A, B, C, D, E

#Output:
(Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1}),
 Counter({'a': 1, 'b': 1}),
 Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1}),
 Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1}),
 Counter({'x': 1, 'y': 1, 'z': 1}))


# now let's run the similarity comparisons
cosine_similarity(A,B), \
cosine_similarity(B,C), \
cosine_similarity(A,C), \
cosine_similarity(A,D), \
cosine_similarity(A,E)

# output
(0.7071067811865475, 0.6324555320336759, 0.8944271909999159, 1.0, 0.0)

# therefore, as we expect:
Sim(A,B) => 0.71    # some similarity between A & B
Sim(B,C) => 0.63    # some similarity between B & C
Sim(A,C) => 0.89    # very similar to A
Sim(A,D) => 1.0     # identical to A
Sim(A,E) => 0.0     # completely different from A

Cosine Distance

Incidentally, Cosine Distance is a measure of the distance between two points in high-dimensional space, defined as 1 - Similarity(A, B). Therefore the Cosine Distance also ranges from 0 to 1 (for vectors with non-negative components, as here). If the Cosine Distance is zero (0), the items are identical; if it is one (1), the items are completely different.

Cosine_Distance(A, B) = 1 - Cosine_Similarity(A, B)
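
Since cosine_similarity is already defined above, a minimal helper makes this concrete:

def cosine_distance(v1, v2):
    # distance is simply one minus the similarity
    return 1 - cosine_similarity(v1, v2)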

# following the previous example

cosine_distance(A,B) => 0.29    # somewhat close to A
cosine_distance(B,C) => 0.37    # somewhat close to C
cosine_distance(A,C) => 0.11    # very close to A
cosine_distance(A,D) => 0.0     # identical to A
cosine_distance(A,E) => 1.0     # completely different from A

It is remarkable that a simple trigonometric function can help us compute the similarity of sets.

Let's now put this function to good use by classifying animals with the Zoo dataset from UC Irvine.

Similarity Determination using the Zoo Animals Dataset

The zoo animals dataset from UC Irvine consists of the following data, which we will use to find similar items using Cosine Similarity.

Animals / Zoo Dataset Information: 

A database containing 101 instances of 17 Boolean-valued attributes. 
The "type" attribute appears to be the class attribute. 

Here is a breakdown of the animal classes:

    1 (41) aardvark, antelope, bear, boar, buffalo, calf,
           cavy, cheetah, deer, dolphin, elephant,
           fruitbat, giraffe, girl, goat, gorilla, hamster,
           hare, leopard, lion, lynx, mink, mole, mongoose,
           opossum, oryx, platypus, polecat, pony,
           porpoise, puma, pussycat, raccoon, reindeer,
           seal, sealion, squirrel, vampire, vole, wallaby, wolf
    2 (20) chicken, crow, dove, duck, flamingo, gull, hawk,
           kiwi, lark, ostrich, parakeet, penguin, pheasant,
           rhea, skimmer, skua, sparrow, swan, vulture, wren
    3 (5)  pitviper, seasnake, slowworm, tortoise, tuatara
    4 (13) bass, carp, catfish, chub, dogfish, haddock,
           herring, pike, piranha, seahorse, sole, stingray, tuna
    5 (4)  frog, frog, newt, toad
    6 (8)  flea, gnat, honeybee, housefly, ladybird, moth,
           termite, wasp
    7 (10) clam, crab, crayfish, lobster, octopus,
           scorpion, seawasp, slug, starfish, worm

Attribute Information (name of attribute and type of value domain):

 1. animal name   Unique for each instance
 2. hair          Boolean
 3. feathers      Boolean
 4. eggs          Boolean
 5. milk          Boolean
 6. airborne      Boolean
 7. aquatic       Boolean
 8. predator      Boolean
 9. toothed       Boolean
10. backbone      Boolean
11. breathes      Boolean
12. venomous      Boolean
13. fins          Boolean
14. legs          Numeric (set of values: {0,2,4,5,6,8})
15. tail          Boolean
16. domestic      Boolean
17. catsize       Boolean
18. type          Numeric (integer values in range [1,7])
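
For illustration, a data row in the CSV looks like this (the aardvark entry from the UCI data, with columns in the order listed above):

aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1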


Let's import the zoo data file and review it:

import pandas as pd
df = pd.read_csv("./data/zoo.csv")


# let's review the first 15 records
df.head(15)
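
Note: the column names used below (animal_name, hair, ..., class_type) follow the commonly distributed CSV version of the UCI zoo data; if your copy uses different headers, adjust accordingly.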

We will use the Cosine Similarity function to do the following:

  1. Randomly select a target animal from the animal list
  2. Compute the similarity score of the animal selected in step 1 against all the other animals
  3. Review the top 20 records of the similarity score list for the target animal, sorted in descending order. This should give us the most similar animals first and the least similar animals last.
  4. Determine the class of the target animal using KNN (K-Nearest Neighbors) with K=7, i.e., have the 7 closest neighbors vote on what the target animal's class should be
  5. Repeat the process a few times for different target animals

Here is the full implementation:

from IPython.display import display, HTML

import os
import math
import pandas as pd
import random as rd
import urllib.parse
import operator
from collections import Counter
from IPython.display import Image
from sklearn.metrics import confusion_matrix
import numpy as np

def show_animal(animal, caption="", path="./data/animals", border=False):
    # pick an image file from the animal's directory
    fname = os.listdir(path + "/" + animal)[1:2][0]
    fpath = path + "/" + animal + "/" + fname
    borderstyle = ("border:solid red 2px;" if border else "")
    out = "<figure style=\"float:left;padding-left:10px;text-align:center;margin-top:10px;height:150px;\" >"
    out += "<figcaption>" + caption + "</figcaption>"
    out += "<img width=90 style=\"" + borderstyle + "\" src=\"./data/animals/" + animal + "/" + urllib.parse.quote(fname) + "\"></img></figure>"
    return out
    
    
def cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

# Uses KNN (K-Nearest Neighbors) to determine the target animal's class from
# its K nearest neighbors (e.g., if animal X's nearest neighbors are mostly
# insects, we classify X as an insect)
def class_vote(neighbors=["animal:flea", "animal:honeybee", "animal:honeybee",
                          "animal:honeybee", "animal:tortoise", "animal:sealion",
                          "animal:catfish"]):

    # look up each neighbor's actual class by its animal name
    votes = [df[df.animal_name==x.split(":")[1]].class_type.values[0] for x in neighbors]
    
    counter = Counter(votes)
    total = len(neighbors)
    z = { i : round(y/total,2) for i,y in counter.items() }
    sorted_z = sorted(z.items(), key=operator.itemgetter(1), reverse=True)

    # normalize so the top class has a score of 1.0
    topvalue = sorted_z[0][1]
    z = { i : round(y/topvalue,2) for i,y in sorted_z }

    return z
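
# For example, once df is loaded below, passing three insect neighbors
# should produce a unanimous vote for class 6:
#   class_vote(["animal:flea", "animal:gnat", "animal:wasp"])  # => {6: 1.0}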

def classify_animal(target, zoo_vector, debug=False):
    
    target_vector = target[1]
    
    # compute similarity of the target against all other animals
    sim = [(x[0], cosine_similarity(target_vector, x[1])) for x in zoo_vector]
    sim.sort(key=lambda x: x[1], reverse=True)

    outdf = pd.DataFrame(sim, columns=['name','score'])
    if (debug):
        display(outdf.head(10))

    # find the K=7 nearest neighbors (row 0 is the target itself, so skip it)
    # and determine the vote
    topK = outdf.iloc[1:8].name.values.tolist()
    votes = class_vote(topK)
    #display(topK)

    if (debug):
        html = ""
        i = 0
        for x, score in outdf.values.tolist()[:20]:
            name = x.split(":")[1]
            caption = name + ": " + str(score)
            html += show_animal(name, caption=caption, border=(i == 0))
            i += 1

        display(HTML("<div>" + html + "</div>"))
        
    return (outdf.name[0], votes, target[2])


# main ------------------------------------------------------

df = pd.read_csv("./data/zoo.csv")

animals = [ ("animal:"+x[0], \
             [df.columns[i+1] + ":" + str(y) for i,y in enumerate(x[1:-1])],\
             x[-1:][0] ) \
           for x in df.iloc[:,:].values.tolist()]

animals2 = [ (x[0], Counter(x[1]), {'class_type':x[2]}) for x in animals]
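
# To make the encoding concrete, an entry of animals2 looks roughly like this
# (sketched from the aardvark row; exact tokens depend on the CSV):
#   ('animal:aardvark',
#    Counter({'hair:1': 1, 'feathers:0': 1, 'eggs:0': 1, 'milk:1': 1, ...}),
#    {'class_type': 1})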

# randomly select target animals and classify each one
predicts = []
for j in range(1):
    total = 5   # classify 5 random animals per run
    correct = 0
    outcome = []
    for i in range(total):
        choice = int(rd.random()*df.shape[0])
        targetAnimal = animals2[choice]
        name, predicted, actual = classify_animal(targetAnimal, animals2, True)
        pred_class = list(predicted.keys())[0]
        act_class = actual['class_type']
        outcome.append([act_class, pred_class])
        if (pred_class == act_class):  correct += 1
        print("classify >>>>>>>>>>>>>", name, pred_class, act_class)
    #print("Evaluation >>>>>>>>>>> ", correct/total)
    predicts.append(correct/total)

print("predicts", predicts)

results = pd.DataFrame(predicts,columns=['score'])
results

Here are the results

Result 1:
classify >>>>>>>>>>>>> animal:octopus 7 7
Result 2: 
classify >>>>>>>>>>>>> animal:honeybee 6 6
Result 3: 
classify >>>>>>>>>>>>> animal:newt 5 5
Result 4: 
classify >>>>>>>>>>>>> animal:chicken 2 2
Result 5: 
classify >>>>>>>>>>>>> animal:hamster 1 1

K-Nearest Neighbor Classification

In order to classify the target animal X, we select the K most similar animals (i.e., those with the highest similarity scores, or equivalently the lowest cosine distances) and have each of these neighbors vote on the predicted class of X. If all of the nearest neighbors are mammals, then by majority vote target animal X is also classified as a mammal.

In the classic KNN illustration, the class of a target X (red) would be Blue, since the majority of its k=5 nearest neighbors are Blue (3 Blues and 2 Yellows).
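
As a minimal sketch of the majority vote itself (the neighbor classes below are made-up values for illustration):

from collections import Counter

neighbor_classes = [1, 1, 2, 1, 4]   # classes of the k=5 nearest neighbors
# most_common(1) returns the (class, count) pair with the highest count
predicted = Counter(neighbor_classes).most_common(1)[0][0]
print(predicted)   # => 1, the majority class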

Model Evaluation

How well did our model perform in predicting the correct classification using Cosine Similarity and KNN? To evaluate the performance, I generated 100 random target animals, predicted the class of each, and compared the prediction to the actual label.

Here is the outcome, presented as a confusion matrix. We have a total of 7 classes in the zoo animals dataset.
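
Such a matrix can be computed from the [actual, predicted] pairs collected in the outcome list of the main loop above; a minimal sketch using scikit-learn (which the script already imports):

import numpy as np
from sklearn.metrics import confusion_matrix

actual = [row[0] for row in outcome]
predicted = [row[1] for row in outcome]
cm = confusion_matrix(actual, predicted, labels=list(range(1, 8)))

# normalize each row by its row sum to get per-class scores
cm_norm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
print(np.round(cm_norm, 2))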

Classes 1, 2, 4, 5 & 6 performed well (Normalized Score of 1)

Class 3 performed the worst (0.17 score), since there are only 5 samples of that class in the source data (pitviper, seasnake, slowworm, tortoise, tuatara). We see that class 3 most often gets misclassified as class 5 (frog, frog, newt, toad). Class 3 animals are reptiles and class 5 animals are amphibians, and the two groups share many attributes, so that's probably why the model gets "confused". I also noticed that the source animal data is not entirely "clean" (e.g., two samples of frog with slightly different attributes).

(See the full class breakdown in the dataset section above.)

Conclusion

We have seen that classification of large sets can be accomplished by combining the cosine function, which measures the distance between observations in high-dimensional space, with K-Nearest Neighbors for classification. KNN works well when a class has a large number of samples in the data and does not perform as well when a particular class is small.


Michael Lin

https://www.dhirubhai.net/in/michaelclin/
Hakima Massioun
Studied at Université Abderrahmane Mira de Béjaïa
6 months ago

Can someone here help me solve a classification problem based on similarity computation?

Mário de Sá Vera
DD&IT Data Products | Assoc Director at Novartis
3 years ago

Very clear exposition! Just a note that some libraries take the less obvious approach and already define the distance as [1 - similarity], so you don't need to apply that step in your code (e.g. https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html) ...
