Cosine Similarity & Classification
Michael Lin
Solution Engineer @ PingCap | Dun & Bradstreet | Intel | Microsoft
Set similarity is a difficult problem to solve using traditional rule-based programming, and it only gets harder as the number of attributes in the dataset grows. Unsurprisingly, math comes to the rescue. Using the Cosine function and the K-Nearest Neighbor algorithm, we can determine how similar or different two sets of items are and use that to classify them. The Cosine function is used to calculate the similarity (or the distance) between observations in high-dimensional space.
For example, imagine a list of fruits, each described by attributes such as color, shape, and taste.
If two sets of attributes are identical, we'd like our similarity function to return one (1). If two sets of attributes are completely different, we'd like the function to return zero (0). If there is significant overlap between the two sets, the function should return a real number close to one (e.g., 0.95); conversely, the more dissimilar the two sets are, the closer the value should be to zero (e.g., 0.05). Fortunately, the positive values of cosine exhibit exactly this behavior:
Cosine(0) = 1.0 # angle of 0 degrees
Cosine(π/4) ≈ 0.7 # angle of 45 degrees
Cosine(π/2) = 0.0 # angle of 90 degrees
In physics we represent a point in an N-dimensional space using a vector. For example, a seat at a football stadium could be described by its distance from the center point of the stadium floor. If we add time, we have a fourth dimension that could represent the occupancy of that seat across different games in a season. Similarly, we can model our set-similarity problem in an N-dimensional space where each dimension represents an attribute in our dataset. Since N can be very large, we typically call this model a high-dimensional vector space. Every single observation can be represented as a point (and therefore, a vector) in this space.
If we want to compare how similar two items are, we represent each object or entity as a vector in N dimensional space first, then we calculate the Cosine value of the angle between those two vectors. Fortunately, this cosine value can be easily computed as the dot product of the two vectors divided by the product of their magnitudes.
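In symbols, for two vectors A and B:

Cosine(θ) = (A · B) / (‖A‖ × ‖B‖) = Σ AᵢBᵢ / (√(Σ Aᵢ²) × √(Σ Bᵢ²))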
In vector space models, if vector A is the same as vector B, the angle θ between them is zero, and thus Similarity(A, B) with B = A returns a value of one; i.e., the two are identical (maximally similar).
Here's a snippet of Python code that computes the Cosine(θ) of two vectors:
import math
from collections import Counter

def cosine_similarity(v1, v2):
    # union of all dimensions (keys) that appear in either vector
    terms = set(v1).union(v2)
    # dot product across the shared dimension space
    dotprod = sum(v1.get(k, 0) * v2.get(k, 0) for k in terms)
    # magnitude of each vector
    magA = math.sqrt(sum(v1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(v2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)
#ref: quora
Let's try this Cosine Similarity function to see how it behaves:
# create 5 lists of items
A = Counter(list("abcd"))
B = Counter(list("ab"))
C = Counter(list("abcde")) # very similar to A
D = Counter(list("abcd")) # identical to A
E = Counter(list("xyz")) # very different from A
A, B, C, D, E
#Output:
(Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1}),
Counter({'a': 1, 'b': 1}),
Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1, 'e': 1}),
Counter({'a': 1, 'b': 1, 'c': 1, 'd': 1}),
Counter({'x': 1, 'y': 1, 'z': 1}))
# now let's run the similarity comparisons
cosine_similarity(A,B), \
cosine_similarity(B,C), \
cosine_similarity(A,C), \
cosine_similarity(A,D), \
cosine_similarity(A,E)
# output
(0.7071067811865475, 0.6324555320336759, 0.8944271909999159, 1.0, 0.0)
# therefore, as we expect:
Sim(A,B) => 0.71 # some similarity to A
Sim(B,C) => 0.63 # some similarity between B & C
Sim(A,C) => 0.89 # very similar to A
Sim(A,D) => 1.0 # identical to A
Sim(A,E) => 0.0 # completely different from A
Cosine Distance
Incidentally, Cosine Distance measures the distance between two points in high-dimensional space, and is defined as 1 - Similarity(A, B). The Cosine Distance therefore also ranges from 0 to 1. If the Cosine Distance is zero (0), the items are identical. If the Cosine Distance is one (1), the items are completely different.
Cosine_Distance (A, B) = 1 - Cosine_Similarity (A, B)
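The examples below call a cosine_distance helper, which is not defined in the earlier snippet; a minimal sketch built directly on the cosine_similarity function from above:

def cosine_distance(v1, v2):
    # cosine distance is simply one minus the cosine similarity
    return 1 - cosine_similarity(v1, v2)

Note that some libraries, such as scipy.spatial.distance.cosine, already return this 1 - similarity value directly, so no extra subtraction is needed there.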
# following the previous example
cosine_distance(A,B) => 0.29 # some similarity to A
cosine_distance(B,C) => 0.37 # some similarity between B & C
cosine_distance(A,C) => 0.11 # very similar to A
cosine_distance(A,D) => 0.0 # identical to A
cosine_distance(A,E) => 1.0 # completely different from A
It is remarkable that a simple trigonometric function can help us compute the similarity of sets.
Let's now put this function to good use by classifying animals using the Animals Dataset from UC Irvine.
Similarity Determination using the Zoo Animals Dataset
The animals dataset from UC Irvine consists of the following, which we will use to find similar items using Cosine Similarity.
Animals / Zoo Dataset Information:
A database containing 101 instances of 17 Boolean-valued attributes.
The "type" attribute appears to be the class attribute.
Here is a breakdown of the animal classes:
1 (41) aardvark, antelope, bear, boar, buffalo, calf,
cavy, cheetah, deer, dolphin, elephant,
fruitbat, giraffe, girl, goat, gorilla, hamster,
hare, leopard, lion, lynx, mink, mole, mongoose,
opossum, oryx, platypus, polecat, pony,
porpoise, puma, pussycat, raccoon, reindeer,
seal, sealion, squirrel, vampire, vole, wallaby,wolf
2 (20) chicken, crow, dove, duck, flamingo, gull, hawk,
kiwi, lark, ostrich, parakeet, penguin, pheasant,
rhea, skimmer, skua, sparrow, swan, vulture, wren
3 (5) pitviper, seasnake, slowworm, tortoise, tuatara
4 (13) bass, carp, catfish, chub, dogfish, haddock,
herring, pike, piranha, seahorse, sole, stingray, tuna
5 (4) frog, frog, newt, toad
6 (8) flea, gnat, honeybee, housefly, ladybird, moth,
termite, wasp
7 (10) clam, crab, crayfish, lobster, octopus,
scorpion, seawasp, slug, starfish, worm
Attribute Information: (name of attribute and type of value domain)
1. animal name: Unique for each instance
2. hair Boolean
3. feathers Boolean
4. eggs Boolean
5. milk Boolean
6. airborne Boolean
7. aquatic Boolean
8. predator Boolean
9. toothed Boolean
10. backbone Boolean
11. breathes Boolean
12. venomous Boolean
13. fins Boolean
14. legs Numeric (set of values: {0,2,4,5,6,8})
15. tail Boolean
16. domestic Boolean
17. catsize Boolean
18. type Numeric (integer values in range [1,7])
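Before running the comparisons, each animal needs to be turned into a vector. One simple approach, and the one used in the script below, is to encode each row as a bag of "attribute:value" tokens, so that two animals share a dimension only when they agree on an attribute's value. A minimal sketch, using a hypothetical hamster row with just a few of the attributes:

from collections import Counter

# hypothetical subset of one animal's attributes
row = {"hair": 1, "feathers": 0, "eggs": 0, "milk": 1, "legs": 4, "tail": 1}
vector = Counter(attr + ":" + str(val) for attr, val in row.items())
# Counter({'hair:1': 1, 'feathers:0': 1, 'eggs:0': 1,
#          'milk:1': 1, 'legs:4': 1, 'tail:1': 1})

These Counter vectors plug straight into the cosine_similarity function defined earlier.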
Let's import the zoo data file and review it:
import pandas as pd
df = pd.read_csv("zoo.csv")
# let's review the first 15 records
df.head(15)
We will use the Cosine Similarity function to do the following:
- Randomly select a target animal from the animal list
- Compute the similarity score of the selected animal in step 1 against all the other animals
- Review the top 20 records from the similarity score list, sorted in descending order. This should give us the most similar animals first and the least similar animals last.
- Determine the class of the target animal using KNN (K-Nearest Neighbor) with K=7, i.e., have the 7 closest neighbors vote on what the target animal's class should be.
- Repeat the process a few times for different target animals
from IPython.display import display, HTML
import os
import math
import pandas as pd
import random as rd
import urllib.parse
import operator
from collections import Counter
from IPython.display import Image
from sklearn.metrics import confusion_matrix
import numpy as np
def show_animal(animal, caption="", path="./data/animals", border=False):
    # pick an image file from the animal's directory
    fname = os.listdir(path + "/" + animal)[1:2][0]
    fpath = path + "/" + animal + "/" + fname
    borderstyle = ("border:solid red 2px;" if border else "")
    out = "<figure style=\"float:left;padding-left:10px;text-align:center;margin-top:10px;height:150px;\" >"
    out += "<figcaption>" + caption + "</figcaption>"
    out += "<img width=90 style=\""+borderstyle+"\" src=\""+"./data/animals/"+animal+"/"+urllib.parse.quote(fname)+"\"></img></figure>"
    return out
def cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)
# Uses KNN (K-Nearest Neighbor) to determine the target animal's class based on
# its N nearest neighbors (i.e., if animal X's 5 nearest neighbors
# are all insects, we will classify X as an insect)
def class_vote(neighbors=["animal:flea", "animal:honeybee", "animal:honeybee",
                          "animal:honeybee", "animal:tortoise", "animal:sealion",
                          "animal:catfish"]):
    # look up the class label of each neighbor in the dataframe
    votes = [df[df.animal_name==x.split(":")[1]].class_type.values[0] for x in neighbors]
    counter = Counter(votes)
    total = len(neighbors)
    # fraction of the votes each class received
    z = { i : round(y/total,2) for i,y in counter.items() }
    sorted_z = sorted(z.items(), key=operator.itemgetter(1), reverse=True)
    # normalize so the winning class scores 1.0
    topvalue = sorted_z[0][1]
    z = { i : round(y/topvalue,2) for i,y in sorted_z }
    return z
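# e.g., with the default neighbor list above (and the zoo dataframe df loaded),
# class_vote() tallies the votes [6, 6, 6, 6, 3, 1, 4] and returns
# {6: 1.0, 3: 0.25, 1: 0.25, 4: 0.25} -- class 6 (insects) wins the vote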
def classify_animal(target, zoo_vector, debug=False):
    target_vector = target[1]
    # compute similarity of the target against all other animals
    sim = [( x[0], cosine_similarity(target_vector, x[1])) for x in zoo_vector]
    sim.sort(key=lambda x: x[1], reverse=True)
    outdf = pd.DataFrame(sim, columns=['name','score'])
    if (debug):
        display(outdf.head(10))
    # skip row 0 (the target itself) and take the next 7 neighbors for the vote
    topK = outdf.iloc[1:8].name.values.tolist()
    votes = class_vote(topK)
    #display(topK)
    if (debug):
        html = ""
        i = 0
        for x, score in outdf.values.tolist()[:20]:
            name = x.split(":")[1]
            caption = name + ": " + str(score)
            html += show_animal(name, caption=caption, border=(i == 0))
            i += 1
        display(HTML("<div>" + html + "</div>"))
    return (outdf.name[0], votes, target[2])
# main ------------------------------------------------------
df = pd.read_csv("./data/zoo.csv")
# encode each animal as ("animal:name", [list of "attribute:value" tokens], class_type)
animals = [ ("animal:" + x[0],
             [df.columns[i+1] + ":" + str(y) for i, y in enumerate(x[1:-1])],
             x[-1:][0])
            for x in df.iloc[:,:].values.tolist()]
# bag each token list into a Counter vector, keeping the class label alongside
animals2 = [ (x[0], Counter(x[1]), {'class_type': x[2]}) for x in animals]
# randomly select a target animal
predicts = []
for j in range(1):
    total = 5 # repeat 5 times
    correct = 0
    outcome = []
    for i in range(total):
        choice = int(rd.random() * df.shape[0])
        targetAnimal = animals2[choice]
        name, predicted, actual = classify_animal(targetAnimal, animals2, True)
        pred_class = list(predicted.keys())[0]  # top-voted class
        act_class = actual['class_type']
        outcome.append([act_class, pred_class])
        if (pred_class == act_class): correct += 1
        print("classify >>>>>>>>>>>>>", name, pred_class, act_class)
    #print("Evaluation >>>>>>>>>>> ", correct/total)
    predicts.append(correct/total)
print("predicts", predicts)
results = pd.DataFrame(predicts, columns=['score'])
results
Here are the results (animal, predicted class, actual class):
Result 1:
classify >>>>>>>>>>>>> animal:octopus 7 7
Result 2:
classify >>>>>>>>>>>>> animal:honeybee 6 6
Result 3:
classify >>>>>>>>>>>>> animal:newt 5 5
Result 4:
classify >>>>>>>>>>>>> animal:chicken 2 2
Result 5:
classify >>>>>>>>>>>>> animal:hamster 1 1
K-Nearest Neighbor Classification
In order to classify the target animal X, we select the K most similar animals (i.e., highest similarity scores, or lowest cosine distances) and have each of these neighbors vote on the target animal's class. If all the nearest neighbors are mammals, then the majority vote means that target animal X is also a mammal.
In the above figure, the class of the target X (red) should be Blue, since the majority of its k=5 nearest neighbors are Blue (3 Blues & 2 Yellows).
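The voting step itself is just a majority count. A minimal sketch with hypothetical class labels:

from collections import Counter

# hypothetical classes of the K nearest neighbors
neighbor_classes = ["mammal", "mammal", "bird", "mammal", "fish"]
winner, count = Counter(neighbor_classes).most_common(1)[0]
print(winner, count)  # mammal 3

In the script above, class_vote plays this role, returning normalized vote shares for every class rather than just the winner.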
Model Evaluation
How well did our model perform in predicting the correct classification using Cosine Similarity and KNN? To evaluate performance, I generated 100 random target animals, predicted a class for each, and compared the predictions to the actual labels.
Here is the outcome presented in a Confusion Matrix. We have a total of 7 classes in the Zoo Animals dataset.
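The script above imports confusion_matrix from scikit-learn but the listing doesn't show the evaluation step; a minimal sketch of how such a matrix can be built from the [actual, predicted] pairs collected in the outcome list (assuming total is raised to 100 for the evaluation run):

from sklearn.metrics import confusion_matrix
import numpy as np

actual = [a for a, p in outcome]
predicted = [p for a, p in outcome]
cm = confusion_matrix(actual, predicted, labels=list(range(1, 8)))
# normalize each row so the diagonal shows per-class accuracy
cm_norm = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
print(np.round(cm_norm, 2))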
Classes 1, 2, 4, 5 & 6 performed well (normalized score of 1).
Class 3 performed the worst (0.17 score), since there are only 5 samples in the source data (pitviper, seasnake, slowworm, tortoise, tuatara) for that particular class. We see that class 3 is most frequently misclassified as class 5 (frog, frog, newt, toad). Class 3 (reptiles) and class 5 (amphibians) share many attributes, so that is probably why the model is getting "confused". I also noticed that the source data is not entirely clean (e.g., two samples of frog with a slight difference in attributes).
Conclusion
We have seen that classification of large sets can be accomplished by combining the Cosine function, to calculate the distance between observations in high-dimensional space, with K-Nearest Neighbors to classify. KNN works well when a class has a large number of samples and does not perform as well when a particular class is small.