String Similarity/ Distance Algorithms
In this article I want to write about String Similarity and Distance algorithms, Their Usages and Java Implementations. These kind of algorithms have vast area of usage and application. for example you can use them to implement search engines in order to find similar keywords, you can use them for spell checking or Correcting writing typo. for Example Jaro-Winkler Algorithm is useful for spell Checking and name/family typo checking. If you are java developer like me hopefully there is a project called java-string-similarity in git which you can clone it and finding/learning how it works and adopting it to your possible scenarios also you can hack through the classes and methods to see how algorithms have been implemented.
one of the most simple string distance algorithms is Levenshtein distance algorithm which you can find it in java-string-similarity maven repository.
This is the Formula of Levenshtein distance algorithm. further in this article I show you some code example in order to see how Levenshtein distance and other algorithms work and what is the output if you compare some strings using them. Here I quote from Wikipedia: In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.
Following is the Implementation of Levenshtein Distance Algorithm:
public final double distance(String s1, String s2) {
if(s1.equals(s2)) {
return 0.0D;
} else if(s1.length() == 0) {
return (double)s2.length();
} else if(s2.length() == 0) {
return (double)s1.length();
} else {
int[] v0 = new int[s2.length() + 1];
int[] v1 = new int[s2.length() + 1];
int i;
for(i = 0; i < v0.length; v0[i] = i++) {
;
}
for(i = 0; i < s1.length(); ++i) {
v1[0] = i + 1;
for(int j = 0; j < s2.length(); ++j) {
byte cost = 1;
if(s1.charAt(i) == s2.charAt(j)) {
cost = 0;
}
v1[j + 1] = Math.min(v1[j] + 1, Math.min(v0[j + 1] + 1, v0[j] + cost));
}
int[] vtemp = v0;
v0 = v1;
v1 = vtemp;
}
return (double)v0[s2.length()];
}
}
Some of the String Similarity/ Distance algorithms are base on n-gram also called Shingles. here is the definition of these kind of algorithms I found in Github: tdebatty/java-string-similarity:
A few algorithms work by converting strings into sets of n-grams (sequences of n characters, also sometimes called k-shingles). The similarity or distance between the strings is then the similarity or distance between the sets.
Some of them, like Jaccard, consider strings as sets of shingles, and don't consider the number of occurrences of each shingle. Others, like cosine similarity, work using what is sometimes called the profile of the strings, which takes into account the number of occurrences of each shingle.
For these algorithms, another use case is possible when dealing with large data sets: 1. compute the set or profile representation of all the strings 2. compute the similarity between sets or profiles.
In order to use Java-String-Similarity add the following dependency using Maven or Gradle to your project:
<dependency>
<groupId>info.debatty</groupId>
<artifactId>java-string-similarity</artifactId>
<version>RELEASE</version>
</dependency>
After adding it you will find Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity and etc. all in one place and ready to be used easily. Also there are various implementation of same algorithms in the package as the original one or normalized one.
You should take into consideration that difference between these algorithms is not related to their time/ memory Complexity (Big O). most of them have been devised in for solving special kind of problems/ Scenarios that won't be recommended to use in other scenarios.
I used Jaro-Winkler in one of my projects for correcting mistakes of the user when inputting some paths or files. I found that most of the time user Input path of files wrong but very similar to the real path. I used Jaro-winkler and a Configurable Threshold that can be tweaked in config files. when user input a filename wrong I calculate the Jaro-Winkler distance of files in the same path and Sort them by the lowest distance and if the lowest distance was equal or lesser than threshold I recommend that filename/path to the user as the possible right one I guess she wanted to input it. Please notice that orange squares are only considered for distance calculation because other squares in the example picture are more far to each other base on a value you can find it in Wikipedia. following is the description of the Jaro-Winkler Algorithm:
Jaro-Winkler is a string edit distance that was developed in the area of record linkage (duplicate detection) (Winkler, 1990). The Jaro–Winkler distance metric is designed and best suited for short strings such as person names, and to detect typos. Jaro-Winkler computes the similarity between 2 strings, and the returned value lies in the interval [0.0, 1.0]. It is (roughly) a variation of Damerau-Levenshtein, where the substitution of 2 close characters is considered less important then the substitution of 2 characters that a far from each other. The distance is computed as 1 - Jaro-Winkler similarity. Please see wikipedia in order to see some example of how Jaro-Winkler can be calculated.
following is the code snippet to show you how Jaro-Winkler algorithms looks like:
System.out.println("********************Jaro Winkler***********************");
JaroWinkler jr = new JaroWinkler();
System.out.println("jr dist : " + jr.distance("Touraj001.mp4" , "Touraj001.mp4"));
System.out.println("jr dist2 : " + jr.distance("Touraj001.mp4" , "Touraj 001.mp4"));
System.out.println("jr dist2 : " + jr.distance("Touraj001.mp4" , "Touraj001.mp5"));
System.out.println("jr dist2 : " + jr.distance("Touraj001.mp4" , "Touraj 001.mp5"));
System.out.println("jr dist2 : " + jr.distance("Touraj001.mp4" , "Touraj 001.avi"));
System.out.println("jr dist2 : " + jr.distance("Mostafa001.mp4" , "Touraj 001.avi"));
the output is:
********************Jaro Winkler***********************
jr dist : 0.0
jr dist2 : 0.01360542433602474
jr dist2 : 0.003944777525388243
jr dist2 : 0.041862896510532877
jr dist2 : 0.09837780679975239
jr dist2 : 0.4285714030265808
Note: Lower value in output means that two strings are more similar to each other. instead of distance method you can use similarity method like this:
System.out.println("********************Jaro Winkler Similarity***********************");
JaroWinkler jr = new JaroWinkler();
System.out.println("jr dist : " + jr.similarity("Touraj001.mp4" , "Touraj001.mp4"));
System.out.println("jr dist2 : " + jr.similarity("Touraj001.mp4" , "Touraj 001.mp4"));
System.out.println("jr dist2 : " + jr.similarity("Touraj001.mp4" , "Touraj001.mp5"));
System.out.println("jr dist2 : " + jr.similarity("Touraj001.mp4" , "Touraj 001.mp5"));
System.out.println("jr dist2 : " + jr.similarity("Touraj001.mp4" , "Touraj 001.avi"));
System.out.println("jr dist2 : " + jr.similarity("Mostafa001.mp4" , "Touraj 001.avi"));
the output is :
********************Jaro Winkler Similarity***********************
jr dist : 1.0
jr dist2 : 0.9863945756639753
jr dist2 : 0.9960552224746118
jr dist2 : 0.9581371034894671
jr dist2 : 0.9016221932002476
jr dist2 : 0.5714285969734192
Note: This time higher value in output means more similarity between them and range is [0-1].
Following is the code snippet to get feeling of Levenshtein Algorithm:
System.out.println("****************************Levenshtein***************************************");
Levenshtein lv = new Levenshtein();
System.out.println("LV Dist : " + lv.distance("1234", "1234"));
System.out.println("LV Dist : " + lv.distance("1234", "1234abcd"));
System.out.println("LV Dist : " + lv.distance("Sittin", "Sitting"));
System.out.println("LV Dist : " + lv.distance("Sittin", "kittin"));
System.out.println("LV Dist : " + lv.distance("Sittin", "kitting"));
The output is:
****************************Levenshtein***************************************
LV Dist : 0.0
LV Dist : 4.0
LV Dist : 1.0
LV Dist : 1.0
LV Dist : 2.0
There are other algorithms that work base on Cosine Similarity and the formula of : cos(alpha) = V1 . V2 / (|V1| * |V2|). Please take a look at the following picture I captured from Wikipedia:
I think that's enough writing about string similarity. you can find more examples in Git and Wikipedia and dig into the bits and gears of these kind of algorithms, Their performance, implementation and Mathematics and Idea behind them.
Now for having fun listen to Rammstein:
Take care friends. I had a very bad day today! ... I am trying to get my concentration back again.