String Similarity/ Distance Algorithms

In this article I want to write about String Similarity and Distance algorithms, Their Usages and Java Implementations. These kind of algorithms have vast area of usage and application. for example you can use them to implement search engines in order to find similar keywords, you can use them for spell checking or Correcting writing typo. for Example Jaro-Winkler Algorithm is useful for spell Checking and name/family typo checking. If you are java developer like me hopefully there is a project called java-string-similarity in git which you can clone it and finding/learning how it works and adopting it to your possible scenarios also you can hack through the classes and methods to see how algorithms have been implemented.

one of the most simple string distance algorithms is Levenshtein distance algorithm which you can find it in java-string-similarity maven repository.

This is the Formula of Levenshtein distance algorithm. further in this article I show you some code example in order to see how Levenshtein distance and other algorithms work and what is the output if you compare some strings using them. Here I quote from Wikipedia: In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Following is the Implementation of Levenshtein Distance Algorithm:

public final double distance(String s1, String s2) {
    if(s1.equals(s2)) {
        return 0.0D;
    } else if(s1.length() == 0) {
        return (double)s2.length();
    } else if(s2.length() == 0) {
        return (double)s1.length();
    } else {
        int[] v0 = new int[s2.length() + 1];
        int[] v1 = new int[s2.length() + 1];

        int i;
        for(i = 0; i < v0.length; v0[i] = i++) {
            ;
        }

        for(i = 0; i < s1.length(); ++i) {
            v1[0] = i + 1;

            for(int j = 0; j < s2.length(); ++j) {
                byte cost = 1;
                if(s1.charAt(i) == s2.charAt(j)) {
                    cost = 0;
                }

                v1[j + 1] = Math.min(v1[j] + 1, Math.min(v0[j + 1] + 1, v0[j] + cost));
            }

            int[] vtemp = v0;
            v0 = v1;
            v1 = vtemp;
        }

        return (double)v0[s2.length()];
    }
}

Some of the String Similarity/ Distance algorithms are base on n-gram also called Shingles. here is the definition of these kind of algorithms I found in Github: tdebatty/java-string-similarity:

A few algorithms work by converting strings into sets of n-grams (sequences of n characters, also sometimes called k-shingles). The similarity or distance between the strings is then the similarity or distance between the sets.

Some of them, like Jaccard, consider strings as sets of shingles, and don't consider the number of occurrences of each shingle. Others, like cosine similarity, work using what is sometimes called the profile of the strings, which takes into account the number of occurrences of each shingle.

For these algorithms, another use case is possible when dealing with large data sets: 1. compute the set or profile representation of all the strings 2. compute the similarity between sets or profiles.

In order to use Java-String-Similarity add the following dependency using Maven or Gradle to your project:

<dependency>
    <groupId>info.debatty</groupId>
    <artifactId>java-string-similarity</artifactId>
    <version>RELEASE</version>
</dependency>

After adding it you will find Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity and etc. all in one place and ready to be used easily. Also there are various implementation of same algorithms in the package as the original one or normalized one.

You should take into consideration that difference between these algorithms is not related to their time/ memory Complexity (Big O). most of them have been devised in for solving special kind of problems/ Scenarios that won't be recommended to use in other scenarios.

I used Jaro-Winkler in one of my projects for correcting mistakes of the user when inputting some paths or files. I found that most of the time user Input path of files wrong but very similar to the real path. I used Jaro-winkler and a Configurable Threshold that can be tweaked in config files. when user input a filename wrong I calculate the Jaro-Winkler distance of files in the same path and Sort them by the lowest distance and if the lowest distance was equal or lesser than threshold I recommend that filename/path to the user as the possible right one I guess she wanted to input it. Please notice that orange squares are only considered for distance calculation because other squares in the example picture are more far to each other base on a value you can find it in Wikipedia. following is the description of the Jaro-Winkler Algorithm:

Jaro-Winkler is a string edit distance that was developed in the area of record linkage (duplicate detection) (Winkler, 1990). The Jaro–Winkler distance metric is designed and best suited for short strings such as person names, and to detect typos. Jaro-Winkler computes the similarity between 2 strings, and the returned value lies in the interval [0.0, 1.0]. It is (roughly) a variation of Damerau-Levenshtein, where the substitution of 2 close characters is considered less important then the substitution of 2 characters that a far from each other. The distance is computed as 1 - Jaro-Winkler similarity. Please see wikipedia in order to see some example of how Jaro-Winkler can be calculated.

following is the code snippet to show you how Jaro-Winkler algorithms looks like:

System.out.println("********************Jaro Winkler***********************");
JaroWinkler jr = new JaroWinkler();

System.out.println("jr dist : " + jr.distance("Touraj001.mp4" , "Touraj001.mp4"));
System.out.println("jr dist2 : " + jr.distance("Touraj001.mp4" , "Touraj 001.mp4"));
System.out.println("jr dist2 : " + jr.distance("Touraj001.mp4" , "Touraj001.mp5"));
System.out.println("jr dist2 : " + jr.distance("Touraj001.mp4" , "Touraj 001.mp5"));
System.out.println("jr dist2 : " + jr.distance("Touraj001.mp4" , "Touraj 001.avi"));
System.out.println("jr dist2 : " + jr.distance("Mostafa001.mp4" , "Touraj 001.avi"));

the output is:

********************Jaro Winkler***********************
jr dist : 0.0
jr dist2 : 0.01360542433602474
jr dist2 : 0.003944777525388243
jr dist2 : 0.041862896510532877
jr dist2 : 0.09837780679975239
jr dist2 : 0.4285714030265808

Note: Lower value in output means that two strings are more similar to each other. instead of distance method you can use similarity method like this:

System.out.println("********************Jaro Winkler Similarity***********************");
JaroWinkler jr = new JaroWinkler();

System.out.println("jr dist : " + jr.similarity("Touraj001.mp4" , "Touraj001.mp4"));
System.out.println("jr dist2 : " + jr.similarity("Touraj001.mp4" , "Touraj 001.mp4"));
System.out.println("jr dist2 : " + jr.similarity("Touraj001.mp4" , "Touraj001.mp5"));
System.out.println("jr dist2 : " + jr.similarity("Touraj001.mp4" , "Touraj 001.mp5"));
System.out.println("jr dist2 : " + jr.similarity("Touraj001.mp4" , "Touraj 001.avi"));
System.out.println("jr dist2 : " + jr.similarity("Mostafa001.mp4" , "Touraj 001.avi"));

the output is :

********************Jaro Winkler Similarity***********************
jr dist : 1.0
jr dist2 : 0.9863945756639753
jr dist2 : 0.9960552224746118
jr dist2 : 0.9581371034894671
jr dist2 : 0.9016221932002476
jr dist2 : 0.5714285969734192

Note: This time higher value in output means more similarity between them and range is [0-1].

Following is the code snippet to get feeling of Levenshtein Algorithm:

System.out.println("****************************Levenshtein***************************************");

Levenshtein lv = new Levenshtein();

System.out.println("LV Dist : " + lv.distance("1234", "1234"));
System.out.println("LV Dist : " + lv.distance("1234", "1234abcd"));
System.out.println("LV Dist : " + lv.distance("Sittin", "Sitting"));
System.out.println("LV Dist : " + lv.distance("Sittin", "kittin"));
System.out.println("LV Dist : " + lv.distance("Sittin", "kitting"));

The output is:

****************************Levenshtein***************************************
LV Dist : 0.0
LV Dist : 4.0
LV Dist : 1.0
LV Dist : 1.0
LV Dist : 2.0

There are other algorithms that work base on Cosine Similarity and the formula of : cos(alpha) = V1 . V2 / (|V1| * |V2|). Please take a look at the following picture I captured from Wikipedia:

I think that's enough writing about string similarity. you can find more examples in Git and Wikipedia and dig into the bits and gears of these kind of algorithms, Their performance, implementation and Mathematics and Idea behind them.

Now for having fun listen to Rammstein:

Take care friends. I had a very bad day today! ... I am trying to get my concentration back again.

要查看或添加评论,请登录

Touraj Ebrahimi的更多文章

  • Kafka and Kafka Connect

    Kafka and Kafka Connect

    ?? ??? ????? ???? ??? ? ????? kafka ? Zookeeper ?? ????? ?? ?????? ? ?????? ?? ?????? Producer & Consumer ????? ????…

    5 条评论
  • Classic Strategy game Simulator using CountDownLatch and Multi-Threading in Java

    Classic Strategy game Simulator using CountDownLatch and Multi-Threading in Java

    In my previous article I wrote about what Is CyclicBarrier in Multi-Threading and I Discussed about it. Also In the…

    1 条评论
  • CyclicBarrier in Java Muti-Threading

    CyclicBarrier in Java Muti-Threading

    There are many examples for CyclicBarrier in many sites but I think that they are not good enough to show application…

  • OWASP

    OWASP

    In this Article I want to Write About OWASP. What is OWASP? Why we Should Implement its recommendations? and how to…

    4 条评论
  • ETCD

    ETCD

    In this article I want to discuss about the emerging world of In-Memory databases that are vastly used these days in…

    4 条评论
  • ActiveMQ persistence mode VS. Non-Persistence

    ActiveMQ persistence mode VS. Non-Persistence

    It this article I want to write about activemq and trade of between using persistence mode and non-persistence mode. As…

  • How to write Iterable Class in Java

    How to write Iterable Class in Java

    In this article I want to show how we can write Iterable Class in java. I am Fan of Designing Patterns and I try yo use…

  • AlloyUI

    AlloyUI

    In this Article I want to share my experience with other Developers about graceful AlloyUI framework that has been…

    2 条评论
  • Hashing (MD5 or SHA-1/2/3)

    Hashing (MD5 or SHA-1/2/3)

    In this Article I want to discuss about Hashing Algorithms and their Cons and Pros also Security issues and…

  • Game Design, Modeling and Programming

    Game Design, Modeling and Programming

    The process of Desiging and Creating A Game is very Hard and Time-Consuming. I Create, Design and Code Games All By…

    2 条评论

社区洞察

其他会员也浏览了