To Be Similar or Not to Be
by Dmytro Ustynov
That is the question.
Whether ‘tis nobler to check poems word by word
Among hundreds of thousands of words
And poems.
Or try to find a better way
And with reducing the size of search
Fetch the exact answer instantly?
The main idea of searching for similar texts is exactly what Hamlet said four hundred years ago – to try to reduce the dimension of what you are searching for.
Sorry, my daughter just said that Hamlet didn’t exactly talk about near similar search among a large corpus of texts, but about something else…
In any case, searching for near-similar documents is an important data-science task that comes up in search engines, plagiarism detection, deduplication of search results, and so on.
Near-similar search differs from another kind of search: full-text search. Full-text search means "find, among a set of texts, every text that contains string A", where string A is quite short. Near-similar search, on the other hand, asks "find, among a set of texts, every text that is similar to text A", where text A may be quite long.
So searching for similar text the same way we usually search full text is not an option: it would take far too much time (days instead of seconds).
The solution Hamlet… actually Hamming… proposed is to reduce the size of every text you have: make a 'fingerprint' of each, then just compare the fingerprints! Thank you, Hamlet (Hamming)!
And what can take something as input and significantly reduce its size in the output?
The hash function.
One of the most important features of a hash function is the constant length of its output, regardless of the size of its input. Whatever the length of the input, the output length is fixed, so we can reduce the dimension a hundred-fold or more!
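A quick illustration of that property, using SHA-256 purely as an example (any hash function behaves the same way here):

```python
import hashlib

# A six-character input and an input of a couple hundred thousand
# characters both hash to a digest of exactly the same length.
short = hashlib.sha256(b"Hamlet").hexdigest()
long_ = hashlib.sha256(("To be or not to be... " * 10_000).encode()).hexdigest()

print(len(short), len(long_))  # 64 64 -- constant output size
```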
But wait, that solves only one part of the problem. Reducing the dimension will not help unless the fingerprints of similar texts are similar too. As we know, some types of hashing produce completely different hashes for two strings if you change even a single character. That is a beautiful and desirable feature of a cryptographic hash function: you will never be able to recover the input from the output, and if the input changes by even a single character, the output changes completely. As a consequence, you would also never be able to find a similar hash for a similar text.
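You can see this "avalanche effect" directly. In the sketch below (again using SHA-256 as a stand-in for any cryptographic hash), adding one character to the input leaves the two digests agreeing at roughly chance level:

```python
import hashlib

h1 = hashlib.sha256(b"To be or not to be").hexdigest()
h2 = hashlib.sha256(b"To be or not to bee").hexdigest()

# Count the hex positions where the two digests agree. For unrelated
# digests this is about 1 in 16 positions -- nothing reflects the fact
# that the inputs differ by a single character.
matching = sum(a == b for a, b in zip(h1, h2))
print(h1)
print(h2)
print(matching, "of 64 hex characters match")
```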
That’s not the type of hash we want to apply, so let’s ask Hamlet again what he said about it.
To hash – to mix.
No more; and hashing that should be
Such that if you take two poems that differ less
The hashes you create for them
Shall also differ less or even equal be.
Wow! Hamlet was awesome! The solution is to make the hashes of similar texts similar too! And if two hashes are equal, we can consider the texts very close or similar.
And what hash function can take two very similar inputs and create very similar or even equal output?
The locality-sensitive hash.
According to Wikipedia:
Locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same “buckets” with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search.
Great, exactly what we need.
So, long story short, that's how we start solving the similarity-search task: create hashes and search among them, relying on the fact that similar texts have similar hashes.
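To make the idea concrete, here is a minimal sketch of one well-known locality-sensitive scheme, MinHash. All names and parameters here are illustrative, and the article does not say which LSH variant the author actually uses; production systems typically rely on a library such as datasketch, but the principle is the same: each text becomes a short fingerprint, and similar texts get fingerprints that agree in many positions.

```python
import hashlib

NUM_HASHES = 64  # fingerprint length: 64 small integers per text

def shingles(text, k=3):
    """Split a text into its set of overlapping word k-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text):
    """For each of NUM_HASHES seeded hash functions, keep the minimum
    hash value over all shingles. That vector is the fingerprint."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard
    similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = minhash("to be or not to be that is the question")
b = minhash("to be or not to be that is the real question")
c = minhash("something completely different appears in this line here")

print(similarity(a, b))  # high: the two lines differ by one word
print(similarity(a, c))  # near zero: unrelated texts rarely collide
```

Note the contrast with the cryptographic case: here a one-word change moves only a few positions of the fingerprint, which is exactly the "differ less" property Hamlet asked for.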
The next stage is how to store all the hashes for hundreds of thousands of text sections, and how to efficiently compute the similarity between hashes. For this, we've developed some powerful SQL queries, which we'll describe in future posts.