To Be Similar or Not to Be
by Dmytro Ustynov
That is the question.
Whether ‘tis nobler to check poems word by word
Among hundreds of thousands of words
And poems.
Or try to find a better way
And with reducing the size of search
Fetch the exact answer instantly?
The main idea of searching for similar texts is exactly what Hamlet said four hundred years ago – to try to reduce the dimension of what you are searching for.
Sorry, my daughter just said that Hamlet didn’t exactly talk about near similar search among a large corpus of texts, but about something else…
In any case, searching for near-similar documents is an important data-science task that comes up in search engines, plagiarism detection, deduplication of search results, and so on.
Near-similar search differs from another kind of search: full-text search. Full-text search means "find, among a set of texts, every text that contains string A", where string A is quite short. Near-similar search, on the other hand, asks "find, among a set of texts, every text that is similar to text A", where text A may be quite long.
So searching for similar text the same way we usually search full text is not an option: it would take far too much time (days instead of seconds).
The solution Hamlet… actually Hamming… proposed is to reduce the size of every text you have: make a 'fingerprint' of each, then just compare the fingerprints! Thank you, Hamlet (Hamming)!
And what can take something as input and significantly reduce its size in the output?
The hash function.
One of the most important features of a hash function is the constant length of its output, regardless of the size of its input. Whatever the length of the input, the output length is fixed, so we can reduce the dimension a hundred-fold or more!
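A quick illustration of that property, using SHA-256 purely as an example (any hash function behaves the same way here):

```python
import hashlib

# A six-character input and an input of a couple hundred thousand
# characters both hash to a digest of exactly the same length.
short = hashlib.sha256(b"Hamlet").hexdigest()
long_ = hashlib.sha256(("To be or not to be... " * 10_000).encode()).hexdigest()

print(len(short), len(long_))  # 64 64 -- constant output size
```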
But wait, that solves only one part of the problem. Reducing the dimension will not help unless the fingerprints of similar texts are similar too. As we know, some types of hashing produce completely different hashes for two strings if you change even a single character. That is a beautiful and desirable feature of a cryptographic hash function: you will never be able to recover the input from the output, and if the input changes by even a single character, the output changes completely. As a consequence, you would also never be able to find a similar hash for a similar text.
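You can see this "avalanche effect" directly. In the sketch below (again using SHA-256 as a stand-in for any cryptographic hash), adding one character to the input leaves the two digests agreeing at roughly chance level:

```python
import hashlib

h1 = hashlib.sha256(b"To be or not to be").hexdigest()
h2 = hashlib.sha256(b"To be or not to bee").hexdigest()

# Count the hex positions where the two digests agree. For unrelated
# digests this is about 1 in 16 positions -- nothing reflects the fact
# that the inputs differ by a single character.
matching = sum(a == b for a, b in zip(h1, h2))
print(h1)
print(h2)
print(matching, "of 64 hex characters match")
```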
That’s not the type of hash we want to apply, so let’s ask Hamlet again what he said about it.
To hash – to mix.
No more; and hashing that should be
Such that if you take two poems that differ less
The hashes you create for them
Shall also differ less or even equal be.
Wow! Hamlet was awesome! The solution is to make the hashes of similar texts similar too! And if two hashes are equal, we can consider the texts very close or similar.
And what hash function can take two very similar inputs and create very similar or even equal output?
The locality-sensitive hash.
According to Wikipedia:
Locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same “buckets” with high probability. (The number of buckets is much smaller than the universe of possible input items.) Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search.
Great, exactly what we need.
So, long story short, that's how we start solving the similarity-search task: create hashes and search among them, relying on the fact that similar texts have similar hashes.
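To make the idea concrete, here is a minimal sketch of one well-known locality-sensitive scheme, MinHash. All names and parameters here are illustrative, and the article does not say which LSH variant the author actually uses; production systems typically rely on a library such as datasketch, but the principle is the same: each text becomes a short fingerprint, and similar texts get fingerprints that agree in many positions.

```python
import hashlib

NUM_HASHES = 64  # fingerprint length: 64 small integers per text

def shingles(text, k=3):
    """Split a text into its set of overlapping word k-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash(text):
    """For each of NUM_HASHES seeded hash functions, keep the minimum
    hash value over all shingles. That vector is the fingerprint."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard
    similarity of the two shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

a = minhash("to be or not to be that is the question")
b = minhash("to be or not to be that is the real question")
c = minhash("something completely different appears in this line here")

print(similarity(a, b))  # high: the two lines differ by one word
print(similarity(a, c))  # near zero: unrelated texts rarely collide
```

Note the contrast with the cryptographic case: here a one-word change moves only a few positions of the fingerprint, which is exactly the "differ less" property Hamlet asked for.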
The next stage is how to store all the hashes for hundreds of thousands of text sections, and how to efficiently compute the similarity between hashes. For this, we've developed some powerful SQL queries, which we'll describe in future posts.