How to calculate equal hash for similar strings?

Question

I am building an anti-plagiarism tool using the shingle method. For example, I have the following shingles:

I go to the cinema
I go to the cinema1
I go to th cinema

Is there a way to compute the same hash for all of these lines?

I know about Levenshtein distance. However, I do not know which string I should take as the source word to measure against. Maybe there is a better approach than Levenshtein distance.

Answer 1:

The problem with hashing is that, logically, you will always run into 2 strings that differ by a single character yet hash to different values.

Small proof:

Consider all possible strings.
Assume the hash function maps them to at least 2 different values (otherwise it is constant and tells you nothing).
Take any 2 strings A and B that hash to different values.
You can get from A to B by changing, inserting, or deleting one character at a time.
Thus at some step along the way the hash must change.
Thus, at that step, two strings that differ by a single character have different hashes.
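
The argument above can be sketched in code. This is only an illustration, not part of the original answer: the hash `h` (the low byte of CRC-32) and the helpers `edit_path` and `first_hash_change` are hypothetical names chosen here, and any non-constant hash could be substituted.

```python
import zlib


def h(s: str) -> int:
    """Toy 8-bit hash (an assumption for the demo): low byte of CRC-32."""
    return zlib.crc32(s.encode("utf-8")) & 0xFF


def edit_path(a: str, b: str) -> list:
    """Return strings leading from a to b, each one single-character edit
    (substitution, deletion, or insertion) away from the previous one."""
    path = [a]
    cur = a
    # Substitute mismatched characters over the common length.
    for i in range(min(len(a), len(b))):
        if cur[i] != b[i]:
            cur = cur[:i] + b[i] + cur[i + 1:]
            path.append(cur)
    # Delete surplus characters from the end.
    while len(cur) > len(b):
        cur = cur[:-1]
        path.append(cur)
    # Append the characters that are still missing.
    while len(cur) < len(b):
        cur = cur + b[len(cur)]
        path.append(cur)
    return path


def first_hash_change(a: str, b: str):
    """Return the first adjacent pair on the path whose hashes differ,
    or None if the hash never changes along the path."""
    path = edit_path(a, b)
    for prev, nxt in zip(path, path[1:]):
        if h(prev) != h(nxt):
            return prev, nxt
    return None


if __name__ == "__main__":
    print(first_hash_change("I go to the cinema", "I go to th cinema"))
```

Whenever `h(a) != h(b)`, `first_hash_change(a, b)` must return a pair, and that pair differs by exactly one edit.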

Some options I can think of:

Hash multiple parts of the string and check each of these hashes. Probably won't work too well, since a single character omission will cause a significant difference in the hash values.
Check a range of hashes. A hash is one-dimensional, but string similarity is not, thus this probably won't work either.
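
As a rough sketch of the first option, one could hash overlapping character n-grams and compare the resulting sets of hashes. The choice of overlapping trigrams, CRC-32, and Jaccard overlap are all assumptions made for this illustration, and `part_hashes` and `jaccard` are hypothetical helper names.

```python
import zlib


def part_hashes(text: str, n: int = 3) -> set:
    """Hash every overlapping n-character part of the text."""
    count = max(len(text) - n + 1, 1)
    return {zlib.crc32(text[i:i + n].encode("utf-8")) for i in range(count)}


def jaccard(a: set, b: set) -> float:
    """Share of part hashes the two sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    base = part_hashes("I go to the cinema")
    for other in ("I go to the cinema1", "I go to th cinema"):
        print(other, jaccard(base, part_hashes(other)))
```

A single omitted character changes every part that contained it, which is the weakness the answer points out. With overlapping parts the remaining hashes still match, so the result is a similarity score rather than one equal hash.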

All in all, hashing is probably not the way to go.

Answer 2:

This question is a bit old, but you may be interested in this paper by two researchers at AT&T. They employ a technique reminiscent of the Nilsimsa hash to detect when similar SMS messages have been seen an "abnormal" number of times in a time window.

It sounds like locality-sensitive hashing would also be pertinent to your problem.
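
A minimal sketch of one locality-sensitive scheme, MinHash with banding over character trigrams, is shown below. This is not the method from the paper, nor Nilsimsa; the parameters (64 hash functions, 2 rows per band) and the helper names `shingles`, `minhash`, `estimate_similarity`, and `band_keys` are assumptions picked for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def _hash(gram: str, seed: int) -> int:
    """64-bit hash of a gram; the salt selects one of the hash functions."""
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str, num_hashes: int = 64) -> list:
    """MinHash signature: the smallest hash of any gram, per hash function."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions; estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(sig: list, rows: int = 2) -> set:
    """Split the signature into bands; each band serves as a bucket key.
    Strings sharing any key are candidates for a closer comparison."""
    return {(i, tuple(sig[i:i + rows])) for i in range(0, len(sig), rows)}


if __name__ == "__main__":
    a = minhash("I go to the cinema")
    b = minhash("I go to th cinema")
    print(estimate_similarity(a, b), len(band_keys(a) & band_keys(b)))
```

Similar strings are likely, though not guaranteed, to share at least one band key, so the keys can be used to look up candidates in a hash table before running an exact comparison such as Levenshtein distance.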

Source: https://stackoverflow.com/questions/15377043/how-to-calculate-equal-hash-for-similar-strings

Tags

hash

levenshtein-distance