How to calculate equal hash for similar strings?

大城市里の小女人 submitted on 2019-12-24 17:52:11

Question


I am building an anti-plagiarism tool (Antiplagiat) using the shingle method. For example, I have the following shingles:

  1. I go to the cinema
  2. I go to the cinema1
  3. I go to th cinema

Is there a way to compute the same hash for all of these lines?

I know about Levenshtein distance, but I do not know which string I should take as the reference to compare against. Maybe there is a better approach than Levenshtein distance.


Answer 1:


The problem with hashing is that, whatever hash function you choose, there will always be 2 strings that differ by a single character yet hash to different values.

Small proof:

Consider all possible strings.
Assume the hash function maps them to at least 2 different values (otherwise every string gets the same hash, which is useless).
Take any 2 strings A and B that hash to different values.
You can go from A to B by changing, inserting or deleting one character at a time.
Since the hash at the start of this path differs from the hash at the end, it must change at some step.
At that step, two strings that differ by a single character have different hashes.
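The argument above can be checked mechanically. The following is a minimal sketch, not part of the original answer: `toy_hash` (CRC32), `edit_path` and `first_hash_change` are hypothetical names chosen for illustration, and the path used (delete everything, then re-type the target) is just one of many possible single-character edit paths.

```python
import zlib
from typing import Callable, List, Optional, Tuple


def toy_hash(s: str) -> int:
    # Stand-in hash for the demonstration (an assumption); any hash works.
    return zlib.crc32(s.encode("utf-8"))


def edit_path(a: str, b: str) -> List[str]:
    """Strings leading from a to b, each one single-character edit apart."""
    # Delete characters from the end of a until the string is empty ...
    path = [a[:i] for i in range(len(a), -1, -1)]
    # ... then append the characters of b one at a time.
    path += [b[:i] for i in range(1, len(b) + 1)]
    return path


def first_hash_change(
    a: str, b: str, h: Callable[[str], int] = toy_hash
) -> Optional[Tuple[str, str]]:
    """First adjacent pair on the path whose hashes differ, or None."""
    path = edit_path(a, b)
    for prev, cur in zip(path, path[1:]):
        if h(prev) != h(cur):
            return prev, cur
    return None


# The returned pair differs by exactly one character but hashes differently.
print(first_hash_change("I go to the cinema", "I go to the cinema1"))
```

Whenever `h(a) != h(b)`, the function cannot return `None`, which is exactly the claim of the proof.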

Some options I can think of:

  • Hash multiple parts of the string and compare each of these hashes. This probably won't work too well, since a single omitted character can cause a significant difference in the hash values.

  • Check a range of hashes. A hash is one-dimensional, but string similarity is not, so this probably won't work either.
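One way to try the first option is to use overlapping character n-grams as the parts, so that an omitted character only disturbs the few n-grams that contain it. The sketch below is an assumption about how the parts might be chosen (trigrams, hashed with CRC32, compared by Jaccard overlap); the answer itself does not specify any of this, and `ngram_hashes` and `jaccard` are illustrative names.

```python
import zlib


def ngram_hashes(text: str, n: int = 3) -> set:
    """Hash every overlapping character n-gram of text."""
    return {
        zlib.crc32(text[i:i + n].encode("utf-8"))
        for i in range(len(text) - n + 1)
    }


def jaccard(a: set, b: set) -> float:
    """Share of hashes the two sets have in common (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


lines = ["I go to the cinema", "I go to the cinema1", "I go to th cinema"]
base = ngram_hashes(lines[0])
for line in lines[1:]:
    print(line, round(jaccard(base, ngram_hashes(line)), 3))
```

This does not produce one equal hash per string. It only gives a similarity score, so a threshold still has to be chosen for what counts as "similar".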

All in all, hashing is probably not the way to go.
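If hashing is dropped, the shingles can be compared directly, for instance with the Levenshtein distance mentioned in the question. The following is a minimal sketch of the standard dynamic-programming formulation. Comparing shingles pairwise means no single "source" string is needed, although that pairwise strategy is an assumption and may be too slow for large collections.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


lines = ["I go to the cinema", "I go to the cinema1", "I go to th cinema"]
for i in range(len(lines)):
    for j in range(i + 1, len(lines)):
        print(repr(lines[i]), repr(lines[j]), levenshtein(lines[i], lines[j]))
```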




Answer 2:


This question is a bit old, but you may be interested in this paper by two researchers at AT&T. They employ a technique reminiscent of the Nilsimsa hash to detect when similar SMS messages have been seen an "abnormal" number of times within a time window.

It sounds like locality-sensitive hashing would also be pertinent to your problem.
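As an illustration of the locality-sensitive idea, here is a minimal MinHash sketch with banding. MinHash is only one common scheme for shingle sets, and choosing it here is an assumption, as are the parameters (trigrams, 64 hash functions, 4 rows per band) and all function names. Similar strings are likely, but not guaranteed, to share at least one band key, which is about as close to an "equal hash" as this approach gets.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Overlapping character n-grams; falls back to the whole text if too short."""
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return grams or {text}


def seeded_hash(seed: int, token: str) -> int:
    # A different seed gives a different, effectively independent hash function.
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, num_hashes: int = 64, n: int = 3) -> list:
    """For each seeded hash function, keep the minimum value over all shingles."""
    grams = shingles(text, n)
    return [min(seeded_hash(seed, g) for g in grams) for seed in range(num_hashes)]


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions; approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(signature: list, rows: int = 4) -> set:
    """Split the signature into bands; each band acts as one bucket key."""
    return {
        (start // rows, tuple(signature[start:start + rows]))
        for start in range(0, len(signature), rows)
    }


a = minhash_signature("I go to the cinema")
b = minhash_signature("I go to the cinema1")
print(estimated_similarity(a, b))
print(len(band_keys(a) & band_keys(b)), "shared band keys")
```

In practice the band keys would be stored in a hash table, and only strings that collide in some band would be compared in detail.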



Source: https://stackoverflow.com/questions/15377043/how-to-calculate-equal-hash-for-similar-strings
