Detect duplicated/similar text among large datasets?

断了今生、忘了曾经 提交于 2019-12-05 02:37:55

问题


I have a large database with thousands records. Every time a user post his information I need to know if there is already the same/similar record. Are there any algorithms or open source implementations to solve this problem?

We're using Chinese, and what 'similar' means is the records have most identical content, might be 80%-100% are the same. Each record will not be too big, about 2k-6k bytes


回答1:


http://d3s.mff.cuni.cz/~holub/sw/shash/

http://matpalm.com/resemblance/simhash/




回答2:


This answer is of a very high complexity class (worst case it's quintic, expected case it's quartic to verify your database the first time, then quartic/cubic to add a record,) so it doesn't scale well, unfortunately there isn't a much better answer that I can think of right now.

The algorithm is called the Ratcliff-Obershelp algorithm, It's implemented in python's difflib. The algorithm itself is cubic time worst case and quadratic expected. Then you have to do that for each possible pair of records, which is quadratic. When adding a record, of course, this is only linear.

EDIT: Sorry, I misread the documentation, difflib is quadratic only, rather than cubic. Use it rather than the other algorithm.




回答3:


Look at shngle-min-hash techniques. Here is a presentation that could help you.




回答4:


One approach I have used to do something similar is to construct a search index in the usual based on word statistics and then use the new item as if it was a search against that index - if the score for the top item in the search is too high then the new item is too similar. No doubt some of the standard text search libraries could be used for this although if it is only a few thousands of records it is pretty trivial to build your own.



来源:https://stackoverflow.com/questions/4067105/detect-duplicated-similar-text-among-large-datasets

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!