Detect duplicated/similar text among large datasets?

问题

I have a large database with thousands records. Every time a user post his information I need to know if there is already the same/similar record. Are there any algorithms or open source implementations to solve this problem?

We're using Chinese, and what 'similar' means is the records have most identical content, might be 80%-100% are the same. Each record will not be too big, about 2k-6k bytes

回答1:

http://d3s.mff.cuni.cz/~holub/sw/shash/

http://matpalm.com/resemblance/simhash/

回答2:

This answer is of a very high complexity class (worst case it's quintic, expected case it's quartic to verify your database the first time, then quartic/cubic to add a record,) so it doesn't scale well, unfortunately there isn't a much better answer that I can think of right now.

The algorithm is called the Ratcliff-Obershelp algorithm, It's implemented in python's difflib. The algorithm itself is cubic time worst case and quadratic expected. Then you have to do that for each possible pair of records, which is quadratic. When adding a record, of course, this is only linear.

EDIT: Sorry, I misread the documentation, difflib is quadratic only, rather than cubic. Use it rather than the other algorithm.

回答3:

Look at shngle-min-hash techniques. Here is a presentation that could help you.

回答4:

One approach I have used to do something similar is to construct a search index in the usual based on word statistics and then use the new item as if it was a search against that index - if the score for the top item in the search is too high then the new item is too similar. No doubt some of the standard text search libraries could be used for this although if it is only a few thousands of records it is pretty trivial to build your own.

来源：https://stackoverflow.com/questions/4067105/detect-duplicated-similar-text-among-large-datasets

标签

algorithm

similarity