Near Duplicate Detection in Data Streams


Question


I am currently working on a streaming API that generates a lot of textual content. As expected, the API emits a lot of duplicates, and we also have a business requirement to filter near-duplicate data.

I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. A Stable Bloom Filter is a data structure for duplicate detection in data streams with an upper bound on the false positive rate.

However, I want to identify near duplicates, so I also looked at hashing algorithms like LSH and MinHash that are used in nearest-neighbour search and near-duplicate detection.

I am kind of stuck and am looking for pointers on how to proceed, as well as papers/implementations I could look at.


Answer 1:


  1. First, normalize the text to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress all multiple white spaces to one, and remove leading and trailing white space; for speed I would perform all these operations in one pass over the text. Next, take the MD5 hash (or something faster) of the resulting string. Look up the MD5 hash (as two 64-bit integers) in a database table: if it exists, the text is an exact duplicate; if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage. (A minimal sketch follows the list.)

  2. To find near duplicates, the normalized string needs to be converted into potential signatures (hashes of substrings); see the SpotSigs paper and the blog post by Greg Linden. Suppose the routine Sigs() does that for a given string, that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64-bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm (the code is here). (A toy Sigs() is sketched below, after the list.)

  3. Given Sigs(), the problem of efficiently finding the near duplicates is commonly called the set similarity join problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared against, as does the simhash method. (The third sketch below shows the basic inverted-index approach.)
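
A minimal sketch of step 1 in Python, assuming an in-memory dict in place of the database table and a purely time-based ageing policy; the names normalize, is_exact_duplicate, and MAX_AGE_SECONDS are mine, not from the answer:

    import hashlib
    import re
    import time

    # Hypothetical in-memory store standing in for the database table; the
    # digest could be stored as two 64-bit integers, as the answer suggests.
    seen = {}                      # digest -> timestamp last seen
    MAX_AGE_SECONDS = 24 * 3600    # assumed ageing window

    def normalize(text):
        # Lowercase, turn runs of non-letters into a single space, trim the ends.
        return re.sub(r"[^a-z]+", " ", text.lower()).strip()

    def is_exact_duplicate(text, now=None):
        now = time.time() if now is None else now
        digest = hashlib.md5(normalize(text).encode("utf-8")).digest()
        # Age off old hashes (time-based; a memory cap would work similarly).
        for h, ts in list(seen.items()):
            if now - ts > MAX_AGE_SECONDS:
                del seen[h]
        if digest in seen:
            seen[digest] = now
            return True
        seen[digest] = now
        return False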

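A toy stand-in for Sigs() (step 2), again in Python: it hashes word shingles and keeps the few smallest hashes, MinHash-style. This is only an illustration; the actual SpotSigs selection anchors signatures at stopwords, and the shingle length and signature count here are arbitrary assumptions:

    import hashlib

    def sigs(normalized_text, shingle_len=4, num_sigs=4):
        # Build overlapping word shingles of length shingle_len.
        words = normalized_text.split()
        if not words:
            return set()
        if len(words) < shingle_len:
            shingles = [" ".join(words)]
        else:
            shingles = [" ".join(words[i:i + shingle_len])
                        for i in range(len(words) - shingle_len + 1)]
        # Hash each shingle to a 64-bit integer and keep the smallest few,
        # so similar texts tend to end up with overlapping signature sets.
        hashes = sorted(int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
                        for s in shingles)
        return set(hashes[:num_sigs])   # a small set of 64-bit integers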

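And a basic sketch of the set similarity join in step 3, using an in-memory inverted index from signature value to document id so that a new document is only compared against documents sharing at least one signature. The 0.5 overlap threshold and the names near_duplicates and doc_sigs are assumptions of mine, and the SpotSigs/simhash pruning heuristics are not reproduced here:

    from collections import defaultdict

    index = defaultdict(set)   # signature value -> ids of documents containing it
    doc_sigs = {}              # document id -> its signature set

    def near_duplicates(doc_id, signatures, threshold=0.5):
        # Candidate generation: only documents sharing at least one signature.
        candidates = set()
        for s in signatures:
            candidates |= index[s]
        # Verification: Jaccard overlap of the signature sets.
        matches = []
        for cand in candidates:
            other = doc_sigs[cand]
            overlap = len(signatures & other) / len(signatures | other)
            if overlap >= threshold:
                matches.append((cand, overlap))
        # Register the new document for future lookups.
        doc_sigs[doc_id] = set(signatures)
        for s in signatures:
            index[s].add(doc_id)
        return matches

In a streaming setting you would also age entries out of index and doc_sigs over time, just as with the exact-duplicate hash table in step 1.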


Answer 2:


http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes.



Source: https://stackoverflow.com/questions/10348929/near-duplicate-detection-in-data-streams
