Fuzzy matching deduplication in less than exponential time?

前端 未结 6 1920
轻奢々
轻奢々 2020-12-07 23:41

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).

I am looking for

6条回答
  •  广开言路
    2020-12-08 00:02

    Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. This STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE... , RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.

    A next level up in sophistication is to pick an random hash function - e.g. HMAC with an arbitrary key, and of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash"

提交回复
热议问题