发表新帖

发表新帖

Fuzzy matching deduplication in less than exponential time?

前端未结

关注

 6  1920

轻奢々 2020-12-07 23:41

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).

I am looking for

6条回答

广开言路 (楼主)

2020-12-08 00:02

Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. This STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE... , RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.

A next level up in sophistication is to pick an random hash function - e.g. HMAC with an arbitrary key, and of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash"

0 讨论(0)

查看其它6个回答
发布评论:

提交评论
- 加载中...

热议问题