Fuzzy matching deduplication in less than exponential time?

轻奢々 2020-12-07 23:41

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).

I am looking for

6 Answers
  •  醉话见心
    2020-12-08 00:10

    Equivalence relations are particularly nice kinds of matching; they satisfy three properties:

    • reflexivity: for any value A, A ~ A
    • symmetry: if A ~ B, then necessarily B ~ A
    • transitivity: if A ~ B and B ~ C, then necessarily A ~ C

    What makes these nice is that they allow you to partition your data into disjoint sets such that every pair of elements in any given set is related by ~. So you can apply the union-find algorithm to partition all your data first, then pick a single representative element from each set in the partition; this completely de-duplicates the data (where "duplicate" means "related by ~"). Moreover, this solution is canonical in the sense that no matter which representatives you happen to pick from each set, you get the same number of final values, and the final values are pairwise non-duplicate.
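
As a rough illustration of the partition-then-pick-representatives idea, here is a minimal Python sketch. The relation used (`related`, which treats strings as duplicates when they are equal after lowercasing and collapsing whitespace) is a hypothetical stand-in for whatever equivalence relation your data actually calls for, and the names `UnionFind`, `normalize`, and `deduplicate` are made up for this example. It compares every pair of records, so it only shows the mechanics and is not a scalable solution for millions of rows.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, shortening the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def normalize(s):
    # Assumption for this sketch: ignore case and whitespace differences.
    return " ".join(s.lower().split())


def related(a, b):
    # Hypothetical equivalence relation: reflexive, symmetric, transitive.
    return normalize(a) == normalize(b)


def deduplicate(records, related):
    uf = UnionFind(len(records))
    # Naive all-pairs comparison, for illustration only.
    for i, j in combinations(range(len(records)), 2):
        if related(records[i], records[j]):
            uf.union(i, j)
    # Keep the first record seen in each set as its representative.
    representatives = {}
    for i, rec in enumerate(records):
        representatives.setdefault(uf.find(i), rec)
    return list(representatives.values())


records = ["12 Main St", "12  main st", "7 Oak Ave", "7 OAK AVE ", "3 Elm Rd"]
unique = deduplicate(records, related)
print(unique)
```

Picking a different representative from each set would change which strings are kept, but not how many, which is the sense in which the result is canonical.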

    Unfortunately, fuzzy matching is not an equivalence relation, since it is presumably not transitive (though it's probably reflexive and symmetric). As a result, there is no canonical way to partition the data; you might find that, however you try to partition the data, some values in one set match values in another set, or some values within a single set do not match each other.
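
To make the transitivity failure concrete, here is a small Python sketch. It assumes one particular notion of fuzzy matching, namely "Levenshtein edit distance of at most 1"; the question does not say which matcher is in use, so the threshold and the function names `edit_distance` and `fuzzy_match` are illustrative assumptions only.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(a, b, max_distance=1):
    # Assumed matcher for this sketch: strings within one edit of each other.
    return edit_distance(a, b) <= max_distance


a, b, c = "Jon", "Jan", "Jane"
print(fuzzy_match(a, b))  # A ~ B
print(fuzzy_match(b, c))  # B ~ C
print(fuzzy_match(a, c))  # A ~ C fails: two edits are needed
```

Here "Jon" matches "Jan" and "Jan" matches "Jane", yet "Jon" does not match "Jane". Grouping all three together puts a non-matching pair in one set, while splitting them leaves a matching pair across two sets.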

    So, what behavior do you want, exactly, in these situations?
