I have a large database (potentially millions of records) of relatively short strings of text (street addresses, names, etc.).
I am looking for
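Equivalence relations are particularly nice kinds of matching; they satisfy three properties: reflexivity (a ~ a for every a), symmetry (if a ~ b, then b ~ a), and transitivity (if a ~ b and b ~ c, then a ~ c).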
What makes these nice is that they allow you to partition your data into disjoint sets such that any two elements in the same set are related by ~, and no two elements from different sets are. So, what you can do is apply the union-find algorithm to first partition all your data, then pick out a single representative element from each set in the partition; this completely de-duplicates the data (where "duplicate" means "related by ~"). Moreover, this solution is canonical in the sense that no matter which representative you happen to pick from each set, you get the same number of final values, and the final values are pairwise non-duplicate.
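Here is a minimal sketch of that union-find approach in Python. The `related` function is a hypothetical stand-in for ~ (here: equal after case-folding and collapsing whitespace), and the names `UnionFind` and `deduplicate` are my own. The pairwise loop is quadratic, so for millions of records you would need some way to limit which pairs get compared; this only illustrates the partitioning step.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Walk up to the root, shortening the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def related(a, b):
    # Hypothetical equivalence relation: equal after case-folding
    # and collapsing runs of whitespace.
    return " ".join(a.casefold().split()) == " ".join(b.casefold().split())


def deduplicate(records):
    uf = UnionFind(len(records))
    # Naive all-pairs comparison: O(n^2), fine only for a sketch.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if related(records[i], records[j]):
                uf.union(i, j)
    # Keep the first record seen in each set as its representative.
    representatives = {}
    for i, record in enumerate(records):
        representatives.setdefault(uf.find(i), record)
    return list(representatives.values())


records = ["12 Main St", "12  main st", "Jane Doe", "JANE DOE", "Bob"]
unique = deduplicate(records)
```

Which record ends up representing each set is arbitrary (here, the first one encountered), but the number of representatives is always the same.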
Unfortunately, fuzzy matching is not an equivalence relation, since it is presumably not transitive (though it's probably reflexive and symmetric). The result of this is that there isn't a canonical way to partition the data; you might find that any way you try to partition the data, some values in one set match values from another set, or some values within a single set do not match each other.
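To make the transitivity failure concrete, here is a small sketch that assumes "fuzzy match" means "Levenshtein edit distance of at most 1". That definition, the threshold, and the function names are my assumptions for illustration; your matcher may well be something else.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(a, b, max_distance=1):
    # Assumed definition of a fuzzy match: within max_distance edits.
    return edit_distance(a, b) <= max_distance


a, b, c = "cat", "cot", "cog"
chain = (fuzzy_match(a, b), fuzzy_match(b, c), fuzzy_match(a, c))
```

Here "cat" matches "cot" and "cot" matches "cog", but "cat" does not match "cog". If you put all three in one set, that set contains a non-matching pair; if you split them, some match crosses the boundary between sets.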
So, what behavior do you want, exactly, in these situations?