I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:
addr_1 = '# 3 FAIRMONT LIN
Removing spaces, commas and dashes makes addresses ambiguous. It is better to replace them with a single space.
Take this address, for example:
56 5th avenue
And this one:
5, 65th avenue
With your method, both of them become:
565THAV
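To make the collision concrete, here is a minimal sketch in Python. The function names `strip_separators` and `collapse_separators` are made up for illustration, and the AVENUE to AV shortening from your method is left out, so the colliding key here is a bit longer than the one above.

```python
import re

def strip_separators(address: str) -> str:
    # Delete spaces, commas and dashes outright, then uppercase.
    return re.sub(r"[\s,\-]+", "", address).upper()

def collapse_separators(address: str) -> str:
    # Replace each run of spaces, commas and dashes with one space.
    return re.sub(r"[\s,\-]+", " ", address).strip().upper()

a = "56 5th avenue"
b = "5, 65th avenue"

# Both addresses collapse to the same key once separators are deleted.
print(strip_separators(a), strip_separators(b))

# Keeping a single space preserves the boundary between the numbers.
print(collapse_separators(a), "|", collapse_separators(b))
```

With the separators deleted, nothing is left to tell house number 56 on 5th from house number 5 on 65th. With a single space kept, the two keys stay different.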
What you can do is write a good address-shortening algorithm and then use plain string comparison to detect duplicates. This should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of a single digit can mean a completely different address.
The algorithm can go like this:
TH part if it was following a number.
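A rough sketch of what such a shortening function might look like in Python follows. It rests on several assumptions: the step about the TH part is read as "drop the ordinal suffix when it directly follows a number", the ST/ND/RD suffixes are handled the same way, and the `ABBREVIATIONS` table and the name `shorten_address` are hypothetical. Treat it as a starting point, not a complete address normalizer.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"AVENUE": "AV", "STREET": "ST", "ROAD": "RD"}

def shorten_address(address: str) -> str:
    text = address.upper()
    # Separators become a single space so number boundaries survive.
    text = re.sub(r"[\s,\-]+", " ", text).strip()
    # Drop an ordinal suffix only when it directly follows digits
    # (assumption: ST, ND and RD are treated like TH).
    text = re.sub(r"\b(\d+)(?:ST|ND|RD|TH)\b", r"\1", text)
    # Shorten known street-type words; leave everything else alone.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)

def is_duplicate(first: str, second: str) -> bool:
    # Plain string comparison on the shortened forms.
    return shorten_address(first) == shorten_address(second)

print(shorten_address("56 5th avenue"))
print(shorten_address("5, 65th avenue"))
print(is_duplicate("56, 5TH Avenue", "56 5th avenue"))
```

Because the suffix pattern requires digits in front of it, a word like NORTH keeps its TH, and the two example addresses end up with different keys.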