strategies for finding duplicate mailing addresses

悲哀的现实 2021-02-10 02:08

I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = '# 3 FAIRMONT LIN


        
6 Answers
  •  忘掉有多难
    2021-02-10 02:40

    Removing spaces, commas, and dashes makes the result ambiguous. It is better to replace them with a single space.

    Take this address, for example:

    56 5th avenue
    

    And this one:

    5, 65th avenue
    

    With your method, both of them become:

    565THAV
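
The collision can be reproduced with a small sketch. The question's code is not shown here, so `squeezed_key` is only a hypothetical stand-in for it (uppercase, shorten AVENUE to AV, delete separators); `spaced_key` is the same thing with separators collapsed to one space instead.

```python
import re

def squeezed_key(address: str) -> str:
    # Hypothetical stand-in for the asker's method: separators are deleted.
    text = address.upper().replace("AVENUE", "AV")
    return re.sub(r"[\s,\-]+", "", text)

def spaced_key(address: str) -> str:
    # Same normalization, but each run of separators becomes one space,
    # so the boundary between house number and street number survives.
    text = address.upper().replace("AVENUE", "AV")
    return re.sub(r"[\s,\-]+", " ", text).strip()

print(squeezed_key("56 5th avenue"), squeezed_key("5, 65th avenue"))
print(spaced_key("56 5th avenue"), "|", spaced_key("5, 65th avenue"))
```

With the separators deleted, the two keys are identical; with a single space kept, they stay distinct.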
    

    What you can do is write a good address-shortening algorithm and then use exact string comparison to detect duplicates. This should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of a single digit can mean a completely different address.
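
To illustrate why a generic similarity score is risky, here is a rough sketch using `difflib.SequenceMatcher` from the Python standard library. The choice of `difflib` and the sample addresses are assumptions for illustration; the question does not say which similarity measure it uses.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; higher means the strings share more characters in order.
    return SequenceMatcher(None, a, b).ratio()

# Two different houses that differ by a single digit.
different_houses = similarity("56 5TH AVENUE", "57 5TH AVENUE")

# The same house, written once in full and once abbreviated.
same_house = similarity("56 5TH AVENUE", "56 5TH AV")

print(different_houses, same_house)
```

In this example the two different houses score higher than the two spellings of the same house, so no threshold on this score can separate them correctly.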

    The algorithm can go like this:

    1. Replace all commas and dashes with spaces. Use the translate method for that.
    2. Build a dictionary that maps words to their abbreviated forms.
    3. Remove the TH suffix when it follows a number.
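
The three steps above could be sketched roughly as follows. Everything named here is an assumption for illustration: `ABBREVIATIONS` is a tiny hypothetical table, `shorten_address` and `find_duplicates` are made-up helper names, and the ordinal pattern also strips ST, ND, and RD after a number, which goes slightly beyond the TH case in step 3.

```python
import re

# Step 2: hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {
    "AVENUE": "AV",
    "AVE": "AV",
    "STREET": "ST",
    "ROAD": "RD",
    "LANE": "LN",
}

# Step 1: translation table that turns commas and dashes into spaces.
SEPARATORS = str.maketrans({",": " ", "-": " "})

# Step 3: an ordinal suffix directly after digits, e.g. 5TH -> 5.
ORDINAL = re.compile(r"^(\d+)(?:ST|ND|RD|TH)$")

def shorten_address(address: str) -> str:
    """Return a normalized key; equal keys are treated as duplicates."""
    tokens = address.upper().translate(SEPARATORS).split()
    shortened = []
    for token in tokens:
        token = ORDINAL.sub(r"\1", token)
        shortened.append(ABBREVIATIONS.get(token, token))
    # Joining with one space keeps the token boundaries intact.
    return " ".join(shortened)

def find_duplicates(addresses):
    """Group addresses by shortened key and return groups with more than one entry."""
    groups = {}
    for address in addresses:
        groups.setdefault(shorten_address(address), []).append(address)
    return [group for group in groups.values() if len(group) > 1]

print(find_duplicates(["56 5th avenue", "56, 5TH AVE", "5, 65th avenue"]))
```

Because the comparison is exact on the shortened key, "56 5th avenue" and "56, 5TH AVE" are grouped together while "5, 65th avenue" stays separate.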
