strategies for finding duplicate mailing addresses


Question


I'm trying to come up with a method of finding duplicate addresses, based on a similarity score. Consider these duplicate addresses:

addr_1 = '# 3 FAIRMONT LINK SOUTH'
addr_2 = '3 FAIRMONT LINK S'

addr_3 = '5703 - 48TH AVE'
addr_4 = '5703- 48 AVENUE'

I'm planning to apply some string transformations: abbreviate long words (e.g. NORTH -> N) and remove all spaces, commas, dashes and pound symbols. Given that output, how can I compare addr_3 with the rest of the addresses and detect similar ones? What percentage of similarity would be safe? Could you provide some simple Python code for this?

addr_1 = '3FAIRMONTLINKS'
addr_2 = '3FAIRMONTLINKS'

addr_3 = '570348THAV'
addr_4 = '570348AV'

Thankful,

Eduardo


Answer 1:


First, simplify the address string by collapsing all whitespace to a single space between each word, and forcing everything to lower case (or upper case if you prefer):

adr = " ".join(adr.tolower().split())

Then, I would strip out things like "st" in "41st Street" or "nd" in "42nd Street":

adr = re.sub("1st(\b|$)", r'1', adr)
adr = re.sub("([2-9])\s?nd(\b|$)", r'\1', adr)

Note that the second sub() will work with a space between the "2" and the "nd", but I didn't set up the first one to do that, because I'm not sure how you could tell the difference between "41 St Ave" and "41 St" (where the second one is "41 Street" abbreviated).

Be sure to read all the help for the re module; it's powerful but cryptic.

Then, I would split what you have left into a list of words, and apply the Soundex algorithm to list items that don't look like numbers:

http://en.wikipedia.org/wiki/Soundex

http://wwwhomes.uni-bielefeld.de/gibbon/Forms/Python/SEARCH/soundex.html

adrlist = [word if word.isdigit() else soundex(word) for word in adr.split()]
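For completeness, a minimal soundex() sketch that the line above could use (this is the simplified variant that ignores the H/W adjacency rule; the linked pages describe the full algorithm):

import re

def soundex(word):
    """Simplified American Soundex: keep the first letter, map the rest to
    digit codes, collapse adjacent duplicates, pad/truncate to 4 characters."""
    codes = {'b': '1', 'f': '1', 'p': '1', 'v': '1',
             'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2',
             's': '2', 'x': '2', 'z': '2',
             'd': '3', 't': '3',
             'l': '4',
             'm': '5', 'n': '5',
             'r': '6'}
    word = re.sub(r'[^a-z]', '', word.lower())
    if not word:
        return ''
    encoded = [codes.get(ch, '') for ch in word]
    result = word[0].upper()
    prev = encoded[0]
    for code in encoded[1:]:
        if code and code != prev:
            result += code
        prev = code
    return (result + '000')[:4]

print(soundex('fairmont'), soundex('fairmount'))   # F655 F655 -- they match

With that in place, the list comprehension turns "3 fairmont link s" into something like ['3', 'F655', 'L520', 'S000'].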

Then you can work with the list or join it back to a string as you think best.

The whole idea of the Soundex thing is to handle misspelled addresses. That may not be what you want, in which case just ignore this Soundex idea.

Good luck.




Answer 2:


Removing spaces, commas and dashes will be ambiguous. It is better to replace them with a single space.

Take for example this address

56 5th avenue

And this

5, 65th avenue

with your method both of them will be:

565THAV

What you can do is write a good address-shortening algorithm and then use string comparison to detect duplicates. This should be enough to detect duplicates in the general case. A general similarity algorithm won't work, because a difference of one number can mean a completely different address.

The algorithm can go like this (a minimal sketch follows the list):

  1. Replace all commas and dashes with spaces. Use the translate method for that.
  2. Build a dictionary of words and their abbreviated forms.
  3. Remove the TH part if it follows a number.
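
A minimal sketch of that recipe, assuming a small hand-built abbreviation dictionary (in practice you would fill it from the USPS list mentioned in the next answer):

import re

# Assumed abbreviation map; extend it from the USPS abbreviation list.
ABBREVIATIONS = {'AVENUE': 'AVE', 'STREET': 'ST', 'NORTH': 'N',
                 'SOUTH': 'S', 'EAST': 'E', 'WEST': 'W'}

def shorten_address(addr):
    # 1. replace commas and dashes with spaces
    addr = addr.upper().translate(str.maketrans(',-', '  '))
    words = []
    for word in addr.split():
        # 3. drop an ordinal suffix that follows a number: 48TH -> 48, 1ST -> 1
        word = re.sub(r'^(\d+)(ST|ND|RD|TH)$', r'\1', word)
        # 2. apply the abbreviation dictionary
        words.append(ABBREVIATIONS.get(word, word))
    return ' '.join(words)

print(shorten_address('5703 - 48TH AVE'))   # 5703 48 AVE
print(shorten_address('5703- 48 AVENUE'))   # 5703 48 AVE

Two addresses are then treated as duplicates when their shortened forms compare equal.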



Answer 3:


This should be helpful in building your dictionary of abbreviations:

http://www.usps.com/ncsc/lookups/usps_abbreviations.html




Answer 4:


In order to do this right, you need to standardize your addresses according to USPS standards (your address examples appear to be US based). There are many direct marketing service providers that offer CASS (Coding Accuracy Support System) certification of postal addresses. The CASS process will standardize all of your addresses and append ZIP+4 codes to them. Any undeliverable addresses will be flagged, which will further reduce your postal mailing costs, if that is your intent. Once all of your addresses are standardized, eliminating duplicates will be trivial.




Answer 5:


I had to do this once. I converted everything to lowercase, computed each address's Levenshtein distance to every other address, and ordered the results. It worked very well, but it was quite time-consuming.

You'll want to use an implementation of Levenshtein in C rather than in Python if you have a large data set. Mine was a few tens of thousands and took the better part of a day to run, I think.
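
A rough sketch of that pairwise approach, with a plain-Python edit distance (slow, as noted above; a C-backed implementation is preferable for large data sets):

from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

addresses = ['# 3 FAIRMONT LINK SOUTH', '3 FAIRMONT LINK S',
             '5703 - 48TH AVE', '5703- 48 AVENUE']

# Compare every pair on the lowercased strings; the smallest distances
# (the most similar pairs) come first after sorting.
pairs = sorted((levenshtein(x.lower(), y.lower()), x, y)
               for x, y in combinations(addresses, 2))
for dist, x, y in pairs:
    print(dist, '|', x, '|', y)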




Answer 6:


I regularly inspect addresses for duplication where I work, and I have to say, I find Soundex highly unsuitable. It's both too slow and too eager to match things. I have similar issues with Levenshtein distance.

What has worked best for me is to sanitize and tokenize the addresses (get rid of punctuation, split things up into words) and then just see how many tokens match up. Because addresses typically have several tokens, you can develop a level of confidence in terms of a combination of (1) how many tokens were matched, (2) how many numeric tokens were matched, and (3) how many tokens are available. For example, if all tokens in the shorter address are in the longer address, the confidence of a match is pretty high. Likewise, if you match 5 tokens including at least one that's numeric, even if the addresses each have 8, that's still a high-confidence match.
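
A sketch of that token-overlap idea; the specific weighting below is my own assumption, not the answer's, and would need tuning against real data:

import re

def tokens(addr):
    """Sanitize and tokenize: strip punctuation, uppercase, split on whitespace."""
    return set(re.sub(r'[^A-Z0-9 ]', ' ', addr.upper()).split())

def match_confidence(a, b):
    """Rough duplicate confidence from token overlap between two addresses."""
    ta, tb = tokens(a), tokens(b)
    shared = ta & tb
    if not shared:
        return 0.0
    coverage = len(shared) / min(len(ta), len(tb))   # share of the shorter address matched
    numeric_bonus = 0.1 if any(t.isdigit() for t in shared) else 0.0
    return min(1.0, coverage + numeric_bonus)

print(match_confidence('5703 - 48TH AVE', '5703- 48 AVENUE'))   # only '5703' is shared here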

It's definitely useful to do some tweaking, like substituting some common abbreviations. The USPS lists help, though I wouldn't go gung-ho trying to implement all of them, and some of the most valuable substitutions aren't on those lists. For example, 'JFK' should be a match for 'JOHN F KENNEDY', and there are a number of common ways to shorten 'MARTIN LUTHER KING JR'.

Maybe it goes without saying but I'll say it anyway, for completeness: Don't forget to just do a straight string comparison on the whole address before messing with more complicated things! This should be a very cheap test, and thus is probably a no-brainer first pass.

Obviously, the more time you're willing and able to spend (both on programming/testing and on run time), the better you'll be able to do. Fuzzy string matching techniques (faster and less generalized kinds than Levenshtein) can be useful, as a separate pass from the token approach (I wouldn't try to fuzzy match individual tokens against each other). I find that fuzzy string matching doesn't give me enough bang for my buck on addresses (though I will use it on names).



Source: https://stackoverflow.com/questions/1369289/strategies-for-finding-duplicate-mailing-addresses
