How to parse freeform street/postal address out of text, and into components


感动是毒 2020-11-22 13:40

We do business largely in the United States and are trying to improve user experience by combining all the address fields into a single text area. But there are a few proble

9 answers

独厮守ぢ (original poster)

2020-11-22 14:29

There are many street address parsers. They come in two basic flavors - ones that have databases of place names and street names, and ones that don't.

A regular expression street address parser can get up to about a 95% success rate without much trouble. Then you start hitting the unusual cases. The Perl module on CPAN, "Geo::StreetAddress::US", is about that good. There are Python and JavaScript ports of it, all open source. I have an improved version in Python which moves the success rate up slightly by handling more cases. To get the last 3% right, though, you need databases to help with disambiguation.
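As a rough illustration of the regular-expression approach, here is a minimal Python sketch. It is not the API of "Geo::StreetAddress::US" or of any of its ports; the pattern, the short list of street types, and the function name `parse_simple_address` are all made up for this example, and it assumes a comma-separated "number street type, city, ST ZIP" layout.

```python
import re

# Hypothetical, deliberately small pattern. Real parsers cover many more
# street types, directionals, unit designators, and punctuation variants.
ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<number>\d+)\s+                       # house number
    (?P<street>.+?)\s+                       # street name (lazy)
    (?P<type>St|Ave|Blvd|Rd|Dr|Ln|Way|Ct)\.?\s*,\s*
    (?P<city>[A-Za-z .'-]+?)\s*,\s*          # city
    (?P<state>[A-Z]{2})\s+                   # two-letter state abbreviation
    (?P<zip>\d{5}(?:-\d{4})?)\s*$            # ZIP or ZIP+4
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse_simple_address(text):
    """Return a dict of components, or None if the text does not fit."""
    match = ADDRESS_RE.match(text)
    return match.groupdict() if match else None


print(parse_simple_address("1600 Pennsylvania Ave, Washington, DC 20500"))
# An unusual case the pattern does not handle, so it returns None:
print(parse_simple_address("Fifth Floor, 100 Main St, Springfield, IL 62701"))
```

The second call shows the kind of input where a pure regex starts to fail: anything in front of the house number falls outside the pattern.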

A database with 3-digit ZIP code prefixes and US state names and abbreviations is a big help. When a parser sees a consistent postal code and state name, it can start to lock on to the format. This works very well for the US and UK.
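The consistency check can be sketched like this in Python. The `ZIP3_TO_STATE` table below is a tiny hand-picked excerpt for illustration only; a real parser would load the full prefix table from a data source. The function name and the trailing "ST 12345" pattern are assumptions of this sketch.

```python
import re

# Illustrative excerpt only: first three ZIP digits -> state abbreviation.
ZIP3_TO_STATE = {
    "021": "MA",  # Boston
    "100": "NY",  # New York
    "200": "DC",  # Washington
    "606": "IL",  # Chicago
    "902": "CA",  # Beverly Hills
}

# Look for "ST 12345" or "ST 12345-6789" at the very end of the text.
TAIL_RE = re.compile(r"\b([A-Z]{2})\s+(\d{5})(?:-\d{4})?\s*$")


def state_zip_consistent(text):
    """True/False if the state and ZIP prefix agree/disagree, None if unknown."""
    match = TAIL_RE.search(text)
    if match is None:
        return None  # no state + ZIP pair found at the end
    state, zip5 = match.groups()
    expected = ZIP3_TO_STATE.get(zip5[:3])
    if expected is None:
        return None  # prefix missing from this small table
    return expected == state


print(state_zip_consistent("1 Beacon St, Boston, MA 02108"))   # True
print(state_zip_consistent("1 Wacker Dr, Chicago, CA 60601"))  # False
```

A `True` result is the signal that the parser has probably found the real end of the address and can work on what comes before it.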

Proper street address parsing starts from the end and works backwards. That's how the USPS systems do it. Addresses are least ambiguous at the end, where country names, city names, and postal codes are relatively easy to recognize. Street names can usually be isolated. Locations on streets are the most complex to parse; there you encounter things such as "Fifth Floor" and "Staples Pavilion". That's when a database is a big help.
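The end-first idea can be sketched as follows in Python. This is not how the USPS systems are implemented; it only shows the order of operations. It assumes comma-separated input, and the function name `parse_from_end` and the `rest` key are invented for this example.

```python
import re


def parse_from_end(text):
    """Peel components off the end of a comma-separated US address."""
    parts = [p.strip() for p in text.split(",") if p.strip()]
    result = {}

    # 1. The last piece is the least ambiguous: "ST 12345" or "ST 12345-6789".
    if parts:
        match = re.fullmatch(r"([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)", parts[-1])
        if match:
            result["state"] = match.group(1).upper()
            result["zip"] = match.group(2)
            parts.pop()

    # 2. Once state and ZIP are locked in, the piece before them is the city.
    if "zip" in result and parts:
        result["city"] = parts.pop()

    # 3. Scan the remainder backwards for a piece that starts with a house number.
    for i in range(len(parts) - 1, -1, -1):
        if re.match(r"\d+\s+\S", parts[i]):
            result["street"] = parts.pop(i)
            break

    # 4. Whatever is left ("Fifth Floor", building names) is the hard part.
    result["rest"] = parts
    return result


print(parse_from_end("Fifth Floor, 100 Main St, Springfield, IL 62701"))
```

Everything that ends up in `rest` is what a database of place and building names would be needed to interpret.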
