Regex for splitting a German address into its parts

限于喜欢 submitted on 2019-12-04 12:19:45

Question


Good evening,

I'm trying to split a German address string into its parts in Java. Does anyone know a regex or a library that can do this? The split should look like the following:

Name der Straße 25a 88489 Teststadt
to
Name der Straße|25a|88489|Teststadt

or

Teststr. 3 88489 Beispielort (Großer Kreis)
to
Teststr.|3|88489|Beispielort (Großer Kreis)

It would be perfect if the system / regex would still work if parts like the zip code or the city are missing.

Is there any regex or library out there with which I could achieve this?

EDIT: Rules for German addresses:
Street: Characters, numbers and spaces
House no: A number and any characters (or spaces) up to a series of digits (the zip), at least in these examples
Zip: 5 digits
Place or city: The rest, possibly also with spaces, commas or parentheses


Answer 1:


I came across a similar problem, tweaked the solutions provided here a little, and ended up with the following pattern, which also works but is (imo) a bit simpler to understand and to extend:

/^([a-zäöüß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/i

Here are some example matches.

It can also handle missing street numbers and is easily extensible by adding special characters to the character classes.

([a-zäöüß\s\d,.-]+?)                           # Street name (lazy)
([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?        # Street number (optional)

After that comes the zip code, which is the only part that is strictly required because it is the only part with a fixed format. Everything after the zip code is treated as the city name.
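For Java, the pattern above could be wired up roughly as follows. This is a sketch rather than part of the original answer: the class name `GermanAddress` and the method `split` are made up for illustration, the `/i` flag is translated to `CASE_INSENSITIVE | UNICODE_CASE`, and the non-ASCII letters are written as Unicode escapes so the source does not depend on the file encoding. The groups are trimmed because the house-number group can pick up a trailing space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are assumptions for illustration.
class GermanAddress {
    // Same pattern as in this answer. \u00e4 \u00f6 \u00fc \u00df are the
    // a-umlaut, o-umlaut, u-umlaut and sharp s from the original character class.
    private static final Pattern ADDRESS = Pattern.compile(
        "^([a-z\u00e4\u00f6\u00fc\u00df\\s\\d.,-]+?)\\s*"       // street name (lazy)
        + "([\\d\\s]+(?:\\s?[-|+/]\\s?\\d+)?\\s*[a-z]?)?\\s*"   // street number (optional)
        + "(\\d{5})\\s*"                                        // zip code (required)
        + "(.+)?$",                                             // city (rest of the input)
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    /** Returns {street, number, zip, city}, or null if the input does not match. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address.trim());
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[4];
        for (int i = 0; i < 4; i++) {
            String g = m.group(i + 1);              // null if the group did not take part
            parts[i] = (g == null) ? "" : g.trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", split("Teststr. 3 88489 Beispielort")));
        // Missing street number: the second field comes back empty
        System.out.println(String.join("|", split("Hauptstr. 12345 Berlin")));
    }
}
```

An input without a five-digit zip code does not match at all, so `split` returns null in that case and the caller has to handle it.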




Answer 2:


I’d start from the back since, as far as I know, a city name cannot contain numbers (but it can contain spaces; the first example I’ve found is “Weil der Stadt”). Then the five-digit number before that must be the zip code.

The number (possibly followed by a single letter) before that is the street number. Note that this can also be a range. Anything before that is the street name.

Anyway, here we go:

^((?:\p{L}| |\d|\.|-)+?) (\d+(?: ?- ?\d+)? *[a-zA-Z]?) (\d{5}) ((?:\p{L}| |-)+)(?: *\(([^\)]+)\))?$

This correctly parses even arcane addresses such as “Straße des 17. Juni 23-25 a 12345 Berlin-Mitte”.

Note that this doesn’t work with address extensions (such as “Gartenhaus” or “c/o …”). I have no clue how to handle those. I rather doubt that there’s a viable regular expression to express all this.

As you can see, this is quite a complex regular expression with lots of capture groups. If I were to use such an expression in code, I would use named captures (Java 7 supports them) and break the expression up into smaller morsels using the x flag. Unfortunately, Java makes this awkward: it does have an x flag (Pattern.COMMENTS), but without multi-line string literals the pattern still has to be glued together from concatenated strings. This s*cks because it makes complex regular expressions much harder to read than they need to be.

Still, here’s a somewhat more legible regular expression:

^
(?<street>(?:\p{L}|\ |\d|\.|-)+?)\ 
(?<number>\d+(?:\ ?-\ ?\d+)?\ *[a-zA-Z]?)\ 
(?<zip>\d{5})\ 
(?<city>(?:\p{L}|\ |-)+)
(?:\ *\((?<suffix>[^\)]+)\))?
$

In Java 7, the closest we can achieve is this (untested; may contain typos):

String pattern =
    "^" +
    "(?<street>(?:\\p{L}| |\\d|\\.|-)+?) " +
    "(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) " +
    "(?<zip>\\d{5}) " +
    "(?<city>(?:\\p{L}| |-)+)" +
    "(?: *\\((?<suffix>[^\\)]+)\\))?" +
    "$";
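As a sketch of how the named groups can be read back, the snippet below compiles the same expression with `Pattern.COMMENTS`, Java's counterpart of the x flag, so that each line of the pattern can carry a comment. The class name `NamedGroupAddress` and the method `split` are assumptions for illustration, and the trimming is an addition because the city group may end with a space when a parenthesised suffix follows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions for illustration.
class NamedGroupAddress {
    // With Pattern.COMMENTS, unescaped whitespace and "# ..." comments inside
    // the pattern are ignored, so a literal space is written as "\ "
    // (that is "\\ " inside a Java string literal).
    private static final Pattern ADDRESS = Pattern.compile(
        "^\n"
        + "(?<street>(?:\\p{L}|\\ |\\d|\\.|-)+?)\\        # street name (lazy)\n"
        + "(?<number>\\d+(?:\\ ?-\\ ?\\d+)?\\ *[a-zA-Z]?)\\   # house number or range\n"
        + "(?<zip>\\d{5})\\                              # five-digit zip code\n"
        + "(?<city>(?:\\p{L}|\\ |-)+)                    # city name\n"
        + "(?:\\ *\\((?<suffix>[^\\)]+)\\))?             # optional suffix\n"
        + "$",
        Pattern.COMMENTS);

    private static final String[] GROUPS = {"street", "number", "zip", "city", "suffix"};

    /** Returns {street, number, zip, city, suffix}, or null if the input does not match. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[GROUPS.length];
        for (int i = 0; i < GROUPS.length; i++) {
            String g = m.group(GROUPS[i]);          // null if the suffix is absent
            parts[i] = (g == null) ? "" : g.trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        // \u00df is the sharp s in "Strasse"
        String[] parts = split("Stra\u00dfe des 17. Juni 23-25 a 12345 Berlin-Mitte");
        System.out.println(String.join("|", parts));
    }
}
```

Looking groups up by name keeps the calling code readable even if further groups are added to the pattern later.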



Answer 3:


Here is my suggestion, which could be fine-tuned further, e.g. to allow missing parts.

Regex Pattern:

^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$
  • Group 1: Street
  • Group 2: House no.
  • Group 3: ZIP
  • Group 4: City
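One possible way to allow missing parts is sketched below. It is an assumption, not something given in this answer: the zip and city tails are wrapped in nested optional non-capturing groups, so an address may stop after the house number or after the zip code. The class name `LenientAddress` and the method `split` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are assumptions for illustration.
class LenientAddress {
    // Variation of the pattern above: " zip" is optional, and " city" is
    // only tried when a zip code was found.
    private static final Pattern ADDRESS =
        Pattern.compile("^([^0-9]+) ([0-9]+.*?)(?: ([0-9]{5})(?: (.*))?)?$");

    /** Returns {street, house no., zip, city}; missing parts are empty strings. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address.trim());
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[4];
        for (int i = 0; i < 4; i++) {
            String g = m.group(i + 1);      // null for a group that did not take part
            parts[i] = (g == null) ? "" : g;
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", split("Teststr. 3 88489 Beispielort")));
        System.out.println(String.join("|", split("Teststr. 3 88489")));
        System.out.println(String.join("|", split("Teststr. 3")));
    }
}
```

Street and house number stay mandatory in this variation, and a street name that itself contains digits is still split at the first number.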



Answer 4:


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressDemo {
    public static void main(String[] args) {
        // "Strase" is spelled without ß because the pattern below is ASCII-only
        String data = "Name der Strase 25a 88489 Teststadt";
        String regexp = "([ a-zA-Z]+) (\\w+) (\\d+) ([a-zA-Z]+)";

        Pattern pattern = Pattern.compile(regexp);
        Matcher matcher = pattern.matcher(data);

        if (matcher.find()) {
            // Group 0 is the whole match; groups 1-4 are street, house no., zip, city
            for (int i = 0; i <= matcher.groupCount(); i++) {
                System.out.println(matcher.group(i));
            }
        } else {
            System.out.println("nothing found");
        }
    }
}

I guess it doesn't work with German umlauts, but you can fix this on your own. Anyway, it's a good starting point.

I recommend visiting this; it's a great site about regular expressions. Good luck!




Answer 5:


At first glance it looks like a simple split on whitespace would do it; however, looking closer, I notice the address always has 4 parts, and the first part can contain whitespace.

What I would do is something like this (pseudocode):

address[4] = empty
split[] = address_string.split(" ")
address[3] = split[last]        # city
address[2] = split[last - 1]    # zip
address[1] = split[last - 2]    # house no.
address[0] = join(split[first] through split[last - 3], " "), then trim()    # street

However, this will only handle one form of address. If addresses are written multiple ways it could be much more tricky.
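The pseudocode above could be written in Java roughly like this. It is a minimal sketch: the class name `SplitFromEnd` is hypothetical, and returning null for inputs with fewer than four tokens is an assumption, since the pseudocode does not say what should happen then.

```java
import java.util.Arrays;

// Hypothetical helper class; the names are assumptions for illustration.
class SplitFromEnd {
    /** Returns {street, house no., zip, city}, or null if there are fewer than 4 tokens. */
    static String[] split(String address) {
        String[] tokens = address.trim().split("\\s+");
        int n = tokens.length;
        if (n < 4) {
            return null;
        }
        // The last three tokens are taken as house no., zip and city;
        // everything before them is joined back into the street name.
        String street = String.join(" ", Arrays.copyOfRange(tokens, 0, n - 3));
        return new String[] {street, tokens[n - 3], tokens[n - 2], tokens[n - 1]};
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", split("Name der Strasse 25a 88489 Teststadt")));
    }
}
```

A city written as several words, such as “Beispielort (Großer Kreis)”, shifts every field, which is exactly the limitation mentioned above.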




Answer 6:


Try this:

^[^\d]+[\d\w]+(\s)\d+(\s).*$

It captures the whitespace separators before and after the zip code as groups.

Or this one, which gives you groups for each of the address parts:

^([^\d]+)([\d\w]+)\s(\d+)\s(.*)$

I don't know Java, so I'm not sure of the exact code to use for replacing captured groups.
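In Java, captured groups can be referenced as `$1`, `$2`, ... in the replacement string of `String.replaceAll`. The sketch below uses the second pattern with one tweak that is an assumption on my part: an extra `\s` between groups 1 and 2, so that the street group does not keep its trailing space. The class name `PipeJoin` and the method `toPipes` are hypothetical.

```java
// Hypothetical helper class; the names are assumptions for illustration.
class PipeJoin {
    // Second pattern from this answer, plus a \s between groups 1 and 2
    private static final String REGEX = "^([^\\d]+)\\s([\\d\\w]+)\\s(\\d+)\\s(.*)$";

    /** Rewrites "street no zip city" as "street|no|zip|city"; other input is returned unchanged. */
    static String toPipes(String address) {
        // $1..$4 in the replacement refer to the four captured groups
        return address.replaceAll(REGEX, "$1|$2|$3|$4");
    }

    public static void main(String[] args) {
        // prints Teststr.|3|88489|Beispielort
        System.out.println(toPipes("Teststr. 3 88489 Beispielort"));
    }
}
```

Because group 1 excludes digits, a street name that contains a number will not be split as intended.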



Source: https://stackoverflow.com/questions/9863630/regex-for-splitting-a-german-address-into-its-parts
