Latin Regex with symbols

问题

I need split a text and get only words, numbers and hyphenated composed-words. I need to get latin words also, then I used \p{L}, which gives me é, ú ü ã, and so forth. The example is:

String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% "  ' : ; > < / \  | ,  here some is wrong… * + () e -"

Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );

What is wrong with this regex? Why it matches symbols like "(", "+", "-", "*" and "|"?

Some of results are:

dresse     // OK
sud-est    // OK
occident)  // WRONG
987        // OK
()         // WRONG
(a         // WRONG
*          // WRONG
-          // WRONG
+          // WRONG
(          // WRONG
|          // WRONG

The regex explanation is:

[^\p{L}+(\-\p{L}+)*\d]+

 * Word separator will be:
 *     [^  ...  ]  No sequence in:
 *     \p{L}+        Any latin letter
 *     (\-\p{L}+)*   Optionally hyphenated
 *     \d            or numbers
 *     [ ... ]+      once or more.

回答1:

If my understanding of your requirement is correct, this regex will match what you want:

"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"

It will match:

A contiguous sequence of Unicode Latin script characters. I restrict it to Latin script, since \p{L} will match letter in any script. Change \\p{IsLatin} to \\pL if your version of Java doesn't support the syntax.
Or several such sequences, hyphenated
Or a contiguous sequence of decimal digits (0-9)

The regex above is to be used by calling Pattern.compile, and call matcher(String input) to obtain a Matcher object, and use a loop to find matches.

Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);

while (matcher.find()) {
    System.out.println(matcher.group());
}

If you want to allow words with apostrophe ':

"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"

I also escape - in the character class ['\\-] just in case you want to add more. Actually - doesn't need escaping if it is the first or last in the character class, but I escape it anyway just to be safe.

回答2:

If the opening bracket of a character class is followed by a ^ then the characters listed inside the class are not allowed. So your regex allows anything except unicode letter,+,(,-,),* and digit occurring one or more times.

Note that characters like +,(,),* etc. don't have any special meaning inside a character class.

What pattern.split does is that it splits the string at patterns matching the regex. Your regex matches whitespace and hence split occurs at each occurrence of one or more whitespace. So result will be this.

For example consider this

Pattern pattern = Pattern.compile("a");
    for (String s : pattern.split("sda  a  f  g")) {
        System.out.println("==>"+s);
    }

Output will be

==>sd

==>

==> f g

回答3:

A regular expression set description with [] can contain only letters, classes (\p{...}), sequences (e.g. a-z) and the complement symbol (^). You have to place the other magic characters you are using (+*()) outside the [ ] block.

来源：https://stackoverflow.com/questions/14833001/latin-regex-with-symbols

标签

java

regex

split

symbols

latin