Latin Regex with symbols

Submitted by 帅比萌擦擦* on 2021-02-05 05:51:27

Question


I need to split a text and keep only words, numbers, and hyphenated compound words. I also need to match Latin words with accented letters, so I used \p{L}, which matches é, ú, ü, ã, and so forth. For example:

String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \"  ' : ; > < / \\  | ,  here some is wrong… * + () e -";

Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );

What is wrong with this regex? Why does it leave symbols like "(", "+", "-", "*" and "|" in the results?

Some of the results are:

dresse     // OK
sud-est    // OK
occident)  // WRONG
987        // OK
()         // WRONG
(a         // WRONG
*          // WRONG
-          // WRONG
+          // WRONG
(          // WRONG
|          // WRONG

This is how I intended the regex to work:

[^\p{L}+(\-\p{L}+)*\d]+

 * Word separator will be:
 *     [^  ...  ]  No sequence in:
 *     \p{L}+        Any latin letter
 *     (\-\p{L}+)*   Optionally hyphenated
 *     \d            or numbers
 *     [ ... ]+      once or more.

Answer 1:


If my understanding of your requirement is correct, this regex will match what you want:

"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"

It will match:

  • A contiguous sequence of Unicode Latin-script characters. I restrict it to Latin script, since \p{L} matches letters in any script. Change \\p{IsLatin} to \\pL if your version of Java doesn't support the syntax (script names are supported from Java 7 onward).
  • Or several such sequences joined by hyphens
  • Or a contiguous sequence of decimal digits (0-9)

To use the regex above, compile it with Pattern.compile, call matcher(String input) to obtain a Matcher object, and loop with find() to retrieve the matches.

Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);

while (matcher.find()) {
    System.out.println(matcher.group());
}
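
For reference, here is a self-contained sketch of the same loop with the imports it needs. The class name LatinWordExtractor, the method extractWords, and the shortened sample text are made up for illustration; only the pattern comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this example.
class LatinWordExtractor {
    // Latin-script words, optionally hyphenated, or runs of digits.
    private static final Pattern WORD =
            Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");

    // Collects every match instead of splitting on separators.
    static List<String> extractWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // \u00ee is the letter i with circumflex, written as an escape
        // so the example does not depend on the source file encoding.
        String text = "? 987 (A la pointe sud-est de l'\u00eele) : + - | Notre-Dame en 1330 e -";
        for (String word : extractWords(text)) {
            System.out.println(word);
        }
    }
}
```

Because this finds tokens rather than splitting on separators, stray symbols such as "(", "+" and a lone "-" never show up in the result. Note that without the apostrophe variant, "l'île" comes out as two separate matches.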

If you also want to allow words containing an apostrophe ('):

"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"

I also escape - in the character class ['\\-] just in case you want to add more. Actually - doesn't need escaping if it is the first or last in the character class, but I escape it anyway just to be safe.
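
A quick way to check the apostrophe variant is the sketch below. The class name ApostropheWords and the sample sentence are assumptions for the example. It only handles the ASCII apostrophe ('); a typographic apostrophe (U+2019) would need to be added to the character class separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption for this example.
class ApostropheWords {
    // Word parts may be joined by either an apostrophe or a hyphen.
    private static final Pattern WORD =
            Pattern.compile("\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+");

    static List<String> extractWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // A standalone ' or - has no letters around it, so it is skipped.
        String text = "l'une des cathedrales d'occident) ' - sud-est";
        System.out.println(extractWords(text));
    }
}
```

The joiner must sit between two letter runs, so a trailing or isolated apostrophe or hyphen is never part of a match.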




Answer 2:


If the opening bracket of a character class is followed by a ^, the class matches any character that is not listed inside it. So your regex matches one or more characters that are not a Unicode letter, a digit, or one of the literal characters +, (, ), - and *.

Note that characters like +,(,),* etc. don't have any special meaning inside a character class.
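
To make this concrete, the sketch below compares the pattern from the question with a shorter character class that lists the same characters once each. The class name CharClassLiterals, the helper splitWith, and the sample text are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption for this example.
class CharClassLiterals {
    // The pattern from the question.
    static final Pattern ORIGINAL =
            Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
    // The same negated set with duplicates removed:
    // letters, digits, and the literal characters + ( ) * -
    static final Pattern EQUIVALENT =
            Pattern.compile("[^\\p{L}\\d+()*\\-]+");

    static String[] splitWith(Pattern pattern, String text) {
        return pattern.split(text);
    }

    public static void main(String[] args) {
        String text = "occident) : ! (a * + - here";
        // Both lines print the same tokens, because both patterns
        // describe the same set of separator characters.
        System.out.println(Arrays.toString(splitWith(ORIGINAL, text)));
        System.out.println(Arrays.toString(splitWith(EQUIVALENT, text)));
    }
}
```

Since +, (, ), * and - are inside the negated class, they are never treated as separators, which is why tokens such as "occident)" and "(a" survive the split.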

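pattern.split splits the string wherever the regex matches and discards the matched text. Your regex matches runs of characters outside the listed set, such as whitespace, commas and colons, so the split happens at each such run. Characters like (, ), +, - and * are inside the set, so they are never treated as separators and stay in the resulting tokens.

As a simpler example of how split works, consider this: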

Pattern pattern = Pattern.compile("a");
for (String s : pattern.split("sda  a  f  g")) {
    System.out.println("==>" + s);
}

The output will be (the second line is "==>" followed by two spaces):

==>sd
==>  
==>  f  g




Answer 3:


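A character class written with [] describes a set of single characters: literal characters, classes (\p{...}), ranges (e.g. a-z) and the complement symbol (^). The other metacharacters you are using (+*()) lose their special meaning inside the brackets and are treated as literals, so quantifiers and groups have to be placed outside the [ ] block.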



Source: https://stackoverflow.com/questions/14833001/latin-regex-with-symbols
