Matching (e.g.) a Unicode letter with Java regexps

后悔当初 2020-12-09 17:50

There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many

3 answers
  •  小蘑菇 (original poster)
    2020-12-09 18:32

    Here you have a very nice explanation:

    http://www.regular-expressions.info/unicode.html

    Some hints:

    "Java and .NET unfortunately do not support \X (yet). Use \P{M}\p{M}* as a substitute. To match any number of graphemes, use (?:\P{M}\p{M}*)+ instead of \X+."
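To make the quoted substitute concrete, here is a minimal sketch in Java. The class and method names (GraphemeDemo, graphemes, isSingleLetter) are made up for illustration and do not come from the linked article. The \p{L}\p{M}* pattern is my own adaptation of the same idea to "one letter plus its combining marks". As far as I know, Java 9 and later also accept \X directly, so treat the substitute as a fallback for older runtimes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class GraphemeDemo {
    // One code point that is not a mark, followed by any number of combining marks.
    private static final Pattern GRAPHEME = Pattern.compile("\\P{M}\\p{M}*");

    // Assumption: a "letter" is a code point in category L plus its combining marks.
    private static final Pattern LETTER = Pattern.compile("\\p{L}\\p{M}*");

    // Splits the text into the chunks matched by the grapheme substitute.
    static List<String> graphemes(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = GRAPHEME.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // True when the whole input is exactly one letter, composed or decomposed.
    static boolean isSingleLetter(String text) {
        return LETTER.matcher(text).matches();
    }

    public static void main(String[] args) {
        String decomposed = "a\u0300b"; // 'a' + COMBINING GRAVE ACCENT + 'b'
        System.out.println(graphemes(decomposed).size());  // 2 chunks, not 3 chars
        System.out.println(isSingleLetter("e\u0301"));     // decomposed e-acute
        System.out.println("\u00E9".matches("[a-zA-Z]"));  // the ASCII class misses it
    }
}
```

Note that \P{M}\p{M}* is only an approximation of a grapheme: it also treats digits, punctuation and line breaks as base characters.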

    "In Java, the regex token \uFFFF only matches the specified code point, even when you turned on canonical equivalence. However, the same syntax \uFFFF is also used to insert Unicode characters into literal strings in the Java source code. Pattern.compile("\u00E0") will match both the single-code-point and double-code-point encodings of à, while Pattern.compile("\\u00E0") matches only the single-code-point version. Remember that when writing a regex as a Java string literal, backslashes must be escaped. The former Java code compiles the regex à, while the latter compiles \u00E0. Depending on what you're doing, the difference may be significant."
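The difference between the two string literals can be seen without any regex at all, by looking at what the regex engine actually receives. The sketch below uses made-up names (EscapeDemo, COMPILER_ESCAPE, REGEX_ESCAPE, matches). It prints the results under Pattern.CANON_EQ rather than stating them, because the handling of canonical equivalence has changed between Java versions and I would not rely on one fixed outcome.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class EscapeDemo {
    // The Java compiler resolves this escape, so the regex engine
    // receives a one-character pattern: the letter itself.
    static final String COMPILER_ESCAPE = "\u00E0";

    // The doubled backslash survives compilation, so the regex engine
    // receives six characters (backslash, u, 0, 0, E, 0) and decodes them itself.
    static final String REGEX_ESCAPE = "\\u00E0";

    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String composed = "\u00E0";    // single code point
        String decomposed = "a\u0300"; // 'a' + COMBINING GRAVE ACCENT
        for (String regex : new String[] {COMPILER_ESCAPE, REGEX_ESCAPE}) {
            System.out.printf("pattern length %d: composed=%b decomposed=%b%n",
                    regex.length(),
                    matches(regex, Pattern.CANON_EQ, composed),
                    matches(regex, Pattern.CANON_EQ, decomposed));
        }
    }
}
```

Without Pattern.CANON_EQ (flags set to 0), both patterns match only the single-code-point form, so the distinction matters only once canonical equivalence is switched on.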
