Matching (e.g.) a Unicode letter with Java regexps

后悔当初 2020-12-09 17:50

There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many

3 answers
  •  暗喜 (original poster)
    2020-12-09 18:22

    Quoting from the JavaDoc of java.util.regex.Pattern:

    Unicode support

    This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.

    Unicode escape sequences such as \u2014 in Java source code are processed as described in §3.3 of the Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
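To make the point about the two escape forms concrete, here is a minimal sketch. The class name `UnicodeEscapeDemo` and the helper `matchesEmDash` are made up for illustration and are not part of the JavaDoc. It checks that the two strings differ as Java strings yet behave the same once compiled as patterns.

```java
import java.util.regex.Pattern;

final class UnicodeEscapeDemo {
    // Hypothetical helper: compiles the given regex and tests it against
    // a one-character input holding U+2014.
    static boolean matchesEmDash(String regex) {
        return Pattern.compile(regex).matcher("\u2014").matches();
    }

    public static void main(String[] args) {
        // The Java compiler resolves this escape, so the string holds one char.
        String resolvedByCompiler = "\u2014";
        // Here the string holds six chars (backslash, u, 2, 0, 1, 4);
        // the regex parser resolves the escape when the pattern is compiled.
        String resolvedByRegex = "\\u2014";

        System.out.println(resolvedByCompiler.equals(resolvedByRegex)); // false
        System.out.println(resolvedByCompiler.length());                // 1
        System.out.println(resolvedByRegex.length());                   // 6
        System.out.println(matchesEmDash(resolvedByCompiler));          // true
        System.out.println(matchesEmDash(resolvedByRegex));             // true
    }
}
```

The second form is the one that matters when a pattern is read from a file or typed by a user, since no Java compiler is involved there.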

    Unicode blocks and categories are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property. Blocks are specified with the prefix In, as in InMongolian. Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Blocks and categories can be used both inside and outside of a character class.
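Applied to the original question, the category construct can replace [a-zA-Z]. The following is a small sketch rather than a definitive recipe; the class and method names are invented for the example, and the sample strings are written with Unicode escapes so the source file stays ASCII-only.

```java
import java.util.regex.Pattern;

final class UnicodeLetterDemo {
    // \p{L} matches any character in the Unicode general category "Letter".
    private static final Pattern UNICODE_LETTERS = Pattern.compile("\\p{L}+");
    // The ASCII-only class that many answers assume is enough.
    private static final Pattern ASCII_LETTERS = Pattern.compile("[a-zA-Z]+");
    // \P{L} is the complement: any character that is not a letter.
    private static final Pattern NON_LETTER = Pattern.compile("\\P{L}");

    static boolean isAllLetters(String s) {
        return UNICODE_LETTERS.matcher(s).matches();
    }

    static boolean isAllAsciiLetters(String s) {
        return ASCII_LETTERS.matcher(s).matches();
    }

    static boolean containsNonLetter(String s) {
        return NON_LETTER.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",
            "h\u00e9llo",                            // Latin with an accented e
            "\u043f\u0440\u0438\u0432\u0435\u0442",  // Cyrillic
            "\u65e5\u672c\u8a9e",                    // CJK ideographs
            "abc123"                                 // contains digits
        };
        for (String s : samples) {
            System.out.println(s
                + " unicode=" + isAllLetters(s)
                + " ascii=" + isAllAsciiLetters(s)
                + " nonLetter=" + containsNonLetter(s));
        }
    }
}
```

Only "hello" passes the ASCII check, while every sample except "abc123" passes the \p{L} check. Note that \p{L} covers letters only; combining marks (category M) are a separate category and would need to be added if decomposed text must match.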

    The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative. The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.
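The link between block names in a pattern and UnicodeBlock.forName can be checked directly. This is a sketch under the assumption that U+1820 (MONGOLIAN LETTER A) is a reasonable sample character; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodeBlockDemo {
    // Block names take the "In" prefix inside \p{...}.
    private static final Pattern MONGOLIAN = Pattern.compile("\\p{InMongolian}");

    static boolean isInMongolianBlock(String oneChar) {
        return MONGOLIAN.matcher(oneChar).matches();
    }

    public static void main(String[] args) {
        // The same block name, without the "In" prefix, resolves via forName.
        Character.UnicodeBlock block = Character.UnicodeBlock.forName("Mongolian");
        System.out.println(block == Character.UnicodeBlock.MONGOLIAN);  // true

        // U+1820 lies in the Mongolian block; 'a' lies in Basic Latin.
        System.out.println(isInMongolianBlock("\u1820"));  // true
        System.out.println(isInMongolianBlock("a"));       // false
    }
}
```

A name that forName rejects is also rejected when the pattern is compiled, so trying a name with forName first is a quick way to find out whether a block can be used in \p{In...}.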
