Match whitespace but not newlines

忘掉有多难 2020-11-22 15:57

I sometimes want to match whitespace but not newlines.

So far I've been resorting to [ \t]. Is there a less awkward way?

6 answers
  •  [愿得一人]
    2020-11-22 16:44

    What you are looking for is the POSIX blank character class. In Perl it is referenced as:

    [[:blank:]]
    

    In Java (don't forget to enable UNICODE_CHARACTER_CLASS):

    \p{Blank}
    

    Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).
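    The difference between the ASCII-only default and the Unicode-aware variants can be seen in a small Java sketch. The class and field names here (BlankDemo, ASCII_BLANK, and so on) are made up for illustration; only Pattern, the UNICODE_CHARACTER_CLASS flag, \p{Blank}, and \h (Java 8 and later) come from java.util.regex itself. U+00A0 NO-BREAK SPACE is used as the non-ASCII test character.

```java
import java.util.regex.Pattern;

public class BlankDemo {
    // Default mode: \p{Blank} is ASCII-only, i.e. space or tab.
    static final Pattern ASCII_BLANK = Pattern.compile("\\p{Blank}");

    // With the flag, \p{Blank} follows the Unicode definition.
    static final Pattern UNICODE_BLANK =
            Pattern.compile("\\p{Blank}", Pattern.UNICODE_CHARACTER_CLASS);

    // \h is Unicode-aware without any flag (Java 8+).
    static final Pattern HORIZONTAL = Pattern.compile("\\h");

    // True if the whole string is matched by the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // NO-BREAK SPACE, outside ASCII
        System.out.println(matches(ASCII_BLANK, "\t"));   // true
        System.out.println(matches(ASCII_BLANK, nbsp));   // false
        System.out.println(matches(UNICODE_BLANK, nbsp)); // true
        System.out.println(matches(HORIZONTAL, nbsp));    // true
        System.out.println(matches(UNICODE_BLANK, "\n")); // false
    }
}
```

    In every mode the newline stays unmatched, which is the property the question asks for.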

    But the problem is that even sticking to Unicode doesn't solve the issue 100%. Consider the following characters which are not considered whitespace in Unicode:

    • U+180E MONGOLIAN VOWEL SEPARATOR

    • U+200B ZERO WIDTH SPACE

    • U+200C ZERO WIDTH NON-JOINER

    • U+200D ZERO WIDTH JOINER

    • U+2060 WORD JOINER

    • U+FEFF ZERO WIDTH NON-BREAKING SPACE

      Taken from https://en.wikipedia.org/wiki/White-space_character
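    The list above can be checked against the Unicode White_Space property from Java, which exposes it in regular expressions as \p{IsWhite_Space}. This is only a sketch: the class and method names are hypothetical, and U+180E is deliberately left out because its classification changed between Unicode versions, so the result for it depends on the JDK in use.

```java
import java.util.regex.Pattern;

public class WhiteSpacePropertyCheck {
    // \p{IsWhite_Space} tests the Unicode White_Space binary property.
    static final Pattern WHITE_SPACE = Pattern.compile("\\p{IsWhite_Space}");

    // True if the single code point has the White_Space property.
    static boolean isUnicodeWhiteSpace(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return WHITE_SPACE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+0020 SPACE is included as a control that should report true.
        int[] codePoints = {0x0020, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF};
        for (int cp : codePoints) {
            System.out.printf("U+%04X White_Space=%b%n", cp, isUnicodeWhiteSpace(cp));
        }
    }
}
```

    Only the control entry U+0020 should report true; the zero-width and joiner characters should all report false.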

    The aforementioned Mongolian vowel separator isn't included, probably for a good reason. It, along with U+200C and U+200D, occurs within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. These characters are more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NON-BREAKING SPACE (if it is used as anything other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.

    In Java:

    public static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";
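
    One possible use of such a class is collapsing runs of horizontal whitespace while leaving line breaks alone. The sketch below assumes the pattern is compiled with UNICODE_CHARACTER_CLASS so that \p{Blank} is not ASCII-only, and it writes the last escape as \uFEFF to match the U+FEFF entry in the list above. The class name, the RUN field, and the collapse helper are hypothetical.

```java
import java.util.regex.Pattern;

public class HorizontalWhitespaceDemo {
    // Blank characters plus ZERO WIDTH SPACE, WORD JOINER and U+FEFF.
    static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";

    // One or more horizontal whitespace characters; the flag makes
    // \p{Blank} use the Unicode definition instead of space and tab only.
    static final Pattern RUN =
            Pattern.compile(HORIZONTAL_WHITESPACE + "+", Pattern.UNICODE_CHARACTER_CLASS);

    // Replaces each run with a single space; newlines are never touched.
    static String collapse(String text) {
        return RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String input = "a \t\u200Bb\nc\u00A0\u00A0d";
        System.out.println(collapse(input));
    }
}
```

    For the sample input this should print "a b" on the first line and "c d" on the second, since the newline between them is not part of the class.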
    
