Why is /[\w-+]/ a valid regex but /[\w-+]/u invalid?

北战南征 submitted on 2021-01-29 11:20:57

Question


If I type /[\w-+]/ in the Chrome console, it accepts it. I get a regex object I can use to test strings as usual. But if I type /[\w-+]/u, it says VM112:1 Uncaught SyntaxError: Invalid regular expression: /[\w-+]/: Invalid character class.

In Firefox, /[\w-+]/ works fine, but if I type /[\w-+]/u in the console, it just goes to the next line as if I typed an incomplete statement. If I try to force it to create the regex by running eval('/[\w-+]/u'), it tells me SyntaxError: invalid range in character class.

Why does the u flag make the regex invalid? The MDN RegExp documentation says u enables some Unicode features, but I don't see anything about how it affects ranges in character classes.


Answer 1:


Within a RegExp character class, a hyphen-minus character (your standard keyboard dash) denotes a range of character codes between the two characters it separates. The exceptions are when it is escaped (\-) or when it does not separate two characters, because it is either the last character of the class or the first (after the optional caret that inverts the class).
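A minimal sketch of those rules, using small classes made up for illustration (the variable names are not from the question):

```javascript
// Hyphen between two characters: a range (a, b, c).
const range = /^[a-c]$/;
// Hyphen escaped: a literal, so the class is a, '-', c.
const escaped = /^[a\-c]$/;
// Hyphen last or first: also a literal.
const last = /^[ac-]$/;
const first = /^[-ac]$/;
// Hyphen right after the inverting caret: still a literal.
const afterCaret = /^[^-ac]$/;

console.log(range.test('b'));      // true
console.log(escaped.test('b'));    // false
console.log(escaped.test('-'));    // true
console.log(last.test('-'));       // true
console.log(first.test('-'));      // true
console.log(afterCaret.test('-')); // false
```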

Three examples of character ranges: a simple example, an advanced example, and a bug:

  • [a-z] is pretty straightforward because it works the way we expect it to, though this is only because the character codes of the lowercase letters happen to be sequential. Another way of writing this is [\x61-\x7a].
  • [!-~] is not at all straightforward, at least until you look at a character map and learn that ! is the first printable ASCII character after the space and ~ is the last (of "lower ASCII"). It is a way of saying "all printable lower ASCII characters except the space" and is equivalent to [\x21-\x7e].
  • [A-z] has a switched case in it. You may dislike the fact that this range (which is [\x41-\x7a]) also accepts the six non-letter characters that sit between Z and a.
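The three ranges above can be checked with a short sketch. The helper asciiMatches is a name invented here; it simply tries every ASCII code against a pattern:

```javascript
// Collect every ASCII character (0x00-0x7f) that a pattern matches.
function asciiMatches(re) {
  let out = '';
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    if (re.test(ch)) out += ch;
  }
  return out;
}

// [a-z] and [\x61-\x7a] accept the same characters.
console.log(asciiMatches(/[a-z]/) === asciiMatches(/[\x61-\x7a]/)); // true

// [!-~] covers 0x21 through 0x7e.
console.log(asciiMatches(/[!-~]/).length); // 94

// [A-z] lets through the six characters between 'Z' and 'a'.
const nonLetters = asciiMatches(/[A-z]/).replace(/[A-Za-z]/g, '');
console.log(nonLetters); // [\]^_`
```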


Now let's examine your regex, /[\w-+]/u. Regex101 has a more informative error: "You can not create a range with a shorthand escape sequences"

Since \w is not itself a character (but rather a collection of characters), an abutting dash must either be taken literally or be rejected as an error. Without the flag, the dash is taken literally. When you add the u flag to trigger fullUnicode, you enter a stricter mode and therefore get an error.
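A small sketch of the lenient behaviour without the u flag. The variable name legacy is made up here; the comma is used as a probe because it sits between + and - in ASCII, so it would match only if some range had been formed:

```javascript
const legacy = /[\w-+]/; // no u flag: the hyphen is taken literally

console.log(legacy.test('a')); // true  (from \w)
console.log(legacy.test('-')); // true  (literal hyphen)
console.log(legacy.test('+')); // true  (literal plus)
console.log(legacy.test(',')); // false (no range between '+' and '-')
```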

The error I get from "foo".match(/[\w-+]/u) in Firefox 64.0 is:

SyntaxError: character class escape cannot be used in class range in regular expression

This is slightly more informative than the error you got since it actually tells you the problem is with the escape (though not why it's a problem).

ECMAScript 2015's RegExpBuiltinExec() logic contains these steps:

  1. If fullUnicode is true, then
    1. e is an index into the Input character list, derived from S, matched by matcher. Let eUTF be the smallest index into S that corresponds to the character at element e of Input. If e is greater than or equal to the length of Input, then eUTF is the number of code units in S.
    2. Let e be eUTF.

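These steps only convert the end index of a match from code points back to UTF-16 code units, so they do not explain the error by themselves. They do show that fullUnicode switches the engine onto a separate, stricter path; the check that actually rejects \w-+ is in the character class semantics (21.2.2.15.1, CharacterRange).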
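As a side note on the quoted steps, their observable effect is on lastIndex: with the u flag the matcher works on code points, and the end index is mapped back to code units. A minimal sketch, with a sample string chosen here for illustration:

```javascript
// U+1F600 is one code point but two UTF-16 code units.
const text = '\u{1F600}a';

const withU = /./gu;
const withoutU = /./g;

withU.exec(text);    // consumes the whole code point
withoutU.exec(text); // consumes only the first code unit

console.log(withU.lastIndex);    // 2
console.log(withoutU.lastIndex); // 1
```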


The solution is to either escape your hyphen-minus or else put it last (or first):

/[\w\-+]/u or /[\w+-]/u or /[-\w+]/u. I personally always put it last.
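A quick sketch confirming that the three spellings compile under the u flag and accept the same characters (anchors are added here so that each test checks exactly one character):

```javascript
const fixes = [/^[\w\-+]$/u, /^[\w+-]$/u, /^[-\w+]$/u];

for (const re of fixes) {
  const acceptsAll = ['a', '-', '+'].every((ch) => re.test(ch));
  // Logs the pattern, then true (a, -, + all match), then false (',' does not).
  console.log(re.source, acceptsAll, re.test(','));
}
```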




Answer 2:


There is a report for this: V8 implementation: does unicode property escapes behavior in character classes range differ from other classes intentionally?.


I took a look at V8 source code (regexp-parser.cc) and found this:

if (is_class_1 || is_class_2) {
    // Either end is an escaped character class. Treat the '-' verbatim.
    if (unicode()) {
       // ES2015 21.2.2.15.1 step 1.
       return ReportError(CStrVector(kRangeInvalid));
    }

kRangeInvalid is a constant that holds the message "Invalid character class".

The comment refers to ES2015 21.2.2.15.1 (CharacterRange), step 1:

If A does not contain exactly one character or B does not contain exactly one character, throw a SyntaxError exception.
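The rule can be observed from script by building the pattern with the RegExp constructor and catching the exception. This is only a sketch: compileError is a hypothetical helper, not part of V8 or the spec, and the backslashes are doubled because the pattern is passed as a string.

```javascript
// Hypothetical helper: returns null if the pattern compiles,
// otherwise the name of the error that was thrown.
function compileError(source, flags) {
  try {
    new RegExp(source, flags);
    return null;
  } catch (err) {
    return err.name;
  }
}

console.log(compileError('[\\w-+]', ''));  // null
console.log(compileError('[\\w-+]', 'u')); // SyntaxError
console.log(compileError('[a-\\d]', 'u')); // SyntaxError
console.log(compileError('[\\w+-]', 'u')); // null
```

In the failing cases one end of the range (\w or \d) stands for more than one character, which is the condition step 1 rejects.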



Source: https://stackoverflow.com/questions/54205197/why-is-w-a-valid-regex-but-w-u-invalid
