Java regex alternation operator “|” behavior seems broken

一整个雨季 2020-12-06 02:11

Trying to write a regex matcher for Roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation o

2 answers
  •  暖寄归人
    2020-12-06 02:22

    I think a pattern that will work is something like

    IV|I{1,3}

    See the "greedy quantifiers" section at http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
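
A minimal sketch of why the order of the alternatives matters in java.util.regex. The class name RomanAlternationDemo and the helper firstMatch are made up for illustration, and the reversed pattern I{1,3}|IV is only an assumed example of an ordering that misbehaves, not necessarily the pattern from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RomanAlternationDemo {
    // Hypothetical helper: returns the first substring that find() reports,
    // or null if the pattern does not match anywhere in the input.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Alternatives are tried left to right, so I{1,3} succeeds on "I"
        // and the IV branch is never attempted.
        System.out.println(firstMatch("I{1,3}|IV", "IV"));  // I

        // With IV listed first, the longer numeral is found.
        System.out.println(firstMatch("IV|I{1,3}", "IV"));  // IV

        // IV fails on "III", so the engine falls through to I{1,3},
        // whose greedy quantifier consumes all three characters.
        System.out.println(firstMatch("IV|I{1,3}", "III")); // III
    }
}
```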

    Edit: in response to your comment, I think the general problem is that you keep reaching for alternation where it is not the right tool. In your new example you are trying to match "six" or "sixty"; the right pattern is six(ty)?, not six|sixty. In general, whenever one member of an alternation group is a prefix of another, you should rewrite the regular expression to eliminate the overlap. Otherwise you can't really complain that the engine is doing the wrong thing, since the semantics of alternation say nothing about a longest match: Java tries the alternatives left to right, and the first one that lets the overall match succeed wins.
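
The prefix rewrite can be checked with a short sketch like the following. The class name PrefixAlternationDemo and the helper firstMatch are hypothetical, and the input string "sixty" is an assumed test case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixAlternationDemo {
    // Hypothetical helper: first substring reported by find(), or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // "six" is a prefix of "sixty" and is listed first, so it wins.
        System.out.println(firstMatch("six|sixty", "sixty"));   // six

        // Factoring out the common prefix: the greedy optional group
        // takes "ty" whenever it is present.
        System.out.println(firstMatch("six(ty)?", "sixty"));    // sixty
        System.out.println(firstMatch("six(ty)?", "six"));      // six

        // matches() must consume the whole input, so the engine backtracks
        // into the second alternative and the call still succeeds.
        System.out.println(Pattern.matches("six|sixty", "sixty")); // true
    }
}
```

The last line shows that the first-alternative behavior only surfaces with find() or lookingAt(); when the whole input has to match, backtracking reaches the longer alternative anyway.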

    Edit 2: the literal answer to your question is no, it can't be forced; java.util.regex has no flag that makes alternation prefer the longest alternative (and my commentary is that you shouldn't ever need this behavior).

    Edit 3: thinking more about the subject, it occurred to me that an alternation pattern where one string is the prefix of another is undesirable for another reason; namely, it will be slower unless the underlying automaton is constructed to take prefixes into account (and given that Java picks the first match in the pattern, I would guess that this is not the case).
