Java regex alternation operator “|” behavior seems broken

一整个雨季 2020-12-06 02:11

Trying to write a regex matcher for roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation o

2 answers
  •  被撕碎了的回忆
    2020-12-06 02:44

    No, it's behaving correctly. Java uses an NFA (regex-directed) engine, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. In such an engine the alternatives are tried left to right, and the alternation stops at the first one that matches; it does not hold out for the longest match.
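To see this behavior concretely, here is a minimal sketch using `java.util.regex`. The class name `AlternationDemo` and the helper `firstMatch` are made up for illustration, and the pattern `I|II|III|IV` is only a guess at the kind of alternation the question is about.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Hypothetical helper: returns the first substring of input that regex
    // matches, or null if there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Alternatives are tried left to right. "I" succeeds first,
        // so the match ends after one character.
        System.out.println(firstMatch("I|II|III|IV", "III")); // I

        // With the longer alternatives listed first, the same input
        // yields the longer match.
        System.out.println(firstMatch("III|II|IV|I", "III")); // III
    }
}
```

The second call shows that the result depends on the order of the alternatives, not on their length, which is what you would expect from a regex-directed engine.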

    You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).

    "\\b(I|II|III|IV)\\b"
    

    EDIT: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.
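The word-boundary pattern suggested above can be checked with a short sketch like the following. The class name `BoundaryDemo`, the helper `firstNumeral`, and the sample inputs are assumptions for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // The pattern from the answer: the trailing \b cannot match in the
    // middle of a run of letters, so the engine has to backtrack into the
    // alternation until the whole token is consumed.
    static final Pattern NUMERAL = Pattern.compile("\\b(I|II|III|IV)\\b");

    // Hypothetical helper: returns the first numeral found in input, or null.
    static String firstNumeral(String input) {
        Matcher m = NUMERAL.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumeral("III"));               // III
        System.out.println(firstNumeral("IV"));                // IV
        System.out.println(firstNumeral("Chapter II begins")); // II
    }
}
```

For "III", the engine first tries "I", finds no word boundary after it, backtracks to "II", fails again, and only then succeeds with "III". If the goal is to validate an entire string rather than find a token inside a longer text, `Matcher.matches()` or `String.matches()` should have a similar effect, since they require the pattern to cover the whole input.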
