regex-alternation

Regex Alternation Order

蹲街弑〆低调 submitted on 2021-01-27 06:03:16
Question: I set up a complex regex to extract data from a page of text. For some reason the order of the alternation is not what I expect. A simple example would be: ((13th|(Executive |Residential)|((\w+) ){1,3})Floor) Put simply, I am trying to get either a floor number or a known named floor and, as a backup, to capture 1-3 unknown words followed by Floor to review later. (I in fact use a group name to identify this, but didn't want to confuse the issue.) The issue is if the string is on the
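One behaviour that may be relevant to this kind of pattern: a backtracking engine reports the match that starts earliest in the text, and alternation order only decides between candidates that start at the same position. The sketch below uses Python's re module as a stand-in, since the snippet does not name its engine. The trailing spaces inside the alternatives and the input string are assumptions made for illustration, not taken from the question.

```python
import re

# Simplified variant of the pattern from the question. The trailing spaces
# inside each alternative are an assumption made for this sketch.
PATTERN = re.compile(r"(13th |Executive |Residential |(?:\w+ ){1,3})Floor")

# Hypothetical input text, not taken from the question.
text = "is on the 13th Floor"

# search() tries each start position from left to right and stops at the
# first position where any alternative succeeds. The fallback alternative
# can start at "on", which is earlier than "13th", so it wins.
found = PATTERN.search(text)
print(found.group(0))  # on the 13th Floor

# When every candidate starts at the same position, the first listed
# alternative that leads to a full match is the one reported.
anchored = PATTERN.match("13th Floor")
print(repr(anchored.group(1)))  # '13th '
```

If this is the cause, reordering the alternatives would not help, because the fallback wins on start position rather than on order.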

Regex Alternation Order

白昼怎懂夜的黑 submitted on 2021-01-27 06:01:37
Question: I set up a complex regex to extract data from a page of text. For some reason the order of the alternation is not what I expect. A simple example would be: ((13th|(Executive |Residential)|((\w+) ){1,3})Floor) Put simply, I am trying to get either a floor number or a known named floor and, as a backup, to capture 1-3 unknown words followed by Floor to review later. (I in fact use a group name to identify this, but didn't want to confuse the issue.) The issue is if the string is on the

Regex Alternation Order

痞子三分冷 submitted on 2021-01-27 06:01:00
Question: I set up a complex regex to extract data from a page of text. For some reason the order of the alternation is not what I expect. A simple example would be: ((13th|(Executive |Residential)|((\w+) ){1,3})Floor) Put simply, I am trying to get either a floor number or a known named floor and, as a backup, to capture 1-3 unknown words followed by Floor to review later. (I in fact use a group name to identify this, but didn't want to confuse the issue.) The issue is if the string is on the

Raku regex: Inconsistent longest token matching

若如初见. submitted on 2021-01-02 05:02:34
Question: Raku's regexes are expected to match the longest token. And in fact, this behaviour is seen in this code: raku -e "'AA' ~~ m/A {say 1}|AA {say 2}/" # 2 However, when the text is in a variable, it does not seem to work in the same way: raku -e "my $a = 'A'; my $b = 'AA'; 'AA' ~~ m/$a {say 1}|$b {say 2}/" # 1 Why do they work differently? Is there a way to use variables and still match the longest token? Answer 1: There are two things at work here. The first is the meaning of "longest token". When

Java regex alternation operator “|” behavior seems broken

六月ゝ 毕业季﹏ submitted on 2020-01-09 07:17:09
Question: Trying to write a regex matcher for Roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III". In Java, the same pattern matches "I" for "IV" and "I" for "III". It turns out Java chooses between alternation matches left to right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex
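The left-to-right behaviour described above can be reproduced with a short sketch. It uses Python's re module rather than Java; both are backtracking engines that try alternatives in the order written. The helper name first_match and the longest-first reordering are illustrative assumptions, shown as one common workaround rather than as the answer given in the original thread.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

numerals = ["I", "II", "III", "IV"]

# Alternatives in the order written in the question: "I" is tried first
# and succeeds, so the longer alternatives are never reached.
as_written = "|".join(numerals)
print(first_match(as_written, "IV"))   # I
print(first_match(as_written, "III"))  # I

# Sorting the alternatives longest-first makes an ordered-alternation
# engine behave like a longest-match engine for this fixed word list.
longest_first = "|".join(sorted(numerals, key=len, reverse=True))
print(longest_first)                      # III|II|IV|I
print(first_match(longest_first, "IV"))   # IV
print(first_match(longest_first, "III"))  # III
```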

Priority in regex manipulating

南笙酒味 submitted on 2019-12-20 06:18:33
Question: I wrote some Java code to split a string into an array of strings. First, I split the string using the regex pattern "\\,\\,|\\," and then I split it using the pattern "\\,|\\,\\," . Why is there a difference between the output of the first and the output of the second? public class Test2 { public static void main(String[] args){ String regex1 = "\\,\\,|\\,"; String regex2 = "\\,|\\,\\,"; String a = "20140608,FT141590Z0LL,0608103611018634TCKJ3301000000018667,3000054789,IDR1742630000001,80507,1000,6012,TCKJ3301,6.00E
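A minimal sketch of why the two orderings can split differently, using Python's re.split as a stand-in for Java's String.split (for this input the two behave alike). The short input string is hypothetical, chosen because it contains a doubled comma; the data string in the question is truncated and is not reproduced here.

```python
import re

# Hypothetical input containing one doubled comma.
sample = "a,,b,c"

# ",,|," tries the two-comma alternative first, so a doubled comma is
# consumed as a single delimiter.
double_first = re.split(r",,|,", sample)
print(double_first)  # ['a', 'b', 'c']

# ",|,," tries the single comma first. It always succeeds wherever the
# two-comma alternative could, so ",," is never used and a doubled comma
# becomes two delimiters with an empty field between them.
single_first = re.split(r",|,,", sample)
print(single_first)  # ['a', '', 'b', 'c']
```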

Why won't a longer token in an alternation be matched?

穿精又带淫゛_ submitted on 2019-12-17 21:59:23
Question: I am using Ruby 2.1, but the same thing can be replicated on the Rubular site. If this is my string: 儘管中國婦幼衛生監測辦公室制定的 And I do a regex match with this expression: (中國婦幼衛生監測辦公室制定|管中) I am expecting to get the longer token as a match: 中國婦幼衛生監測辦公室制定 Instead I get the second alternation as a match. As far as I know, it does work like that when the text is not in Chinese characters. If this is my string: foobar And I use this regex: (foobar|foo) The returned matching result is foobar . If the order is in the other way
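The two cases in this question differ in where each alternative can start, not in the script used. The sketch below checks this with Python's re module as a stand-in for Ruby's engine; both report the leftmost match and try alternatives in order. The strings and patterns are the ones quoted in the question.

```python
import re

text = "儘管中國婦幼衛生監測辦公室制定的"
pattern = re.compile("(中國婦幼衛生監測辦公室制定|管中)")

# The short alternative can start at index 1, one character before the
# long alternative can start at index 2, so the leftmost match is 管中.
m = pattern.search(text)
print(m.group(0), m.start())  # 管中 1

# In "foobar" both alternatives start at index 0, so order decides.
print(re.search("(foobar|foo)", "foobar").group(0))  # foobar
print(re.search("(foo|foobar)", "foobar").group(0))  # foo
```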

Why is a character class faster than alternation?

故事扮演 submitted on 2019-11-27 05:05:47
It seems that using a character class is faster than alternation in an example like: [abc] vs (a|b|c) I have heard this being recommended, and with a simple test using Time::HiRes I verified it (the alternation was ~10 times slower). Using (?:a|b|c), in case the capturing parentheses make a difference, does not change the result. But I cannot understand why. I think it is because of backtracking, but the way I see it, at each position there are 3 character comparisons, so I am not sure how backtracking comes into play for the alternation. Is it a result of how the implementation handles alternation? This is
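A comparison like the one described can be set up with a small timing harness. The sketch below uses Python's timeit instead of Perl's Time::HiRes, and the input string and helper name time_search are hypothetical. The measured gap depends heavily on the engine: some implementations may rewrite a single-character alternation into a character class internally, in which case the two timings come out close, so no particular ratio should be expected from this code.

```python
import re
import timeit

# Hypothetical input: a long run with no a, b or c, then one match at the end.
text = "xyz" * 10000 + "c"

char_class = re.compile(r"[abc]")
alternation = re.compile(r"(?:a|b|c)")

def time_search(pattern, repeats=20):
    """Total seconds for `repeats` searches over `text` (hypothetical helper)."""
    return timeit.timeit(lambda: pattern.search(text), number=repeats)

# Both patterns must find the same match for the comparison to be fair.
print(char_class.search(text).start(), alternation.search(text).start())

print("character class:", time_search(char_class))
print("alternation:    ", time_search(alternation))
```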