Why non-greedy quantifier sometimes doesn't work in Oracle regex?

白昼怎懂夜的黑 提交于 2019-12-17 16:37:26

问题


IMO, this query should return A=1,B=2,

SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*?,') as A_and_B FROM dual

But it returns whole string A=1,B=2,C=3, instead. Why?

UPD: Oracle 10.2+ required to use Perl-style metacharacters in regular expressions.

UPD2:
More clear form of my question (to avoid questions about Oracle version and availability of Perl-style regex extension):
Why ON THE SAME SYSTEM non-greedy quantifier sometimes works as expected and sometimes doesn't?

This works correctly:

regexp_substr('A=1,B=2,C=3,', 'B=.*?,')

This doesn't work:

regexp_substr('A=1,B=2,C=3,', '.*B=.*?,')

fiddle

UPD3:
Yes, it seems to be a bug.
Can anyone provide Oracle Support reaction on this issue?
Is the bug already known?
Does it have an ID?


回答1:


It's a BUG!

You are right that in Perl, 'A=1,B=2,C=3,' =~ /.*B=.*?,/; print $& prints A=1,B=2,

What you have stumbled upon is a bug that still exists in Oracle Database 11g R2. If the exact same regular expression operator (excluding the greediness modifier) appears twice in a regular expression, both occurrences will have the greediness indicated by the first appearance regardless of the greediness specified by the second one. That this is a bug is clearly demonstrated by these results:

SQL> SELECT regexp_substr('A=1,B=2,C=3,', '[^B]*B=[^Bx]*?,') as good FROM dual;

GOOD
--------
A=1,B=2,

SQL> SELECT regexp_substr('A=1,B=2,C=3,', '[^B]*B=[^B]*?,') as bad FROM dual;

BAD
-----------
A=1,B=2,C=3,

The only difference between the two regular expressions is that the "good" one excludes 'x' as a possible match in the second matching list. Since 'x' does not appear in the target string, excluding it should make no difference, but as you can see, removing the 'x' makes a big difference. That has to be a bug.

Here are some more examples from Oracle 11.2: (SQL Fiddle with even more examples)

SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*?,')  FROM dual; =>  A=1,B=2,C=3,
SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.*,')   FROM dual; =>  A=1,B=2,C=3,
SELECT regexp_substr('A=1,B=2,C=3,', '.*?B=.*?,') FROM dual; =>  A=1,B=2,
SELECT regexp_substr('A=1,B=2,C=3,', '.*?B=.*,')  FROM dual; =>  A=1,B=2,
-- Changing second operator from * to +
SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.+?,')  FROM dual; =>  A=1,B=2,
SELECT regexp_substr('A=1,B=2,C=3,', '.*B=.+,')   FROM dual; =>  A=1,B=2,C=3,
SELECT regexp_substr('A=1,B=2,C=3,', '.+B=.+,')   FROM dual; =>  A=1,B=2,C=3,
SELECT regexp_substr('A=1,B=2,C=3,', '.+?B=.+,')  FROM dual; =>  A=1,B=2,

The pattern is consistent: the greediness of the first occurrence is used for the second occurrence whether it should be or not.




回答2:


Looking at the feedback, I hesitate to jump in, but here I go ;-)

According to the Oracle docs, the *? and +? match a "preceding subexpression". For *? specifically:

Matches zero or more occurrences of the preceding subexpression (nongreedyFootref 1). Matches the empty string whenever possible.

To create a subexpression group, use parenthesis ():

Treats the expression within the parentheses as a unit. The expression can be a string or a complex expression containing operators.

You can refer to a subexpression in a back reference.

This will allow you to use greedy and non-greedy (many alternating times actually) in the same regexp, with expected results. For your example:

select regexp_substr('A=1,B=2,C=3,', '(.)*B=(.)*?,') from dual;

To make the point a bit more clear (i hope), this example uses greedy and non-greedy in the same regexp_substr, with different (correct) results depending on where the ? is placed (it does NOT just use the rule for the first subexpression it sees). Also note that the subexpression (\w) will match alphanumerics and underscore only, not @.

-- non-greedy followed by greedy 
select regexp_substr('1_@_2_a_3_@_4_a', '(\w)*?@(\w)*') from dual;

result: 1_@_2_a_3_

-- greedy followed by non-greedy
select regexp_substr('1_@_2_a_3_@_4_a', '(\w)*@(\w)*?') from dual;

result: 1_@




回答3:


You've got a really great bounty, so I'm going to try to nail it comprehensively.

You make assumptions in your regular expression handling that are incorrect.

  1. Oracle is NOT compatible with Perl regular expressions, it is compatible with POSIX. It describes its support for Perl as "Perl-Influenced"
  2. There is an intrinsic syntax conflict around the use of the Perl "*?" in Oracle, if you read that reference the way I do, and Oracle legitimately chooses the POSIX usage
  3. Your description of how perl handles "*?" is not quite right.

Here is a mashup of the options we've discussed. The key to this issue is around case 30

    CASE    SRC                             TEXT               RE                FROM_WHOM                                          RESULT        
    ------- ------------------------------- ------------------ ----------------- -------------------------------------------------- --------------
          1 Egor's original source string   A=1,B=2,C=3,       .*B=.*?,          Egor's original pattern "doesn't work"             A=1,B=2,C=3,  
          2 Egor's original source string   A=1,B=2,C=3,       .*B=.?,           Egor's "works correctly"                           A=1,B=2,      
          3 Egor's original source string   A=1,B=2,C=3,       .*B=.+?,          Old Pro comment 1 form 2                           A=1,B=2,      
          4 Egor's original source string   A=1,B=2,C=3,       .+B=.*?,          Old Pro comment 1 form 1                           A=1,B=2,      
          5 Egor's original source string   A=1,B=2,C=3,       .*B=.{0,}?,       Old Pro comment 2                                  A=1,B=2,      
          6 Egor's original source string   A=1,B=2,C=3,       [^B]*B=[^Bx]*?,   Old Pro answer form 1 "good"                       A=1,B=2,      
          7 Egor's original source string   A=1,B=2,C=3,       [^B]*B=[^B]*?,    Old Pro answer form 2 "bad"                        A=1,B=2,C=3,  
          8 Egor's original source string   A=1,B=2,C=3,       (.)*B=(.)*?,      TBone answer form 1                                A=1,B=2,      
          9 TBone answer example 2          1_@_2_a_3_@_4_a    (\w)*?@(\w)*      TBone answer example 2 form 1                      1_@_2_a_3_    
         10 TBone answer example 2          1_@_2_a_3_@_4_a    (\w)*@(\w)*?      TBone answer example 2 form 2                      1_@           
         30 Egor's original source string   A=1,B=2,C=3,       .*B=(.)*?,        Schemaczar Variant to force Perl operation         A=1,B=2,      
         31 Egor's original source string   A=1,B=2,C=3,       .*B=(.*)?,        Schemaczar Variant of Egor to force POSIX          A=1,B=2,C=3,  
         32 Egor's original source string   A=1,B=2,C=3,       .*B=.*{0,1}       Schemaczar Applying Egor's  'non-greedy'           A=1,B=2,C=3,  
         33 Egor's original source string   A=1,B=2,C=3,       .*B=(.)*{0,1}     Schemaczar Another variant of Egor's "non-greedy"  A=1,B=2,C=3,  

I am pretty sure that CASE 30 is what you thought you were writing - that is, you thought the "*?" had a stronger association than the "*" by itself. True for Perl, I guess, but for Oracle (and presumably canonical POSIX) RE's, the "*?" has a lower precedence and associativity than "*". So Oracle reads it as "(.*)?" (case 31) whereas Perl reads it as "(.)*?", that is, case 30.

Note cases 32 and 33 indicate that "*{0,1}" does not work like "*?".

Note that Oracle REGEXP does not work like LIKE, that is, it does not require the match pattern to cover the entire test string. Using the "^" begin and "$" end markers might help you with this as well.

My script:

SET SERVEROUTPUT ON

<<DISCREET_DROP>> begin
  DBMS_OUTPUT.ENABLE;
  for dropit in (select 'DROP TABLE ' || TABLE_NAME || ' CASCADE CONSTRAINTS' AS SYNT
  FROM TABS WHERE TABLE_NAME IN ('TEST_PATS', 'TEST_STRINGS')
  )
  LOOP
    DBMS_OUTPUT.PUT_LINE('Dropping via ' || dropit.synt);
    execute immediate dropit.synt;
  END LOOP;
END DISCREET_DROP;
/

--------------------------------------------------------
--  DDL for Table TEST_PATS
--------------------------------------------------------

  CREATE TABLE TEST_PATS 
   (    RE VARCHAR2(2000), 
  FROM_WHOM VARCHAR2(50), 
  PAT_GROUP VARCHAR2(50), 
  PAT_ORDER NUMBER(9,0)
   ) ;
/
--------------------------------------------------------
--  DDL for Table TEST_STRINGS
--------------------------------------------------------

  CREATE TABLE TEST_STRINGS 
   (    TEXT VARCHAR2(2000), 
  SRC VARCHAR2(200), 
  TEXT_GROUP VARCHAR2(50), 
  TEXT_ORDER NUMBER(9,0)
   ) ;
/
--------------------------------------------------------
--  DDL for View REGEXP_TESTER_V
--------------------------------------------------------

  CREATE OR REPLACE FORCE VIEW REGEXP_TESTER_V (CASE_NUMBER, SRC, TEXT, RE, FROM_WHOM, RESULT) AS 
  select pat_order as case_number,
  src, text, re, from_whom, 
  regexp_substr (text, re) as result
from test_pats full outer join test_strings on (text_group = pat_group)
order by pat_order, text_order;
/
REM INSERTING into TEST_PATS
SET DEFINE OFF;
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.*?,','Egor''s original pattern "doesn''t work"','Egor',1);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.?,','Egor''s "works correctly"','Egor',2);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.)*?,','Schemaczar Variant to force Perl operation','Egor',30);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.*)?,','Schemaczar Variant of Egor to force POSIX','Egor',31);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.*{0,1}','Schemaczar Applying Egor''s  ''non-greedy''','Egor',32);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=(.)*{0,1}','Schemaczar Another variant of Egor''s "non-greedy"','Egor',33);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('[^B]*B=[^Bx]*?,','Old Pro answer form 1 "good"','Egor',6);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('[^B]*B=[^B]*?,','Old Pro answer form 2 "bad"','Egor',7);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.+?,','Old Pro comment 1 form 2','Egor',3);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.*B=.{0,}?,','Old Pro comment 2','Egor',5);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('.+B=.*?,','Old Pro comment 1 form 1','Egor',4);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(.)*B=(.)*?,','TBone answer form 1','Egor',8);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(\w)*?@(\w)*','TBone answer example 2 form 1','TBone',9);
Insert into TEST_PATS (RE,FROM_WHOM,PAT_GROUP,PAT_ORDER) values ('(\w)*@(\w)*?','TBone answer example 2 form 2','TBone',10);
REM INSERTING into TEST_STRINGS
SET DEFINE OFF;
Insert into TEST_STRINGS (TEXT,SRC,TEXT_GROUP,TEXT_ORDER) values ('A=1,B=2,C=3,','Egor''s original source string','Egor',1);
Insert into TEST_STRINGS (TEXT,SRC,TEXT_GROUP,TEXT_ORDER) values ('1_@_2_a_3_@_4_a','TBone answer example 2','TBone',2);

COLUMN SRC FORMAT A50 WORD_WRAP
COLUMN TEXT  FORMAT A50 WORD_WRAP
COLUMN RE FORMAT A50 WORD_WRAP
COLUMN FROM_WHOM FORMAT A50 WORD_WRAP
COLUMN RESULT  FORMAT A50 WORD_WRAP

SELECT * FROM REGEXP_TESTER_V;



回答4:


because your selecting too much

SELECT 
  regexp_substr(
    'A=1,B=2,C=3,', 
    '.*?B=.*?,'
  ) as A_and_B,  -- Now works as expected
  regexp_substr(
    'A=1,B=2,C=3,', 
    'B=.*?,'
  ) as B_only    -- works just fine
FROM dual

SQL fiddle : http://www.sqlfiddle.com/#!4/d41d8/11450



来源:https://stackoverflow.com/questions/16702672/why-non-greedy-quantifier-sometimes-doesnt-work-in-oracle-regex

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!