How to order regular expression alternatives to get longest match?


I have a number of regular expressions regex1, regex2, ..., regexN combined into a single regex as regex1|regex2|...|regexN


                      
3 answers

Answer by 你的背包 (2020-12-06 18:07):

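Sure, a human may be able to judge in some cases whether one of two given regexps matches a prefix of what the other matches. In general, though, this is a hard problem (at least NP-hard: deciding inclusion or equivalence of regular expressions is PSPACE-complete). So don't try.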

In the best case, combining the different regexps into a single one gives a suitable result cheaply. However, I'm not aware of any algorithm that can take two arbitrary regexps and combine them so that the resulting regexp still matches exactly what either of the two would match. I would expect that to be at least as hard.

You also must not rely on the ordering of alternatives. How it is honored depends on the internal execution logic of the regexp engine, which could easily reorder the alternatives internally, beyond your control. So an ordering that is valid with the current engine might give wrong results with a different engine. (It can help as long as you stay with a single regexp engine implementation.)

The best approach, it seems to me, is to simply execute all the regexps, keep track of the matched length, and then take the longest match.
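A minimal sketch of that approach in Python, in case it helps. The helper name longest_match and the sample patterns are only illustrative, not something from the question. Note that re.search returns the leftmost match of each pattern, so the candidates may start at different positions, and ties go to the pattern listed first.

```python
import re

def longest_match(patterns, text):
    """Run every pattern separately and return the longest match, or None."""
    best = None
    for pattern in patterns:
        m = re.search(pattern, text)
        if m is None:
            continue
        # Only a strictly longer match replaces the current best,
        # so ties go to the pattern that appears earlier in the list.
        if best is None or len(m.group(0)) > len(best.group(0)):
            best = m
    return best

patterns = [r"a", r"ab.*c", r".{0,2}c*d"]
m = longest_match(patterns, "abcccd")
print(m.group(0) if m else None)  # abcccd
```

Because each pattern runs on its own, the order of the list only matters for breaking ties.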
                                                                        
                                                        
            
            
              
                
Answer by 别那么骄傲 (2020-12-06 18:15):

Longest Match

Unfortunately, there is no dedicated construct to tell a regular expression engine to get the longest match possible. Doing so could set off a cascade of backtracking gone wild. It is, by definition, a complexity too great to deal with.

All regular expressions are processed from left to right. Anything the engine can match first, it will, and then it bails out.

This is especially true of alternations, where this|this is|this is here will always match 'this' in 'this is here' first and will NEVER match this is or this is here.

Once you realize that, you can reorder the alternation into this is here|this is|this which gives the longest match every time.

Of course, this can be reduced to this(?: is(?: here)?)? which is the clever way of getting the longest match.
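Here is a small Python illustration of the same behaviour, as a sketch only. Python's re module is a backtracking engine with leftmost-first alternation. Sorting by length is only a safe fix when the alternatives are plain literal strings, which is what this example assumes. The variable names are made up for the demo.

```python
import re

text = "this is here"
alternatives = ["this", "this is", "this is here"]

# Leftmost-first: the first alternative that matches wins.
first = re.match(r"this|this is|this is here", text)
print(first.group(0))  # this

# For literal alternatives, putting the longest first gives the longest match.
ordered = sorted(alternatives, key=len, reverse=True)
by_length = re.compile("|".join(re.escape(a) for a in ordered))
print(by_length.match(text).group(0))  # this is here

# The same set of strings written as nested optional groups.
factored = re.compile(r"this(?: is(?: here)?)?")
print(factored.match(text).group(0))  # this is here
```

For alternatives that are themselves patterns rather than literals, length of the pattern text says nothing about length of the match, so sorting does not help there.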

I haven't seen any examples of the regexes you want to combine, so this is just some general information. If you show the regexes you're trying to combine, a better solution could be provided.

The alternatives affect each other, and whatever precedes or follows the group can also affect which alternative gets matched.

If you have more questions, just ask.



Addendum:

For @Laurel. This could always be done with a Perl 5 regex (>5.10) because Perl can run code from within regex sub-expressions. Since it can run code, it can count and get the longest match.

The rule of leftmost first, however, will never change. If regex were thermodynamics, this would be the first law.

Perl is a strange entity, as it tries to create a synergy between regex and code execution. As a result, it is possible to overload its operators, to inject customization into the language itself. Its regex engine is no different, and can be customized the same way.

So, in theory, the regex below can be made into a regex construct, a new Alternation construct. I won't go into details here, but suffice it to say, it's not for the faint of heart. If you're interested in this type of thing, see the perlre manpage under the section 'Creating Custom RE Engines'.

Perl:

Note - The regex alternation form is based on @Laurel's complex example (a|ab.*c|.{0,2}c*d) applied to abcccd.

If made into a custom regex construct, it would look similar to an alternation (?:rx1||rx2||rx3), and I'm guessing this is how a lot of Perl6 is done in terms of integrating the regex engine directly into the language.

Also, if used as is, it's possible to construct this regex dynamically as needed. And note that all the richness of Perl regex constructs is available.

Output

Longest Match Found:  abcccd


Code  

use strict;
use warnings;

my ($p1,$p2,$p3) = (0,0,0);
my $targ = 'abcccd';

# Formatted using RegexFormat7 (www.regexformat.com)

if ( $targ =~
/
   # The Alternation Construct
     (?=
          ( a )                         # (1)
          (?{ $p1 = length($^N) })
     )?
     (?=
          ( ab .* c )                   # (2)
          (?{ $p2 = length($^N) })
     )?
     (?=
          ( .{0,2} c*d )                # (3)
          (?{ $p3 = length($^N) })
     )?
   # Check At Least 1 Match: fail via (?!) only when none of groups 1-3 matched
     (?(1)
        |  (?(2)
             |  (?(3)
                  |  (?!)
                )
           )
     )
   # Consume Longest Alternation Match
     (                                  # (4 start)
          (?(?{
               $p1>=$p2 && $p1>=$p3
            })
               \1 
            |  (?(?{
                    $p2>=$p1 && $p2>=$p3
                 })
                    \2 
                 |  (?(?{
                         $p3>=$p1 && $p3>=$p2
                      })
                         \3 
                    )
               )
          )
     )                                  # (4 end)
/x ) {

    print "Longest Match Found:  $4\n";
} else {
    print "Did not find a match!\n";
}

                                                                        
                                                        
            
            
              
                
Answer by 闹比i (2020-12-06 18:19):

Use the right regex flavor!

In some regex flavors, the alternative providing the longest match is the one that is used ("greedy alternation"). Note that most of these regex flavors are old (yet still used today), and thus may lack some modern constructs.

Perl6 (now called Raku) is modern (and has many features), yet defaults to the POSIX-style longest alternation. (You can even switch styles, as || creates an alternation that short-circuits to the first match.) Note that the :Perl5/:P5 modifier is needed in order to use the "traditional" regex style.

Also, PCRE and the newer PCRE2 have functions that do the same. In PCRE2, it's pcre2_dfa_match. (See the section "Relevant information about regex engine design" below for more information about DFAs.)

This means you can put the alternatives of an alternation in ANY order and the result will always be the longest.

(This is different from the "absolute longest" match, as no amount of rearranging the terms in an alternation will change the fact that all regex engines traverse the string left-to-right. With the exception of .NET, apparently, which can go right-to-left. But traversing the string backwards wouldn't guarantee the "absolute longest" match either.) If you really want to find matches at (only) the beginning of a string, you should anchor the expression: ^(regex1|regex2|...).

According to this page*:


  The POSIX standard, however, mandates that the longest match be returned. When applying Set|SetValue to SetValue, a POSIX-compliant regex engine will match SetValue entirely.




* Note: I do not have the ability to test every POSIX flavor. Also, some regex flavors (Perl6) have this behavior without being POSIX compliant overall.

Let me give you one specific example that I have verified on my own computer:

echo "ab c a" | sed -E 's/(a|ab)/replacement/'

The regex is (a|ab). When it runs on the string ab c a, you get: replacement c a, meaning that you do, in fact, get the longest match that the alternation can provide.

This regex, for a more complex example, (a|ab.*c|.{0,2}c*d) applied to abcccd, will return abcccd.


More clarification: the regex engine will not go forward (in the search string) to see if there is an even longer match once it can match something. It will only look through the current list of alternatives to see if another one will match a longer string (from the position where the initial match starts).

In other words, no matter the order of the choices in an alternation, POSIX-compliant regexes use the one that matches the most characters.
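To make that rule concrete, here is a rough Python emulation of leftmost-longest alternation on top of a leftmost-first engine. The function name leftmost_longest is hypothetical. It is only a sketch: each alternative still matches the way Python's backtracking engine matches it, so an alternative that contains its own nested alternation is not guaranteed to report its longest possible match.

```python
import re

def leftmost_longest(alternatives, text):
    """Return (start, matched_text) for the longest alternative at the
    leftmost position where any alternative matches, or None."""
    compiled = [re.compile(a) for a in alternatives]
    for start in range(len(text) + 1):
        # Try every alternative anchored at this start position.
        candidates = [m for m in (p.match(text, start) for p in compiled) if m]
        if candidates:
            # Longest wins; the engine never looks further right for a longer one.
            best = max(candidates, key=lambda m: m.end())
            return start, best.group(0)
    return None

print(leftmost_longest(["a", "ab"], "ab c a"))                   # (0, 'ab')
print(leftmost_longest(["ab", "a"], "ab c a"))                   # (0, 'ab')
print(leftmost_longest(["a", "ab.*c", ".{0,2}c*d"], "abcccd"))   # (0, 'abcccd')
```

The first two calls give the same result regardless of the order of the alternatives, which matches the sed example above.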



Other examples of flavors with this behavior:


Tcl ARE
POSIX ERE
GNU BRE
GNU ERE


Relevant information about regex engine design

This question asks about designing an engine, but the answers may be helpful to understand how these engines work. Essentially, DFA-based algorithms determine the common overlap of different expressions, especially those within an alternation. It might be worth checking out this page, which explains how alternatives can be combined into a single path.
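As a very simplified illustration of the "single path" idea, restricted to literal alternatives (a real DFA construction handles full regular expressions, which this sketch does not), the alternatives can be merged into a trie so that shared prefixes are walked only once. The names build_trie and longest_prefix_match are made up for this example.

```python
def build_trie(words):
    """Merge literal alternatives into one trie; shared prefixes share a path."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # marks the end of a complete alternative
    return root

def longest_prefix_match(trie, text, start=0):
    """Walk the trie once and return the longest alternative found at start."""
    node = trie
    best_end = start if None in trie else None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no alternative continues with this character
        if None in node:
            best_end = i + 1  # remember the longest complete alternative so far
    return None if best_end is None else text[start:best_end]

trie = build_trie(["Set", "SetValue", "this", "this is", "this is here"])
print(longest_prefix_match(trie, "SetValue = 1"))   # SetValue
print(longest_prefix_match(trie, "this is there"))  # this is
```

The input is scanned once and the order in which the alternatives were added makes no difference to the result.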




Note: at some point, you might just want to consider using an actual programming language. Regexes aren't everything.
                                                                        
                                                        
            
            
              
                