Perl - regex - Position of first nonmatching character

Submitted by 安稳与你 on 2019-12-03 16:59:05
dawg

What you are proposing is difficult but doable.

If I understand correctly, you want to find out how far into the string a failing match got. To do this, you need to be able to parse a regex.

The best regex parser is probably Perl itself, run with the -Mre=debug command line switch:

$ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{5}/'
Compiling REx "gh[ijkl]{5}"
Final program:
   1: EXACT <gh> (3)
   3: CURLY {5,5} (16)
   5:   ANYOF[i-l][] (0)
  16: END (0)
anchored "gh" at 0 (checking anchored) minlen 7 
Guessing start of match in sv for REx "gh[ijkl]{5}" against "abcdefghijklmnopqr"
Found anchored substr "gh" at offset 6...
Starting position does not contradict /^/m...
Guessed: match at offset 6
Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                  ANYOF[i-l][] can match 4 times out of 5...
                                  failed...
Match failed
Freeing REx: "gh[ijkl]{5}"

You can shell out to that Perl command line with your regex and parse what it prints (the debug trace is written to stderr, not stdout). Look for the `

Here is a matching regex:

$ perl -Mre=debug -e'"abcdefghijklmnopqr"=~/gh[ijkl]{3}/'
Compiling REx "gh[ijkl]{3}"
Final program:
   1: EXACT <gh> (3)
   3: CURLY {3,3} (16)
   5:   ANYOF[i-l][] (0)
  16: END (0)
anchored "gh" at 0 (checking anchored) minlen 5 
Guessing start of match in sv for REx "gh[ijkl]{3}" against "abcdefghijklmnopqr"
Found anchored substr "gh" at offset 6...
Starting position does not contradict /^/m...
Guessed: match at offset 6
Matching REx "gh[ijkl]{3}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {3,3}(16)
                                  ANYOF[i-l][] can match 3 times out of 3...
  11 <ghijk> <lmnopqr>       | 16:  END(0)
Match successful!
Freeing REx: "gh[ijkl]{3}"

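You will need to build a parser that can handle the output of the Perl re debugger. In each trace line, the number on the left is the current offset into the string, and the text in the left-hand and right-hand angle brackets is what lies just before and just after that offset as the regex engine tries to match.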
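As a rough starting point for such a parser, here is a minimal sketch that only pulls the offsets out of the trace lines. The helper name furthest_offset is made up for illustration, and the line format it expects is assumed from the sample output above; the trace layout is not a stable interface and may differ between Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Perl): scan re=debug output and return the
# largest string offset shown in the left-hand column of the execution trace.
sub furthest_offset {
    my ($debug_text) = @_;
    my $max;
    for my $line (split /\n/, $debug_text) {
        # Trace lines look like: "   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)"
        next unless $line =~ /^\s*(\d+)\s+<.*?>\s+<.*?>\s*\|/;
        $max = $1 if !defined $max || $1 > $max;
    }
    return $max;    # undef if no trace lines were recognised
}

# A few lines copied from the failing run shown above.
my $trace = <<'END';
Matching REx "gh[ijkl]{5}" against "ghijklmnopqr"
   6 <bcdef> <ghijklmnop>    |  1:EXACT <gh>(3)
   8 <defgh> <ijklmnopqr>    |  3:CURLY {5,5}(16)
                                  ANYOF[i-l][] can match 4 times out of 5...
                                  failed...
Match failed
END

print furthest_offset($trace), "\n";    # prints 8
```

Note that 8 is only the offset at which the CURLY node started. The "can match 4 times out of 5" line says the character class went on to match four more characters, so a real parser would have to interpret those detail lines as well.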

This is not an easy project btw...

You can get the matching part, and use the index function to find its position:

my $x = 'abcdefghijklmnopqrstuvwxyz';

$x =~ /(g(h(o)?)?)/;
print index($x, $1) + length($1), "\n"; #8
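The nested-optional pattern can also be generated instead of written by hand. The following is a minimal sketch for a purely literal search string; literal_prefix_end is a hypothetical helper name, and it reads the end offset from $+[0] rather than using index. It looks at the leftmost occurrence of the first character only, so it does not search for the longest partial match anywhere in the string.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal such as "gho" into g(?:h(?:o)?)? and
# return the offset just past the longest prefix that matched, or undef if
# even the first character is absent.
sub literal_prefix_end {
    my ($string, $literal) = @_;
    my @chars = split //, $literal;
    return unless @chars;
    my $pattern = '';
    # Build from the inside out: o  ->  h(?:o)?  ->  g(?:h(?:o)?)?
    for my $ch (reverse @chars) {
        $pattern = quotemeta($ch) . (length $pattern ? "(?:$pattern)?" : '');
    }
    return $string =~ /$pattern/ ? $+[0] : undef;
}

print literal_prefix_end('abcdefghijklmnopqrstuvwxyz', 'gho'), "\n";    # prints 8
```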

This seems to work. Basically the idea is to split the regex into its constituent parts and try them cumulatively (first part, then first plus second, and so on), returning the end position of the last attempt that matched. The fixed strings need to be split up, but the character classes and quantifiers can be kept together.

In theory this should work, but it may need tweaking.

use v5.10;
use strict;
use warnings;

my $string = 'abcdefghijklmnopqrstuvwxyz';
my $match  = partial_match($string, qw(g h (?=i) [ijkx]+ [lmn]+ z));
say "match ended at pos $match, character ", substr($string,$match,1);

sub partial_match {
    my $string = shift;
    my @rx = @_;
    my $pos;
    if ($string =~ /$rx[0]/g) {
        $pos = pos $string;          # end of the match for the pieces tried so far
        if (defined $rx[1]) {
            # Merge the next piece into the first and retry with the longer regex.
            splice @rx, 0, 2, $rx[0] . $rx[1];
            $pos = partial_match($string, @rx) // $pos;
        }
        return $pos;
    } else {
        say "Didn't match $rx[0]";
        return;
    }
}
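The same idea can be written without recursion. This is a sketch only: partial_match_iter is a made-up name, and it additionally returns the index of the first piece that failed, which the version above does not do.

```perl
use strict;
use warnings;

# Hypothetical iterative variant: grow the regex one piece at a time and
# remember where the last successful attempt ended.
# Returns (end offset or undef, index of the first failing piece or undef).
sub partial_match_iter {
    my ($string, @pieces) = @_;
    my ($pos, $prefix) = (undef, '');
    for my $i (0 .. $#pieces) {
        $prefix .= $pieces[$i];
        return ($pos, $i) unless $string =~ /$prefix/;
        $pos = $+[0];    # end offset of the whole match
    }
    return ($pos, undef);    # every piece matched
}

my ($end, $failed) = partial_match_iter(
    'abcdefghijklmnopqrstuvwxyz',
    qw(g h (?=i) [ijkx]+ [lmn]+ z),
);
print "matched up to $end, piece $failed failed\n";    # matched up to 14, piece 5 failed
```

As with the recursive version, each attempt is an independent leftmost match, so pieces that let an earlier part of the string match could move the reported position.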

How about:

#!/usr/bin/perl 
use Modern::Perl;

my $x = 'abcdefghijklmnopqrstuvwxyz';
my $s = 'gho';
do {
    if ($x =~ /$s/) {
        say "$s matches from $-[0] to $+[0]";
    } else {
        say "$s doesn't match";
    }
} while chop $s;

output:

gho doesn't match
gh matches from 6 to 8
g matches from 6 to 7
 matches from 0 to 0
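One caveat with chopping: once the pattern contains metacharacters, a truncated pattern may no longer compile (for example "gh[xyz" has an unmatched bracket), and the match would die. Here is a hedged sketch that wraps each attempt in eval; longest_matching_prefix is a hypothetical name, and a truncated pattern that does compile may still mean something different from the original.

```perl
use strict;
use warnings;

# Hypothetical helper: shorten the pattern one character at a time until a
# prefix both compiles and matches. Returns (prefix, end offset), or an
# empty list if nothing matched.
sub longest_matching_prefix {
    my ($string, $pattern) = @_;
    while (length $pattern) {
        # eval traps the fatal error raised by an invalid truncated regex.
        my $end = eval { $string =~ /$pattern/ ? $+[0] : undef };
        return ($pattern, $end) if defined $end;
        chop $pattern;
    }
    return;
}

my ($prefix, $end) = longest_matching_prefix('abcdefghijklmnopqrstuvwxyz', 'gh[xyz]+');
print "$prefix matches up to $end\n";    # gh matches up to 8
```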

I think that's exactly what the pos function is for. NOTE: pos is only set if you match with the /g flag in scalar context.

my $x = 'abcdefghijklmnopqrstuvwxyz';
my $end = 0;
if( $x =~ /$ARGV[0]/g )
{
    $end = pos($x);
}
print "End of match is: $end\n";

This gives the following output:

[@centos5 ~]$ perl x.pl
End of match is: 0
[@centos5 ~]$ perl x.pl def
End of match is: 6
[@centos5 ~]$ perl x.pl xyz
End of match is: 26
[@centos5 ~]$ perl x.pl aaa
End of match is: 0
[@centos5 ~]$ perl x.pl ghi
End of match is: 9
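For comparison, here is a small sketch of the same lookup done two ways: with pos after a scalar-context /g match, and with $+[0], which holds the end offset of the last successful match and needs no /g. Both helper names are made up for illustration, and both return undef on failure instead of the 0 printed above.

```perl
use strict;
use warnings;

# End of match via pos(); requires /g in scalar context.
sub end_with_pos {
    my ($string, $re) = @_;
    return $string =~ /$re/g ? pos($string) : undef;
}

# End of match via @+; works without /g.
sub end_with_offsets {
    my ($string, $re) = @_;
    return $string =~ /$re/ ? $+[0] : undef;
}

my $x = 'abcdefghijklmnopqrstuvwxyz';
for my $re (qw(def xyz ghi aaa)) {
    my $a = end_with_pos($x, $re);
    my $b = end_with_offsets($x, $re);
    printf "%s: pos=%s \$+[0]=%s\n", $re, $a // 'undef', $b // 'undef';
}
```

Either way, this reports where a successful match ended; it says nothing about how far a failing match got.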