Perl Regular expression | how to exclude words from a file

Posted by 无人久伴 on 2019-12-11 01:52:45

Question


I am looking for Perl regular expression syntax to meet some requirements in a project. First, I want to reject strings that contain any word from a text file (a dictionary).

For example, if my file contains these strings:

path.../Document.txt |
  tree
  car
  ship

then the regular expression should behave like this:

a1testtre  --  match
orangesh1  --  match
apleship3  --  not match  [contains a word from the file]

I also have one more requirement that I couldn't solve. I have to create a regex that does not allow a string to contain a repeated character (two identical characters in a row) over 3 times.

For example :

adminnisstrator21     -- match  (has 2 repeated pairs of chars)
kkeeykloakk           -- not match  (repetition occurs over 3 times)
stack22ooverflow      -- match  (has 2 repeated pairs of chars)

For this I have tried

\b(?:([a-z])(?!\1))+\b

but it works only for the first repeated character. Any idea how to solve these two?


Answer 1:


One way to exclude strings that contain words from a given list is to form a pattern with an alternation of the words and use that in a regex, whereby a match excludes the string.

use warnings;
use strict;
use feature qw(say);

use Path::Tiny;

my $file = shift // die "Usage: $0 file\n";  #/

my @words = split ' ', path($file)->slurp;

my $exclude = join '|', map { quotemeta } @words;

foreach my $string (qw(a1testtre orangesh1 apleship3)) 
{ 
    if ($string !~ /$exclude/) { 
        say "OK: $string"; 
    }
}

I use Path::Tiny to read the file into a string ("slurp"), which is then split by whitespace into words to use for exclusion. The quotemeta escapes non-"word" characters, should any occur in your words, which are then joined by | to form a string with a regex pattern. (With complex patterns use qr.)

This can probably be tweaked and improved depending on your use case, for one with regard to the order of patterns with common parts in the alternation.

The check that runs of successive duplicate characters occur fewer than three times, as the samples require:
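The remark about qr can be sketched roughly as follows. This is only an illustration: the words are given inline instead of being read from a file, and the helper name build_exclude_re is hypothetical. It also returns undef for an empty word list, since an empty pattern would match every string.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical helper: build one precompiled regex from a list of words.
# Returns undef for an empty list, because an empty pattern matches anything.
sub build_exclude_re {
    my @words = @_;
    return unless @words;
    my $alternation = join '|', map { quotemeta } @words;
    return qr/$alternation/;
}

my $exclude_re = build_exclude_re(qw(tree car ship));

for my $string (qw(a1testtre orangesh1 apleship3)) {
    say "OK: $string" if $string !~ $exclude_re;
}
```

Compiling once with qr avoids rebuilding the pattern for every string that is tested.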

foreach my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow))
{
    my @chars_that_repeat = $string =~ /(.)\1+/g;

    if (@chars_that_repeat < 3) { 
        say "OK: $string";
    }
}

A long run of one repeated character (aaaa) counts as one instance, due to the + quantifier in the regex; if you'd rather count all pairs, remove the + and four a's will count as two pairs. The same character repeated at different places in the string counts every time, so aaXaa counts as two pairs.
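The difference between the two ways of counting can be seen side by side in a small sketch. The sub names count_runs and count_pairs are made up for this illustration.

```perl
use strict;
use warnings;
use feature qw(say);

# With the + quantifier a whole run of one character is a single hit.
sub count_runs {
    my ($string) = @_;
    my @runs = $string =~ /(.)\1+/g;
    return scalar @runs;
}

# Without the + every non-overlapping pair is a hit of its own.
sub count_pairs {
    my ($string) = @_;
    my @pairs = $string =~ /(.)\1/g;
    return scalar @pairs;
}

for my $string (qw(aaaa aaXaa kkeeykloakk)) {
    say "$string: runs=", count_runs($string), " pairs=", count_pairs($string);
}
```

For aaaa this reports one run but two pairs; for the other two strings both counts agree.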

This snippet can be just added to the above program, which is invoked with the name of the file with words to use for exclusion. They both print what is expected from provided samples.


Consider an example with the exclusion words so, sole, and solely. If you only need to check whether any one of them matches, then you'd want the shorter ones first in the alternation

my $exclude = join '|', map { quotemeta } sort { length $a <=> length $b } @words;
#==>  so|sole|solely

for a quicker match (so matches all three). This, by all means, appears to be the case here.

But, if you wanted to correctly identify which word matched then you must have longer words first,

solely|sole|so

so that the string solely is correctly matched by its own word before it can be "stolen" by so. In that case you'd want the sort the other way round: sort { length $b <=> length $a }.
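A short sketch of the effect of the ordering, using the three words from the example. The helper first_match is hypothetical; it returns whichever alternative the regex engine picks.

```perl
use strict;
use warnings;
use feature qw(say);

# Hypothetical helper: return the alternative that actually matched, or undef.
sub first_match {
    my ($string, @alternatives) = @_;
    my $pattern = join '|', map { quotemeta } @alternatives;
    return $string =~ /($pattern)/ ? $1 : undef;
}

my @words       = qw(so sole solely);
my @short_first = sort { length $a <=> length $b } @words;
my @long_first  = sort { length $b <=> length $a } @words;

say first_match('solely', @short_first);   # so
say first_match('solely', @long_first);    # solely
```

Both orderings agree on whether there is a match; they differ only in which word gets reported.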




Answer 2:


To not match a word from a file you might check whether a string contains a substring or use a negative lookahead and an alternation:

^(?!.*(?:tree|car|ship)).*$
  • ^ Assert start of string
  • (?! negative lookahead, assert what is on the right is not
    • .*(?:tree|car|ship) Match 0+ times any char except a newline and match either tree car or ship
  • ) Close negative lookahead
  • .* Match any char except a newline
  • $ Assert end of string
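In Perl, the lookahead pattern above could be assembled from the word list roughly like this. It is a sketch under the assumption that the words are already in an array; the name is_allowed is made up for the example.

```perl
use strict;
use warnings;
use feature qw(say);

my @words       = qw(tree car ship);
my $alternation = join '|', map { quotemeta } @words;

# Same shape as the pattern above, with the alternation interpolated.
my $allowed_re = qr/^(?!.*(?:$alternation)).*$/;

# Hypothetical helper: 1 if the string contains none of the words, else 0.
sub is_allowed {
    my ($string) = @_;
    return $string =~ $allowed_re ? 1 : 0;
}

say "$_: ", is_allowed($_) for qw(a1testtre orangesh1 apleship3);
```

Here a successful match means the string is acceptable, which is the opposite sense of testing against the plain alternation.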


To not allow a string to have over 3 times a char repeat you could use:

\b(?!(?:\w*(\w)\1){3})\w+\b
  • \b Word boundary
  • (?! Negative lookahead, assert what is on the right is not
    • (?: Non-capturing group
    • \w*(\w)\1 Match 0+ times a word character followed by capturing a word char in a group followed by a backreference using \1 to that group
    • ){3} Close non capturing group and repeat 3 times
  • ) close negative lookahead
  • \w+ Match 1+ word characters
  • \b word boundary
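Because the pattern is delimited by \b rather than anchored to the whole string, it can also pick the acceptable words out of a longer text. A minimal sketch, assuming the three sample words from the question are joined into one space-separated string:

```perl
use strict;
use warnings;
use feature qw(say);

my $text = 'adminnisstrator21 kkeeykloakk stack22ooverflow';

# Collect every word that does not contain three doubled characters.
my @accepted;
while ($text =~ /\b(?!(?:\w*(\w)\1){3})\w+\b/g) {
    push @accepted, $&;
}

say for @accepted;
```

The loop uses $& for the whole match because the only capture group sits inside the negative lookahead and is therefore not useful after a match.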


Update

According to this posted answer (which you might add to the question instead) you have 2 patterns that you want to combine but it does not work:

(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)

In those 2 patterns you use 2 capturing groups, so the second pattern has to point to the second capturing group \2.

(?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\2){4}))*$)
                                               ^  
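An alternative sketch, not part of the fix above, sidesteps the renumbering with Perl's relative backreferences (\g{-1}, available since Perl 5.10), so each sub-pattern keeps working no matter what is concatenated in front of it. The variable names are made up, and the test string abababababa is an added example chosen to trip only the second condition.

```perl
use strict;
use warnings;
use feature qw(say);

# Single-quoted so that Perl does not interpolate anything inside the patterns.
# \g{-1} refers to the capture group opened most recently before that point.
my $max_two_pairs  = '(?=^(?!(?:\w*(.)\g{-1}){3}).+$)';
my $max_four_times = '(?=^(?:(.)(?!(?:.*?\g{-1}){4}))*$)';

# Concatenation renumbers the groups, but the relative references
# still point at the intended one.
my $combined_re = qr/$max_two_pairs$max_four_times/;

for my $string (qw(adminnisstrator21 kkeeykloakk abababababa)) {
    say "$string: ", ($string =~ $combined_re ? "match" : "no match");
}
```

Only adminnisstrator21 matches: kkeeykloakk has three pairs, and abababababa contains the letter a more than four times.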





Answer 3:


I hope someone else will come with a better solution, but this seems to do what you want:

\b                          Match word boundary
  (?:                       Start non-capturing group
    (?:([a-z0-9])(?!\1))*   Match characters that are not followed by a copy of themselves
    (?:([a-z0-9])\2)+       Match one or more pairs of repeated characters
  ){0,2}                    Repeat the group 0 to 2 times
  (?:([a-z0-9])(?!\3))+     Match the remaining characters, none followed by a copy of itself
\b                          Match word boundary at the end of the word

I changed the [a-z] to also match digits, since the examples you gave include digits. Perl regex also has the \w shorthand, which for ASCII input is equivalent to [A-Za-z0-9_] and could be handy if you want to match any character in a word.
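With the /x modifier the annotated layout can be used almost verbatim in Perl, with the annotations turned into # comments. This is only a sketch to try the pattern on the sample strings; the variable name $few_doubles_re is made up.

```perl
use strict;
use warnings;
use feature qw(say);

my $few_doubles_re = qr{
    \b                          # word boundary
    (?:
        (?:([a-z0-9])(?!\1))*   # characters not followed by a copy of themselves
        (?:([a-z0-9])\2)+       # one or more doubled characters
    ){0,2}                      # at most two such stretches
    (?:([a-z0-9])(?!\3))+       # the rest of the word, no doubles
    \b                          # word boundary
}x;

for my $string (qw(adminnisstrator21 kkeeykloakk stack22ooverflow abcc)) {
    say "$string: ", ($string =~ $few_doubles_re ? "match" : "no match");
}
```

The added test word abcc shows one limitation of the pattern as written: the final group needs at least one character that is not part of a double, so a word ending in a doubled character is rejected.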




Answer 4:


My problem is that I have 2 regexes that work:

Do not allow over 3 pairs of chars:

          (?=^(?!(?:\w*(.)\1){3}).+$)

Do not allow a char to occur over 4 times:

        (?=^(?:(.)(?!(?:.*?\1){4}))*$)

Now I want to combine them into one line like:

      (?=^(?!(?:\w*(.)\1){3}).+$)(?=^(?:(.)(?!(?:.*?\1){4}))*$)

but only the first regex takes effect, not both of them.




Answer 5:


As mentioned in a comment on @zdim's answer, take it a bit further by making sure that the order in which your words are assembled into the match pattern doesn't trip you up. If the words in the file are not very carefully ordered to start with, I use a subroutine like this when building the match string:

# Returns a list of alternative match patterns in tight matching order.
# E.g., TRUSTEES before TRUSTEE before TRUST   
# TRUSTEES|TRUSTEE|TRUST

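sub tight_match_order {
    return @_ unless @_ > 1;
    my (@alts, @ordered_alts, %alts_seen);
    # Drop duplicates, keeping the first occurrence of each word.
    @alts   = map { $alts_seen{$_}++ ? () : $_ } @_;
    TEST: {
        my $alt = shift @alts;
        # If $alt is contained in a remaining word, defer it until after that word.
        # \Q...\E keeps any regex metacharacters in $alt literal.
        if (grep m#\Q$alt\E#, @alts) {
            push @alts => $alt;
        } else {
            push @ordered_alts => $alt;
        }
        redo TEST if @alts;
    }
    @ordered_alts
}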

So following @zdim's answer:

...
my @words = split ' ', path($file)->slurp;

@words = tight_match_order(@words); # add this line

my $exclude = join '|', map { quotemeta } @words;
...

HTH



Source: https://stackoverflow.com/questions/55728688/perl-regular-expression-how-to-exclude-words-from-a-file
