Only extract those words from a list that include no repeating letters, using regex

Submitted by 丶灬走出姿态 on 2021-02-10 08:41:52

Question


I have a large word list file with one word per line. I would like to filter out the words that contain repeated letters.

INPUT:
  abducts
  abe
  abeam
  abel
  abele

OUTPUT:
  abducts
  abe
  abel

I'd like to do this using Regex (grep or perl or python). Is that possible?


Answer 1:


It's much easier to write a regex that matches words that do have repeating letters, and then negate the match:

my @input = qw(abducts abe abeam abel abele);
my @output = grep { not /(\w).*\1/ } @input;

(This code assumes that @input contains one word per entry.) But this problem isn't necessarily best solved with a regex.

I've given the code in Perl, but it could easily be translated into any regex flavor that supports backreferences, including grep (which also has the -v switch to negate the match).




Answer 2:


$ egrep -vi '(.).*\1' wordlist
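
For example, with GNU grep (back-references in -E/egrep patterns are a GNU extension), running it against the sample input from the question keeps exactly the expected words:

$ printf 'abducts\nabe\nabeam\nabel\nabele\n' | egrep -vi '(.).*\1'
abducts
abe
abel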



Answer 3:


It is possible to use regex:

import re

inp = [
    'abducts'
,   'abe'
,   'abeam'
,   'abel'
,   'abele'
]

# detect any word that contains some character at least twice
rgx = re.compile(r'.*(.).*\1.*')

def filter_words(inp):
    for word in inp:
        if rgx.match(word) is None:
            yield word

print(list(filter_words(inp)))



Answer 4:


Simple Stuff

Despite the inaccurate protestation that this is impossible with a regex, it certainly is possible.

While @cjm justly states that it is a lot easier to negate a positive match than it is to express a negative one as a single pattern, the model for doing so is sufficiently well-known that it becomes a mere matter of plugging things into that model. Given that:

    /X/

matches something, then the way to express the condition

    ! /X/

in a single, positively-matching pattern is to write it as

    /\A (?: (?! X ) . ) * \z /sx

Therefore, given that the positive pattern is

    / (\pL) .* \1 /sxi

the corresponding negative needs must be

    /\A (?: (?! (\pL) .* \1  ) . ) * \z /sxi

by way of simple substitution for X.
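
For instance, here is a minimal Python sketch of that substitution (my illustration, not part of the original answer; Python's re has no \pL or \z, so [^\W\d_] and \Z serve as rough stand-ins):

    import re

    flags = re.IGNORECASE | re.DOTALL
    # Positive pattern X: some letter, anything at all, then that same letter again.
    positive = re.compile(r'([^\W\d_]).*\1', flags)
    # Negated form: every position in the string fails to start a match of X.
    negative = re.compile(r'\A(?:(?!([^\W\d_]).*\1).)*\Z', flags)

    words = ['abducts', 'abe', 'abeam', 'abel', 'abele']
    print([w for w in words if not positive.search(w)])  # ['abducts', 'abe', 'abel']
    print([w for w in words if negative.match(w)])       # same result, via the single negated pattern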

Real-World Concerns

That said, there are extenuating concerns that may sometimes require more work. For example, while \pL describes any code point having the General_Category=Letter property, it does not consider what to do with words like red‐violet–colored, ’Tisn’t, or fiancée — the last of which is different in otherwise-equivalent NFD vs NFC forms.

To handle such cases you must, at a bare minimum, first run your string through an NFD decomposition, and then also match grapheme clusters via \X instead of arbitrary code points via a plain dot. That way a string like "r\x{E9}sume\x{301}" — whose two é's are canonically equivalent but encoded differently — is correctly seen as containing a duplicate letter.
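
A small Python illustration of why the normalization step matters (my sketch; matching by grapheme cluster with \X needs something like the third-party regex module and is not shown here — NFC is used below only so that a plain code-point comparison exposes the duplicate):

    import unicodedata

    s = "r\u00E9sume\u0301"        # one é precomposed, the other written as e + combining acute
    print(len(s), len(set(s)))      # 7 7  -> comparing raw code points sees no duplicate
    nfc = unicodedata.normalize("NFC", s)
    print(len(nfc), len(set(nfc)))  # 6 5  -> after normalization the two é's are the same character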

So for English, you would want something along these lines for the positive match, with the corresponding negative formed by the substitution given above:

    NFD($string) =~ m{
        (?<ELEMENT>
           (?= [\p{Alphabetic}\p{Dash}\p{Quotation_Mark}] ) \X 
        )
        \X *
        \k<ELEMENT>
    }xi

But even with that there still remain certain outstanding issues unresolved, such as for example whether \N{EN DASH} and \N{HYPHEN} should be considered equivalent elements or different ones.

That’s because properly written, hyphenating two elements like red‐violet and colored to form the single compound word red‐violet–colored, where at least one of the pair already contains a hyphen, requires that one employ an EN DASH as the separator instead of a mere HYPHEN.

Normally the EN DASH is reserved for compounds of like nature, such as a time–space trade‐off. People using typewriter‐English don’t even do that, though, using that super‐massively overloaded legacy code point, HYPHEN-MINUS, for both: red-violet-colored.

It just depends whether your text came from some 19th‐century manual typewriter — or whether it represents English text properly rendered under modern typesetting rules. :)

Conscientious Case Insensitivity

You will note that I am here considering letters that differ in case alone to be the same letter. That’s because I use the /i regex switch, ᴀᴋᴀ the (?i) pattern modifier.

That’s rather like saying that they compare equal at collation strength 1 — but not quite, because Perl uses only case folding (albeit full case folding, not simple) for its case-insensitive matches, rather than a genuine reduced-strength collation comparison as might be preferred.

Full equivalence at the primary collation strength is a significantly stronger statement, and it may well be needed to solve the problem fully in the general case. But it also demands considerably more work than many specific instances of the problem actually require — in short, it is often overkill in practice, however necessary it might be for the hypothetical general case.
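
To make that distinction concrete, a small Python sketch (my illustration): str.casefold() is roughly Perl’s full case folding, and it is weaker than the primary-strength de__phonebook collation used in the Perl example below.

    # Full case folding equates MASSE with Maße, because ß folds to ss ...
    print("MASSE".casefold() == "Maße".casefold())   # True
    # ... but it does not equate ü with ue, which a primary-strength German
    # phonebook collation would; that needs a real collator.
    print("müß".casefold() == "muess".casefold())    # False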

This is made even more difficult because, although you can for example do this:

    my $collator = new Unicode::Collate::Locale::
                       level => 1, 
                       locale => "de__phonebook",
                       normalization => undef,
                    ;

    if ($collator->cmp("müß", "MUESS") == 0) { ... }

and expect to get the right answer — and you do, hurray! — this sort of robust string comparison is not easily extended to regex matches.

Yet. :)

Summary

The choice of whether to under‐engineer — or to over‐engineer — a solution will vary according to individual circumstances, which no one can decide for you.

I like CJM’s solution that negates a positive match, myself, although it’s somewhat cavalier about what it considers a duplicate letter. Notice:

    while ("de__phonebook" =~ /(?=((\w).*?\2))/g) {
        print "The letter <$2> is duplicated in the substring <$1>.\n";
    } 

produces:

    The letter <e> is duplicated in the substring <e__phone>.
    The letter <_> is duplicated in the substring <__>.
    The letter <o> is duplicated in the substring <onebo>.
    The letter <o> is duplicated in the substring <oo>.

That shows why, when you need to match a letter, you should always use \pL ᴀᴋᴀ \p{Letter} instead of \w, which actually matches [\p{alpha}\p{GC=Mark}\p{NT=De}\p{GC=Pc}].

Of course, when you need to match an alphabetic, you need to use \p{alpha} ᴀᴋᴀ \p{Alphabetic}, which isn’t at all the same as a mere letter — contrary to popular misunderstanding. :)




Answer 5:


If you're dealing with long strings that are likely to have duplicate letters, stopping ASAP may help.

INPUT: for (@input) {
   my %seen;
   while (/(.)/sg) {
      next INPUT if $seen{$1}++;
   }
   say;
}

I'd go with the simplest solution unless the performance is found to be really unacceptable.

my @output = grep !/(.).*?\1/s, @input;



Answer 6:


I was very curious about the relative speed of the various Perl-based methods submitted by other authors for this question. So, I decided to benchmark them.

Where necessary, I slightly modified each method so that it would populate an @output array, to keep the input and output consistent. I verified that all the methods produce the same @output, although I have not documented that assertion here.

Here is the script to benchmark the various methods:

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(cmpthese :hireswallclock);

# get a convenient list of words (on Mac OS X 10.6.6, this contains 234,936 entries)
open (my $fh, '<', '/usr/share/dict/words') or die "can't open words file: $!\n";
my @input = <$fh>;
close $fh;

# remove line breaks
chomp @input;

# set up the tests
my %tests = (

  # Author: cjm
  RegExp => sub { my @output = grep { not /(\w).*\1/ } @input },

  # Author: daotoad
  SplitCount => sub { my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input; },

  # Author: ikegami
  NextIfSeen => sub {
    my @output;
    INPUT: for (@input) {
      my %seen;
      while (/(.)/sg) {
        next INPUT if $seen{$1}++;
      }
      push @output, $_;
    }

  },

  # Author: ysth
  BitMask => sub {
    my @output;
    for my $word (@input) {
      my $mask1 = $word x ( length($word) - 1 );
      my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
      if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        push @output, $word;
      }
    }
  },

);

# run each test 100 times
cmpthese(100, \%tests);

Here are the results for 100 iterations.

           s/iter SplitCount    BitMask NextIfSeen     RegExp
SplitCount   2.85         --       -11%       -58%       -85%
BitMask      2.54        12%         --       -53%       -83%
NextIfSeen   1.20       138%       113%         --       -64%
RegExp      0.427       567%       496%       180%         --

As you can see, cjm's "RegExp" method is the fastest by far. It is 180% faster than the next fastest method, ikegami's "NextIfSeen" method. I suspect that the relative speed of the RegExp and NextIfSeen methods will converge as the average length of the input strings increases. But for "normal" length English words, the RegExp method is the fastest.




Answer 7:


cjm gave the regex, but here's an interesting non-regex way:

@words = qw/abducts abe abeam abel abele/;
for my $word (@words) {
    my $mask1 = $word x ( length($word) - 1 );
    my $mask2 = join( '', map { substr($word, $_), substr($word, 0, $_) } 1..length($word)-1 );
    if ( ( $mask1 ^ $mask2 ) !~ tr/\0// ) {
        print "$word\n";
    }
}
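
The trick: $mask1 is the word repeated length($word) - 1 times, while $mask2 is every nonzero rotation of the word concatenated together, so XOR-ing the two strings yields a NUL byte exactly where a character coincides with another occurrence of itself — that is, where a letter repeats. A rough Python rendering of the same idea (my sketch, assuming a plain ASCII word list):

words = ["abducts", "abe", "abeam", "abel", "abele"]
for word in words:
    n = len(word)
    mask1 = word * (n - 1)                                      # the word repeated n-1 times
    mask2 = "".join(word[i:] + word[:i] for i in range(1, n))   # every nonzero rotation of the word
    xored = bytes(a ^ b for a, b in zip(mask1.encode(), mask2.encode()))
    if 0 not in xored:   # no zero byte: no character ever lines up with a shifted copy of itself
        print(word)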



Answer 8:


In response to cjm's solution, I wondered about how it compared to some rather terse Perl:

my @output = grep { my @l = split ''; my %l; @l{@l} = (); keys %l == @l } @input;

Since I am not constrained in character count and formatting here, I'll be a bit clearer, even to the point of over-documenting:

my @output = grep {

    # Split $_ on the empty string to get letters in $_. 
    my @letters = split '';

    # Use a hash to remove duplicate letters.
    my %unique_letters;
    @unique_letters{@letters} = ();  # This is a hash slice assignment.
                                     # See perldoc perldata for more info

    # is the number of unique letters equal to the number of letters?
    keys %unique_letters == @letters

} @input;

And, of course in production code, please do something like this:

my @output = grep ! has_repeated_letters($_), @input;

sub has_repeated_letters {
    my $word = shift;
    #blah blah blah
    # see example above for the code to use here, with a nip and a tuck.
}



Answer 9:


In python with a regex:

python -c 'import re, sys; print "".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s))' wordlist.txt

In python without a regex:

python -c 'import sys; print "".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s)))' wordlist.txt
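
These are Python 2 one-liners (note the print statement). Roughly equivalent Python 3 versions (my untimed adaptation; end="" avoids doubling the newline each line already carries) would be:

python3 -c 'import re, sys; print("".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s)), end="")' wordlist.txt

python3 -c 'import sys; print("".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s))), end="")' wordlist.txt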

I performed some timing tests with a hardcoded file name and output redirected to /dev/null to avoid including output in the timing:

Timings without the regex:

python -m timeit 'import sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if len(s) == len(frozenset(s)))' 2>/dev/null
10000 loops, best of 3: 91.3 usec per loop

Timings with the regex:

python -m timeit 'import re, sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if not re.match(r".*(\w).*\1", s))' 2>/dev/null
10000 loops, best of 3: 105 usec per loop

Clearly the regex approach is a tiny bit slower than a simple frozenset creation and length comparison in Python.




Answer 10:


You can't do this with a pure regular expression: a true regular expression is equivalent to a finite state machine, which has no memory of which letters it has already seen. (The backreference patterns used in the other answers rely on an extension that goes beyond regular languages.)

I would suggest doing this with an explicit loop, checking each word manually. Roughly, in Python:

def words_without_repeats(words):
    kept = []
    for word in words:
        seen = set()
        for letter in word:
            if letter in seen:
                break            # repeated letter: drop this word
            seen.add(letter)
        else:                    # inner loop finished without finding a repeat
            kept.append(word)
    return kept


Source: https://stackoverflow.com/questions/5637963/only-extract-those-words-from-a-list-that-include-no-repeating-letters-using-re
