Use Perl to check if a string has only English characters

Question

I have a file with submissions like this

%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi

I am stripping everything but the song name by using this regex.

$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$@%#\\|]//g;

I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would be the first song title, Ai Wo Qing shut up, and not the next one, because of the è.

I have tried this

if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
    print $line;
}
else {
    print "Non-english\n";
}

I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.

Answer 1:

Following from the comments, your problem would appear to be:

$line =~ m/[^a-zA-z0-9_]*$/

Specifically, the ^ is inside the brackets, which means that it's not acting as an 'anchor'. It's actually a negation operator.

See: http://perldoc.perl.org/perlrecharclass.html#Negation

It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".

But the important part is that, without the 'start of line' anchor, your regular expression asks for zero-or-more instances (of whatever) before the end of the line, so it will match pretty much anything, because it can freely ignore the line content.
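To make the difference concrete, here is a minimal sketch comparing the pattern from the question with an anchored version. The helper names unanchored_match and only_word_chars are made up for this illustration, and the anchored class is just one way to write the check, not necessarily what the asker ultimately needs.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# The pattern from the question: a negated class, zero-or-more, anchored only at the end.
sub unanchored_match {
    my ($text) = @_;
    return $text =~ /[^a-zA-z0-9_]*$/ ? 1 : 0;
}

# Hypothetical alternative: anchored at both ends, so every character
# must be an ASCII letter, digit, or underscore.
sub only_word_chars {
    my ($text) = @_;
    return $text =~ /\A[a-zA-Z0-9_]*\z/ ? 1 : 0;
}

# \x{E8} is the character è
for my $text ( 'Shotgun', "Ach\x{E8}te-moi" ) {
    printf "%-12s unanchored=%d anchored=%d\n",
        $text, unanchored_match($text), only_word_chars($text);
}
```

The unanchored pattern reports a match for both strings, because a zero-length match just before the end of the string always succeeds. The anchored version accepts only the first one.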

(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).

Answer 2:

It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.

It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this

use strict;
use warnings;
use 5.010;

while ( <DATA> ) {
    chomp;
    my @fields = split /<SEP>/;
    say $fields[3];
}

__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi

output

Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi

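Also, for ASCII text the word character class \w matches [a-zA-Z0-9_] (and \W matches the complement). Note that your class is written a-zA-z, and the range A-z also takes in the punctuation characters that sit between Z and a in ASCII. On decoded Unicode strings \w can also match non-ASCII letters such as è unless the /a modifier is used. With that in mind, you can rewrite your if statement like this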

if ( $line =~ /\W/ ) {
    print "Non-English\n";
}
else {
    print $line;
}
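One caveat with the \W test: it also matches spaces, so a title such as Ai Wo Qing shut up would be reported as non-English. A minimal sketch that combines the split with an explicit, anchored ASCII class that allows spaces is shown below. The helper name is_plain_ascii_title is made up for this example, and the first sample line is assumed to have already had its parenthesised suffix stripped by the substitution from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the title is non-empty and contains
# only ASCII letters, digits, underscores, and spaces.
sub is_plain_ascii_title {
    my ($title) = @_;
    return $title =~ /\A[A-Za-z0-9_ ]+\z/ ? 1 : 0;
}

# Sample lines modelled on the question; \x{E8} is the character è.
my @lines = (
    '%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up',
    "%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Ach\x{E8}te-moi",
);

for my $line (@lines) {
    # The song title is the fourth <SEP>-separated field
    my $title = ( split /<SEP>/, $line )[3];
    print is_plain_ascii_title($title) ? "$title\n" : "Non-English\n";
}
```

This prints the first title and then Non-English for the second. Whether spaces, apostrophes, or other punctuation should count as "English" depends on the data, so the character class is the part to adjust.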

Source: https://stackoverflow.com/questions/28488681/use-perl-to-check-if-a-string-has-only-english-characters

Tags

regex

perl