Use Perl to check if a string has only English characters

Posted by 感情迁移 on 2021-01-28 18:18:26

Question


I have a file with submissions like this

%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi

I am stripping everything but the song name by using this regex.

$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$@%#\\|]//g;

I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would be the first song title, Ai Wo Qing shut up, and not the next one because of the è.

I have tried this

if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
    print $line;
}
else {
    print "Non-english\n";
}

I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.


Answer 1:


Following from the comments, your problem would appear to be:

$line =~ m/[^a-zA-z0-9_]*$/

Specifically, the ^ is inside the brackets, which means that it's not acting as an anchor. It's actually a negation operator.

See: http://perldoc.perl.org/perlrecharclass.html#Negation

It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".

But the important part is that, without a start-of-line anchor, your regular expression only asks for zero or more instances of the class immediately before the end of the string. Zero instances always satisfies that, so the pattern matches every line, whatever its content.
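To make the effect of the anchor concrete, here is a minimal sketch. The sub names `unanchored` and `anchored` are hypothetical, and the anchored pattern reflects two assumptions that go beyond the question: `A-z` is corrected to `A-Z`, and a space is allowed inside the class so that multi-word titles can pass.

```perl
use strict;
use warnings;

# The pattern from the question: the class may match zero times just
# before the end of the string, so this succeeds for any input.
sub unanchored { return $_[0] =~ m/[^a-zA-z0-9_]*$/ ? 1 : 0 }

# Anchored at both ends with a positive class: every character must be
# an ASCII letter, digit, underscore, or space (the space is an assumption).
sub anchored { return $_[0] =~ m/\A[a-zA-Z0-9_ ]*\z/ ? 1 : 0 }

# \x{e8} is "è", written as an escape so the source file stays ASCII.
for my $title ( 'Ai Wo Qing shut up', "Ach\x{e8}te-moi" ) {
    printf "%d %d\n", unanchored($title), anchored($title);
}
```

The unanchored test reports a match for both titles, while the anchored one accepts only the first.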

(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce them here.)




Answer 2:


It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.

It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this

use strict;
use warnings;
use 5.010;

while ( <DATA> ) {
    chomp;
    my @fields = split /<SEP>/;
    say $fields[3];
}

__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi

output

Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi

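Also, the word character class \w matches [a-zA-Z0-9_] (and \W matches the complement). That is exact only under the /a modifier (Perl 5.14 and later); on decoded Unicode text \w also matches letters such as è. Bear in mind too that \W matches spaces and punctuation, so the title must already be reduced to word characters. With those caveats, you can rewrite your if statement like this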

if ( $line =~ /\W/ ) {
    print "Non-English\n";
}
else {
    print $line;
}
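
Putting the two suggestions together, the following is one possible sketch rather than a definitive solution. It assumes that "English" means printable ASCII only and that the title is the fourth <SEP> field; the helper name `is_ascii_title` is made up for illustration. It checks for characters outside the printable ASCII range instead of using \W, so spaces and punctuation do not count as non-English.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the string contains only printable
# ASCII characters (space through tilde). Assumes "English" means ASCII.
sub is_ascii_title {
    my ($title) = @_;
    return $title =~ /\A[\x20-\x7E]*\z/ ? 1 : 0;
}

# Sample records from the question; \x{e8} is "è" written as an escape.
my @lines = (
    '%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))',
    "%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Ach\x{e8}te-moi",
);

for my $line (@lines) {
    my @fields = split /<SEP>/, $line;    # fourth field is the title
    my $title  = $fields[3];
    print is_ascii_title($title) ? "$title\n" : "Non-English\n";
}
```

This prints the first title unchanged and Non-English for the second. The same test also rejects è when the input is read as raw UTF-8 bytes, because both bytes of its encoding fall outside the range \x20-\x7E.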


Source: https://stackoverflow.com/questions/28488681/use-perl-to-check-if-a-string-has-only-english-characters
