Question
Does Perl's \w match all alphanumeric characters defined in the Unicode standard?
For example, will \w match all (say) Chinese and Russian alphanumeric characters?
I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                   # the string literals below are UTF-8 source text
binmode(STDOUT, ':utf8');
my @ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";
foreach my $i (0 .. $#ok) {
    die "\\w failed on \$ok[$i]\n" unless $ok[$i] =~ /^\w+$/;
}
Answer 1:
perldoc perlunicode says
Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database.
\w can be used to match a Japanese ideograph, for instance.
So it looks like the answer to your question is "yes".
However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers and feel a little more confident that you'll get exactly what you want.
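As a rough illustration of the difference, the sketch below counts how many characters of a sample string each class matches. The sample string and variable names are assumptions made for this example, and `use v5.14` is used so that the `unicode_strings` feature is on.

```perl
use v5.14;      # enables strict and the unicode_strings feature
use warnings;

# Hypothetical sample: two Cyrillic letters, a CJK ideograph,
# an Arabic-Indic digit, an underscore, and SUPERSCRIPT TWO.
my $sample = "\x{0442}\x{0435}\x{6F22}\x{0663}_\x{00B2}";

# The "= () =" idiom counts the matches of a global regex.
my $letters = () = $sample =~ /\pL/g;   # letters only
my $numbers = () = $sample =~ /\pN/g;   # any kind of number
my $word    = () = $sample =~ /\w/g;    # Perl's word characters

say "letters=$letters numbers=$numbers word=$word";
# prints: letters=3 numbers=2 word=5
```

Here \pN counts the superscript while \w does not, and \w counts the underscore while neither \pL nor \pN does.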
Answer 2:
Yes and no.
If you want all alphanumerics, you want [\p{Alphabetic}\p{GC=Number}]. The \w contains both more and less than that. It specifically excludes any \pN which is not \p{Nd} nor \p{Nl}, like the superscripts, subscripts, and fractions. Those are \p{GC=Other_Number}, and are not included in \w.
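A small sketch of that "more and less" relationship, using a handful of code points chosen here for illustration (the sample list and variable names are not from the answer):

```perl
use v5.14;      # enables strict and the unicode_strings feature
use warnings;

my $word  = qr/\A\w\z/;
my $alnum = qr/\A[\p{Alphabetic}\p{GC=Number}]\z/;

# Illustrative samples: [code point, character name]
my @samples = (
    [ 0x0035, 'DIGIT FIVE' ],
    [ 0x2167, 'ROMAN NUMERAL EIGHT' ],
    [ 0x00B2, 'SUPERSCRIPT TWO' ],
    [ 0x2082, 'SUBSCRIPT TWO' ],
    [ 0x00BD, 'VULGAR FRACTION ONE HALF' ],
    [ 0x005F, 'LOW LINE' ],
);

my %result;   # code point => [ matches \w, matches the alphanumeric class ]
for my $s (@samples) {
    my ($cp, $name) = @$s;
    my $ch = chr $cp;
    $result{$cp} = [ ($ch =~ $word ? 1 : 0), ($ch =~ $alnum ? 1 : 0) ];
    printf "U+%04X %-26s \\w=%d alnum=%d\n", $cp, $name, @{ $result{$cp} };
}
```

The superscript, subscript, and fraction should show up only in the alphanumeric class, and the underscore only in \w.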
Unlike most regex systems, Perl complies with Requirement 1.2a, “Compatibility Properties”, from UTS #18 on Unicode Regular Expressions. So, assuming you have Unicode strings, a \w in a regex matches any single code point that has any of the following four properties:
1. \p{Alphabetic}
2. \p{GC=Mark}
3. \p{GC=Connector_Punctuation}
4. \p{GC=Decimal_Number}
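To make the four properties concrete, here is a minimal sketch that checks one code point per property, plus one character that has none of them. The sample characters are chosen here purely for illustration.

```perl
use v5.14;      # enables strict and the unicode_strings feature
use warnings;

# One illustrative code point per property.
my @samples = (
    [ 0x03C3, 'Alphabetic',            qr/\p{Alphabetic}/ ],               # GREEK SMALL LETTER SIGMA
    [ 0x0301, 'Mark',                  qr/\p{GC=Mark}/ ],                  # COMBINING ACUTE ACCENT
    [ 0x203F, 'Connector_Punctuation', qr/\p{GC=Connector_Punctuation}/ ], # UNDERTIE
    [ 0x0663, 'Decimal_Number',        qr/\p{GC=Decimal_Number}/ ],        # ARABIC-INDIC DIGIT THREE
);

my $all_word = 1;
for my $s (@samples) {
    my ($cp, $label, $prop) = @$s;
    my $ch       = chr $cp;
    my $has_prop = $ch =~ $prop ? 1 : 0;
    my $is_word  = $ch =~ /\w/  ? 1 : 0;
    $all_word &&= ($has_prop && $is_word);
    printf "U+%04X %-22s property=%d \\w=%d\n", $cp, $label, $has_prop, $is_word;
}

# U+20AC EURO SIGN is a currency symbol with none of the four properties.
my $euro_is_word = chr(0x20AC) =~ /\w/ ? 1 : 0;
say "EURO SIGN \\w=$euro_is_word";
```

Each of the four samples should report both its property and \w as 1, while the euro sign reports 0.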
Number 4 above can be expressed in any of these ways, which are all considered equivalent:
\p{Digit}, \p{General_Category=Decimal_Number}, \p{GC=Decimal_Number}, \p{Decimal_Number}, \p{Nd}, \p{Numeric_Type=Decimal}, \p{Nt=De}
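One way to spot-check that these spellings agree is to try each of them against a decimal digit and against a character that is a number but not a decimal digit. This is only a sketch on two hand-picked code points, not a proof of equivalence over the whole of Unicode.

```perl
use v5.14;      # enables strict and the unicode_strings feature
use warnings;

# The seven spellings, kept as strings and compiled at match time.
my @spellings = (
    '\p{Digit}',
    '\p{General_Category=Decimal_Number}',
    '\p{GC=Decimal_Number}',
    '\p{Decimal_Number}',
    '\p{Nd}',
    '\p{Numeric_Type=Decimal}',
    '\p{Nt=De}',
);

my $digit = chr 0x0663;   # ARABIC-INDIC DIGIT THREE (a decimal digit)
my $sup2  = chr 0x00B2;   # SUPERSCRIPT TWO (a number, but not a decimal digit)

# grep in scalar context returns how many spellings matched.
my $digit_hits = grep { $digit =~ /$_/ } @spellings;
my $sup2_hits  = grep { $sup2  =~ /$_/ } @spellings;

say "spellings matching U+0663: $digit_hits of ", scalar @spellings;
say "spellings matching U+00B2: $sup2_hits of ",  scalar @spellings;
```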
Note that \p{Digit} is not the same as \p{Numeric_Type=Digit}. For example, code point U+00B2, SUPERSCRIPT TWO, has only the \p{Numeric_Type=Digit} property and not plain \p{Digit}. That is because it is considered a \p{Other_Number}, or \p{No}. It does, however, have the \p{Numeric_Value=2} property, as you would imagine.
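The claims about U+00B2 can be checked directly. A minimal sketch, with a hash name invented for this example:

```perl
use v5.14;      # enables strict and the unicode_strings feature
use warnings;

my $sup2 = "\x{B2}";   # SUPERSCRIPT TWO

# Record 1 or 0 for each property test against the single character.
my %has = (
    'Numeric_Type=Digit' => ($sup2 =~ /\p{Numeric_Type=Digit}/ ? 1 : 0),
    'Digit'              => ($sup2 =~ /\p{Digit}/              ? 1 : 0),
    'Other_Number'       => ($sup2 =~ /\p{Other_Number}/       ? 1 : 0),
    'Numeric_Value=2'    => ($sup2 =~ /\p{Numeric_Value=2}/    ? 1 : 0),
    'word (\w)'          => ($sup2 =~ /\w/                     ? 1 : 0),
);

printf "%-20s %d\n", $_, $has{$_} for sort keys %has;
```

Only the `Digit` and `word (\w)` rows should come out as 0.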
It’s really point number 1 above, \p{Alphabetic}, that gives people the most trouble. That’s because they too often mistakenly think it is somehow the same as \p{Letter} (\pL), but it is not.
Alphabetic covers much more than that. It adds \p{Other_Alphabetic}, which includes some but not all of \p{GC=Mark}; all of \p{Lowercase} (which is not the same as \p{GC=Ll} because it adds \p{Other_Lowercase}); and all of \p{Uppercase} (which is not the same as \p{GC=Lu} because it adds \p{Other_Uppercase}). That’s how it pulls in \p{GC=Letter_Number} characters like Roman numerals, and also all the circled letters, which are of type \p{Other_Symbol} and \p{Block=Enclosed_Alphanumerics}.
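The gap between \p{Letter} and \p{Alphabetic} is easy to see on a few non-letter code points. The three samples below are picked for illustration and are not part of the original answer.

```perl
use v5.14;      # enables strict and the unicode_strings feature
use warnings;

# Illustrative samples: [code point, character name]
my @samples = (
    [ 0x2167, 'ROMAN NUMERAL EIGHT' ],              # GC=Letter_Number
    [ 0x24B6, 'CIRCLED LATIN CAPITAL LETTER A' ],   # GC=Other_Symbol
    [ 0x0345, 'COMBINING GREEK YPOGEGRAMMENI' ],    # GC=Mark with Other_Alphabetic
);

my %seen;   # code point => [ is \pL, is \p{Alphabetic}, is \w ]
for my $s (@samples) {
    my ($cp, $name) = @$s;
    my $ch = chr $cp;
    $seen{$cp} = [
        ($ch =~ /\pL/             ? 1 : 0),
        ($ch =~ /\p{Alphabetic}/  ? 1 : 0),
        ($ch =~ /\w/              ? 1 : 0),
    ];
    printf "U+%04X %-32s letter=%d alphabetic=%d \\w=%d\n", $cp, $name, @{ $seen{$cp} };
}
```

None of the three is a letter, yet all are Alphabetic and therefore match \w.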
Aren’t you glad we get to use \w? :)
Answer 3:
In particular \w also matches the underscore character.
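#!/usr/bin/perl
use strict;
use warnings;
my $name = 'Arun_Kumar';
# Anchor the match so that every character, including "_", must be a word character.
print(($name =~ /^\w+$/) ? "Underscore is a word character\n" : "Underscore is not a word character\n");
$ perl underscore.pl
Underscore is a word character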
Source: https://stackoverflow.com/questions/5555613/does-w-match-all-alphanumeric-characters-defined-in-the-unicode-standard