问题
Without looping over the entire range of Unicode characters, how can I get a list of characters that have a given property? In particular I want a list of all characters that are digits (i.e. those that match /\d/
). I have looked at Unicode::UCD, and it is useful for determining the properties of a given character, but there doesn't seem to be a way to get a list characters that have a property out of it.
回答1:
The list of Unicode characters for each class is generated from the Unicode spec when you compile Perl, and is typically stored in /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/
For example, the list of Unicode character ranges that match IsDigit (a.k.a. \d) is stored in the file /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/Digit.pl
回答2:
Even better than unicore/lib/gc_sc/Digit.pl
is unicore/To/Digit.pl
. It is a direct mapping of Unicode digit characters (well, really their offsets) to their numeric values. This means instead of:
use Unicode::Digits qw/digit_to_int/;
my @digits;
for (split "\n", require "unicore/lib/gc_sc/Digit.pl") {
my ($s, $e) = map hex, split;
for (my $ord = $s; $ord <= $e; $ord++) {
my $chr = chr $ord;
push @{$digits[digits_to_int $chr]}, $chr;
}
}
for my $i (0 .. 9) {
my $re = join '', "[", @{$digits[$i]}, "]";
$digits[$i] = qr/$re/;
}
I can say:
my @digits;
for (split "\n", require "unicore/To/Digit.pl") {
my ($ord, $val) = split;
my $chr = chr hex $ord;
push @{$digits[$val]}, $chr;
}
for my $i (0 .. 9) {
my $re = join '', "[", @{$digits[$i]}, "]";
$digits[$i] = qr/$re/;
}
Or even better:
my @digits;
for (split "\n", require "unicore/To/Digit.pl") {
my ($ord, $val) = split;
$digits[$val] .= "\\x{$ord}";
}
@digits = map { qr/[$_]/ } @digits;
回答3:
which characters /\d/ match depends entirely on your regexp implementation (although standard 0-9 are guaranteed). In the case of perl the perl locale used defines which characters are considered alphabetic and digits.
回答4:
There is no way to do that without iterating through all the characters. (if you create a huge string with all of them and use a regexp you still have to do the loop at least once, to create the string).
来源:https://stackoverflow.com/questions/1182420/how-do-i-get-a-list-of-all-unicode-characters-that-have-a-given-property