Regular expression to match boundary between different Unicode scripts

馋奶兔 提交于 2019-12-01 04:18:53

EDIT: I just noticed you didn’t actually specify which pattern-matching language you were using. Well, I hope a Perl solution will work for you, since the needed mechanations are likely to be really tough in any other language. Plus if you’re doing pattern matching with Unicode, Perl really is the best choice available for that particular kind of work.


When the $rx variable below is set to the appropriate pattern, this little snippet of Perl code:

my $data = "foo1 and Πππ 語語語 done";

while ($data =~ /($rx)/g) {
   print "Got string: '$1'\n"; 
} 

Generates this output:

Got string: 'foo1 and '
Got string: 'Πππ '
Got string: '語語語 '
Got string: 'done'

That is, it pulls out a Latin string, a Greek string, a Han string, and another Latin string. This is pretty darned closed to what I think you actually need.

The reason I didn’t post this yesterday is that I was getting weird core dumps. Now I know why.

My solution uses lexical variables inside of a (??{...}) construct. Turns out that that is unstable before v5.17.1, and at best worked only by accident. It fails on v5.17.0, but succeeds on v5.18.0 RC0 and RC2. So I’ve added a use v5.17.1 to make sure you’re running something recent enough to trust with this approach.

First, I decided that you didn’t actually want a run of all the same script type; you wanted a run of all the same script type plus Common and Inherited. Otherwise you will get messed up by punctuation and whitespace and digits for Common, and by combining characters for Inherited. I really don’t think you want those to interrupt your run of “all the same script”, but if you do, it’s easy to stop considering those.

So what we do is lookahead for the first character that has a script type of other than Common or Inherited. More than that, we extract from it what that script type actually is, and use this information to construct a new pattern that is any number of characters whose script type is either Common, Inherited, or whatever script type we just found and saved off. Then we evaluate that new pattern and continue.

Hey, I said it was hairy, didn’t I?

In the program I’m about to show, I’ve left in some commented-out debugging statements that show just what it’s doing. If you uncomment them, you get this output for the last run, which should help understand the approach:

DEBUG: Got peekahead character f, U+0066
DEBUG: Scriptname is Latin
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
Got string: 'foo1 and '
DEBUG: Got peekahead character Π, U+03a0
DEBUG: Scriptname is Greek
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
Got string: 'Πππ '
DEBUG: Got peekahead character 語, U+8a9e
DEBUG: Scriptname is Han
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Han}]*}
Got string: '語語語 '
DEBUG: Got peekahead character d, U+0064
DEBUG: Scriptname is Latin
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
Got string: 'done'

And here at last is the big hairy deal:

use v5.17.1;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:std :utf8);
use utf8;

use Unicode::UCD qw(charscript);

# regex to match a string that's all of the
# same Script=XXX type
#
my $rx = qr{
    (?=
       [\p{Script=Common}\p{Script=Inherited}] *
        (?<CAPTURE>
            [^\p{Script=Common}\p{Script=Inherited}]
        )
    )
    (??{
        my $capture = $+{CAPTURE};
   #####printf "DEBUG: Got peekahead character %s, U+%04x\n", $capture, ord $capture;
        my $scriptname = charscript(ord $capture);
   #####print "DEBUG: Scriptname is $scriptname\n";
        my $run = q([\p{Script=Common}\p{Script=Inherited}\p{Script=)
                . $scriptname
                . q(}]*);
   #####print "DEBUG: string to re-interpolate as regex is q{$run}\n";
        $run;
    })
}x;


my $data = "foo1 and Πππ 語語語 done";

$| = 1;

while ($data =~ /($rx)/g) {
   print "Got string: '$1'\n";
}

Yeah, there oughta be a better way. I don’t think there is—yet.

So for now, enjoy.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!