What are the key decisions to get right when designing a fully Unicode-aware language or library?

Submitted by 孤街醉人 on 2020-01-14 03:19:09

Question


Looking at Tom Christiansen's talk

   🔫 Unicode Support Shootout

       👍 The Good, the Bad, & the (mostly) Ugly 👎

it seems that working with text is so incredibly hard that there is no programming language (except Perl 6) that gets it even remotely right.

What are the key design decisions to make to have a chance of implementing Unicode support correctly with a clean slate (i.e., no backward-compatibility requirements)?

What about default file encodings? Which transfer format and normalization form should be used internally and for strings? What about case mapping and case folding? What about locale and RTL support? What about regex engines as specified by UTS #18? What should common APIs look like?


Answer 1:


EDIT: I'll add more as I think of them.

You need no existing code that you have to support. A legacy of code that requires that everything be in 8- or 16-bit code units is a royal pain. It makes even libraries awkward when you have to support pre-existing models that don't take this into account.

You have to work with blind people only, so fonts are no issue. :)

You have to follow the Unicode rules for identifier characters and for pattern-syntax characters. You should normalize your identifiers internally. If your language is itself LTR, you may not wish to allow RTL identifiers; it's unclear what's best here.
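In Perl, for instance, UAX #31's default identifier syntax falls straight out of the XID_Start and XID_Continue properties; here is a minimal sketch (the is_identifier helper is purely illustrative):

use v5.16;
use Unicode::Normalize qw(NFC);

# Default identifier syntax per UAX #31: one XID_Start character
# followed by any number of XID_Continue characters.
sub is_identifier {
    my ($name) = @_;
    return $name =~ /\A \p{XID_Start} \p{XID_Continue}* \z/x;
}

# Normalize to NFC so "é" as one code point and as "e" plus a
# combining acute accent name the same identifier.
my $ident = NFC("r\x{E9}sum\x{E9}");
print is_identifier($ident) ? "valid\n" : "invalid\n";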

You need to provide primitives in your language that map to Unicode concepts: instead of just uppercase and lowercase, you need uppercase, titlecase, lowercase, and foldcase (or lc, uc, tc, and fc).
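Perl 5.16 spells these lc, uc, ucfirst (only an approximation of titlecase), and fc, which is enough to show why folding, not lowercasing, is the right tool for caseless comparison:

use v5.16;    # the v5.16 feature bundle enables fc()
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print uc("\x{FB03}"), "\n";   # "FFI": case mapping can change the length
print lc("Straße") eq lc("STRASSE") ? "same\n" : "different\n";   # "different"
print fc("Straße") eq fc("STRASSE") ? "same\n" : "different\n";   # "same"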

You need to give full access to the Unicode Character Database, including all character properties, so that the algorithms from the various technical reports can easily be built on top of them.
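Perl's bundled Unicode::UCD is one example of what that access can look like:

use v5.22;    # charprop() arrived in Perl 5.22
use Unicode::UCD qw(charinfo charprop);

my $info = charinfo(0x0301);      # COMBINING ACUTE ACCENT
print $info->{name}, "\n";        # "COMBINING ACUTE ACCENT"
print $info->{category}, "\n";    # "Mn" (Mark, nonspacing)

# Any property by name, e.g. the East Asian Width that print
# columns depend on:
print charprop(0x4E2D, 'East_Asian_Width'), "\n";   # "Wide"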

You need a clear logical model that is easily extensible to graphemes as needed. Just as people have come to realize that a code-point interface is vastly more important than a code-unit one, you have to be able to deal with graphemes and beyond. For example, nobody in their right mind should be forced to rewrite:

printf "%-10.10s", $string;

as this every time:

# This library treats strings as sequences of extended
# grapheme clusters for indexing, substrings, and columns.
use Unicode::GCString;

my $gcstring = Unicode::GCString->new($string);
my $colwidth = $gcstring->columns();
if ($colwidth > 10) {
    # Too wide: truncate. Note that substr() counts grapheme clusters,
    # so a trailing double-width character can still overshoot a column.
    print $gcstring->substr(0, 10);
} else {
    # %-10.10s left-justifies, so the padding goes after the string.
    print $gcstring;
    print " " x (10 - $colwidth);
}

You have to do it that way, BTW, because you have to have a notion of print columns, which can be 0 for combining and control characters, or 2 for characters with certain East Asian Width properties, and so on. It would be much better if there were no existing printf code, so you could start from scratch and do it right. I have no idea what to do about the widths of RTL scripts.
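The counts really are independent of one another; a quick demonstration with Unicode::GCString:

use v5.14;
use utf8;
use Unicode::GCString;

# Code points, grapheme clusters, and print columns all differ here.
my $s = "x\x{0301}\x{4E2D}";   # "x" + combining acute + CJK 中
print length($s), "\n";                           # 3 code points
print Unicode::GCString->new($s)->length, "\n";   # 2 grapheme clusters
print Unicode::GCString->new($s)->columns, "\n";  # 3 columns: 1 + 0 + 2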

The operating system is a pre-existing code-unit library.

You need to have no interaction with the filesystem's name space, since you have no control over whether filesystem A runs names through NFD (HFS+, nearly), filesystem B runs them through NFC, or filesystem C (traditional Unix, Linux included) does no normalization at all. Alternatively, you might be able to provide an abstraction layer here, with local filters that hide some of this from the user where possible. Operating systems always have code-unit limits, not code-point ones, which is going to annoy you.
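If you do take the abstraction-layer route, the filter amounts to "normalize on the way in, compare in one form"; a sketch with the core Unicode::Normalize module (the variable names are just illustrative):

use v5.14;
use utf8;
use Unicode::Normalize qw(NFC NFD);

# The same name as two different filesystems might report it:
my $from_hfs   = NFD("résumé.txt");   # HFS+ hands back decomposed names
my $from_other = NFC("résumé.txt");   # another source may hand back NFC

print $from_hfs eq $from_other ? "equal\n" : "not equal\n";            # "not equal"
print NFC($from_hfs) eq NFC($from_other) ? "equal\n" : "not equal\n";  # "equal"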

Other things with code-unit stipulations include databases that allocate fixed-size records. Fixed sizes just don't work: they're grapheme-hostile and normalization-form-hostile.
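The same text even occupies a different number of code points, never mind code units, depending on its normalization form:

use v5.14;
use utf8;
use Unicode::Normalize qw(NFC NFD);

# A fixed-size field sized for one form truncates or rejects the other.
my $name = "Việt Nam";
print length NFC($name), "\n";   # 8 code points
print length NFD($name), "\n";   # 10 code points (ệ becomes e + 2 marks)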



Source: https://stackoverflow.com/questions/7131023/what-are-the-key-decisions-to-get-right-when-designing-a-fully-unicode-aware-lan
