Why doesn't ICU4J match UTF-8 sort order?

Posted by 冷暖自知 on 2019-12-02 18:41:23

Question


I am having a hard time understanding Unicode sorting order.

When I run Collator.getInstance(Locale.ENGLISH).compare("_", "#") under ICU4J 55.1, I get a return value of -1, indicating that _ sorts before #.

However, looking at http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec I see that # (U+0023) comes before _ (U+005F). Why is ICU4J returning a value of -1?


Answer 1:


First, UTF-8 is just an encoding. It specifies how to store the Unicode code points physically, but does not handle sorting, comparisons, etc.
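As a minimal sketch of the difference between a code point and its UTF-8 encoding, the following prints both for the two characters in question plus U+00E9, which is added here only to show a case where the two differ. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class EncodingVsCodePoint {
    // Returns the UTF-8 bytes of s as space-separated uppercase hex, e.g. "C3 A9".
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // '#', '_', and U+00E9 (written as an escape to avoid source-encoding issues)
        for (String s : new String[] {"#", "_", "\u00E9"}) {
            System.out.printf("U+%04X  UTF-8: %s%n", s.codePointAt(0), utf8Hex(s));
        }
        // prints:
        // U+0023  UTF-8: 23
        // U+005F  UTF-8: 5F
        // U+00E9  UTF-8: C3 A9
    }
}
```

For ASCII characters the single UTF-8 byte equals the code point, which is why the linked table looks like a sort order even though it only describes storage.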

Now, the page you linked to shows everything in numerical code point order. That is the order you would get with a binary collation (in SQL Server, those are the collations whose names end in _BIN or _BIN2). Non-binary ordering is far more complex; its rules are described in the Unicode Collation Algorithm (UCA).
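To see the binary ordering in plain Java, without ICU4J, here is a small sketch that sorts with String.compareTo. That method compares UTF-16 code units, which for these ASCII characters is the same as code point order. The class and helper names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class CodePointOrder {
    // Sorts a copy of the input using String's natural (binary) ordering.
    static List<String> sortBinary(List<String> in) {
        List<String> out = new ArrayList<>(in);
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        // '_' is U+005F and '#' is U+0023, so the difference is positive:
        // in binary order '_' sorts after '#'.
        System.out.println("_".compareTo("#")); // prints 60

        System.out.println(sortBinary(Arrays.asList("_", "#", "a", "A")));
        // prints [#, A, _, a]
    }
}
```

This is the ordering the question expected, and it is the opposite of what the collator returns for _ and #.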

The base rules are found here: http://www.unicode.org/repos/cldr/tags/release-28/common/uca/allkeys_CLDR.txt

It shows:

005F  ; [*010A.0020.0002] # LOW LINE
...
0023  ; [*0290.0020.0002] # NUMBER SIGN

Keep in mind that any locale / culture can override these base rules. So while the few lines quoted above explain this specific case, for other cases you would need to check http://www.unicode.org/repos/cldr/tags/release-28/common/collation/ to see whether there are any locale-specific overrides.
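The two table entries quoted above can also be compared mechanically. The sketch below parses a line of that shape and extracts the first (primary) weight. It assumes one code point and one collation element per line, which holds for these two entries but not for the whole file; the class name and regular expression are illustrative, not part of ICU4J or CLDR.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AllKeysLine {
    // Matches lines shaped like "005F  ; [*010A.0020.0002] # LOW LINE".
    // Group 1 is the code point; groups 2-4 are the three weights.
    private static final Pattern LINE = Pattern.compile(
            "^([0-9A-Fa-f]{4,6})\\s*;\\s*\\[[.*]"
            + "([0-9A-Fa-f]{4})\\.([0-9A-Fa-f]{4})\\.([0-9A-Fa-f]{4})\\]");

    // Returns the primary weight (the first of the three fields).
    static int primaryWeight(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("unrecognized line: " + line);
        }
        return Integer.parseInt(m.group(2), 16);
    }

    public static void main(String[] args) {
        String lowLine = "005F  ; [*010A.0020.0002] # LOW LINE";
        String numberSign = "0023  ; [*0290.0020.0002] # NUMBER SIGN";

        int a = primaryWeight(lowLine);    // 0x010A
        int b = primaryWeight(numberSign); // 0x0290
        System.out.println(Integer.compare(a, b)); // prints -1
    }
}
```

The -1 here matches the sign of what the collator returned in the question, for these two characters under the root rules.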




Answer 2:


Converting Mark Ransom's comments into an answer:

  • The ordering of individual characters is based on a collation table, which has little relationship to the codepoint numbers. See: http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table
  • If you follow the first link on that page, it leads to allkeys.txt which gives the default collation ordering.
  • In particular, _ is 005F ; [*020B.0020.0002] # LOW LINE while # is 0023 ; [*0391.0020.0002] # NUMBER SIGN. Note that the first (primary) weight for _, 020B, is lower than the one for #, 0391; the remaining two weights are identical.
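The weights quoted in this answer (from allkeys.txt) differ numerically from those quoted in the first answer (from allkeys_CLDR.txt), but only the relative order matters. As a rough sketch, the code below compares weight triples level by level for both sets of numbers. It is a simplification: a real UCA comparison builds sort keys over whole strings and applies variable-weighting options (the * marker), none of which is modeled here. All names are hypothetical.

```java
final class WeightCompare {
    // Compares two weight arrays level by level: primary first, then
    // secondary, then tertiary. The first difference decides.
    static int compareLevels(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i], b[i]);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        // Weights as quoted from allkeys.txt in this answer.
        int[] lowLineDucet = {0x020B, 0x0020, 0x0002};
        int[] numberSignDucet = {0x0391, 0x0020, 0x0002};

        // Weights as quoted from allkeys_CLDR.txt in the first answer.
        int[] lowLineCldr = {0x010A, 0x0020, 0x0002};
        int[] numberSignCldr = {0x0290, 0x0020, 0x0002};

        // Both comparisons are negative: '_' sorts before '#' in either table.
        System.out.println(compareLevels(lowLineDucet, numberSignDucet) < 0); // prints true
        System.out.println(compareLevels(lowLineCldr, numberSignCldr) < 0);   // prints true
    }
}
```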


Source: https://stackoverflow.com/questions/32705178/why-doesnt-icu4j-match-utf-8-sort-order
