Choosing encoding for icu::UnicodeString

Submitted by 时光怂恿深爱的人放手 on 2019-12-11 23:24:16

Question


I found myself in need of a way to change a string to lower case that was safe to use for ASCII and for UTF-16LE (as found in some Windows registry strings), and came across this question: How to convert std::string to lower case?

The answer that seemed to be the "most correct" to me (I'm not using Boost) was one that demonstrated using the ICU library.

In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?

ISO-8859-1 has worked for the few unit tests I've run against ASCII encoded strings that used only Latin characters, but I don't like using it if I don't know why.

If it matters, I'm mainly concerned with manipulating English data that is typically stored in ASCII, but the Windows registry can store strings in UTF-16LE, and I don't want to block myself from supporting other languages down the road by littering my code with non-Unicode-safe stuff.


Answer 1:


"I found myself in need of a way to change a string to lower case for the purpose of case-insensitive string comparison"

UnicodeString in ICU has several caseCompare() overloads for performing comparisons "case-insensitively using full case folding". You don't need to lower-case your strings manually before comparing them.

"In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?"

Because the author of that answer is passing an ISO-8859-1 encoded char* string literal to the constructor. UnicodeString stores its text as UTF-16. If you construct it from a char*, you have to name the charset the input bytes are actually encoded in, so that UnicodeString can decode them to Unicode and then re-encode them as UTF-16. In other words, the right value is whatever encoding your input data really uses.



Source: https://stackoverflow.com/questions/34513831/choosing-encoding-for-icuunicodestring
