Choosing encoding for icu::UnicodeString

Question

I needed a way to convert a string to lower case that is safe for both ASCII and UTF-16LE (as found in some Windows registry strings), and came across this question: How to convert std::string to lower case?

The answer that seemed "most correct" to me (I'm not using Boost) was one that demonstrated using the ICU library.

In that answer, the author specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value, and how do I know what to use?

ISO-8859-1 has worked for the few unit tests I've run against ASCII-encoded strings that used only Latin characters, but I don't like using it without knowing why.

If it matters, I'm mainly concerned with manipulating English data that is typically stored in ASCII, but the Windows registry can store strings in UTF-16LE, and I don't want to block myself from supporting other languages down the road by littering my code with non-Unicode-safe code.

Answer 1:

> I found myself in need of a way to change a string to lower case for the purpose of case-insensitive string comparison

UnicodeString in ICU has many caseCompare() methods for performing comparisons "case-insensitively using full case folding". You don't need to transform your strings manually.

> In this answer, he specified the encoding "ISO-8859-1" for the UnicodeString constructor. Why is this the correct value and how do I know what to use?

Because the author of that answer is passing an ISO-8859-1-encoded char* string literal to the constructor. UnicodeString stores its text as UTF-16. If you construct it from a char*, you have to specify the charset the input bytes are actually encoded in, so that UnicodeString can decode them to Unicode and then re-encode them as UTF-16. In other words, the right value is whatever encoding your input data uses, not a fixed constant.

Source: https://stackoverflow.com/questions/34513831/choosing-encoding-for-icuunicodestring

Tags

c++

unicode

icu