Does C++ support converting between character encodings other than UTF-8, UTF-16, and UTF-32?

Submitted by 浪子不回头ぞ on 2020-05-25 08:01:12

Question


I understand that std::codecvt<char16_t, char> in C++11 performs conversion between UTF-16 and UTF-8, and std::codecvt<char32_t, char> performs conversion between UTF-32 and UTF-8. Is it possible to convert between, say, UTF-8 and ISO 8859-1?

Consider:

const char* s = "\u00C0";

If I print this string and my terminal's encoding is set to UTF-8, I will see the character À. If I set my terminal's encoding to ISO 8859-1, however, printing that string will not print out the desired character. How would I convert s into a string that, when printed, will show the character À if my terminal's encoding is set to ISO 8859-1?

I understand that this can be done with a library such as iconv, but I am curious whether it can be done using only the C++ standard library. I ask this question not because I don't want to use iconv, but because I don't really understand how locales work in C++.


Answer 1:


In addition to the standard-mandated encodings, C++ also supports an implementation-defined list of encodings via locales:

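#include <locale>
#include <codecvt>
#include <iostream>
#include <string>

// Facets have protected destructors; this wrapper makes one usable
// with std::wstring_convert by inheriting its constructors.
template <typename Facet>
struct usable_facet : Facet {
  using Facet::Facet;
};

using codecvt = usable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;

int main() {
  // Locale names are platform specific; ".1252" is a Windows-style name.
  // The constructor throws std::runtime_error if the name is not recognized.
  std::wstring_convert<codecvt> convert(new codecvt(".1252"));

  // Assumes the narrow execution character set matches the locale's encoding,
  // so that this literal holds the single byte 0xC0.
  std::wstring w = convert.from_bytes("\u00C0");
}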

Unfortunately, the standard requires only that wchar_t use a fixed-width encoding within each locale; it does not require the same encoding across different locales. As a result, you cannot portably convert to wchar_t using one locale and then convert back to char using a different locale.

There is potentially some portable support for such conversions through std::mbrtoc32 and related functions (declared in <cuchar>), but these are not yet widely implemented.
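As a rough illustration of what using std::mbrtoc32 might look like, here is a minimal sketch. The helper name narrow_to_utf32 is hypothetical, not part of the standard library. It assumes the input is encoded in the multibyte encoding of the current C locale, and that the implementation produces UTF-32 in char32_t (C guarantees this only when __STDC_UTF_32__ is defined).

```cpp
#include <cuchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a narrow string, assumed to be in the current
// C locale's multibyte encoding, into a sequence of char32_t values.
std::u32string narrow_to_utf32(const std::string& in) {
    std::u32string out;
    std::mbstate_t state{};  // zero-initialized means "initial conversion state"
    const char* p = in.data();
    const char* const end = p + in.size();
    while (p < end) {
        char32_t c = 0;
        const std::size_t n =
            std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1))
            throw std::runtime_error("invalid multibyte sequence");
        if (n == static_cast<std::size_t>(-2))
            throw std::runtime_error("truncated multibyte sequence");
        out.push_back(c);
        if (n == static_cast<std::size_t>(-3))
            continue;           // an extra code unit was stored, no input consumed
        p += (n == 0 ? 1 : n);  // n == 0 means a null character was decoded
    }
    return out;
}
```

In the default "C" locale this is only reliable for plain ASCII input. To decode another encoding you would first select a locale with std::setlocale, and which locale names are accepted is again platform specific.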

Regarding this part of the question: "I understand that this can be done with a library such as iconv, but I am curious whether it can be done using only the C++ standard library. I ask this question not because I don't want to use iconv, but because I don't really understand how locales work in C++."

The locale library's design doesn't really lend itself to modern usage. C and C++ are themselves confused about encodings vs. character sets, and locales conflate lexical and orthographic issues with computational aspects such as encoding.

How locales work is a broader topic than is suitable for a Stack Overflow answer, but there are books on the subject. You'd probably also need to read platform-specific materials, because the standard doesn't really give any context for much of the functionality. For example, the locale library supports message catalogues, but it doesn't tell you what they are or how you'd actually make one, because that functionality is not standardized by C++.




Answer 2:


If you want to convert UTF-8 to ISO 8859-1 using only the facilities of the C++ standard library:

  1. Convert UTF-8 → UTF-32 (converting to UTF-16 would also work).
  2. Each code point below 256 maps directly to the ISO 8859-1 byte with the same value; code points of 256 and above cannot be represented in ISO 8859-1.
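The two steps above can be sketched roughly as follows. This is one possible reading of the approach, not the answerer's own code. The function name utf8_to_latin1 is hypothetical, and the choice to throw on unrepresentable code points is an assumption (substituting '?' would be another option). Note that std::wstring_convert and std::codecvt_utf8 are deprecated as of C++17, so a compiler may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 to ISO 8859-1 using only the standard library.
std::string utf8_to_latin1(const std::string& utf8) {
    // Step 1: UTF-8 -> UTF-32. from_bytes throws std::range_error on malformed input.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    const std::u32string code_points = conv.from_bytes(utf8);

    // Step 2: every code point below 256 is the ISO 8859-1 byte of the same value.
    std::string latin1;
    latin1.reserve(code_points.size());
    for (char32_t cp : code_points) {
        if (cp > 0xFF)
            throw std::range_error("code point not representable in ISO 8859-1");
        latin1.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return latin1;
}
```

For the string in the question, the UTF-8 bytes 0xC3 0x80 (À) would come out as the single byte 0xC0, which is what an ISO 8859-1 terminal expects.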

Since this has an answer, while almost any other desired specific encoding would not have an answer, I suspect that the question was constructed in order to be answerable.

The standard library conversions support only one other encoding, namely the unspecified multibyte encoding of the execution character set, via e.g. mbstowcs. (As a matter of formal pedantry, the wide character encoding need not be Unicode, so formally there is another unspecified encoding, but in practice it is Unicode, i.e. UTF-16 or UTF-32.)
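For reference, a conversion through mbstowcs might look like the sketch below. The wrapper name narrow_to_wide is hypothetical. It assumes the input is in the multibyte encoding of the current C locale and contains no embedded null characters, since mbstowcs stops at the first one.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: converts from the current locale's multibyte encoding
// to wchar_t, whose own encoding is unspecified by the standard.
std::wstring narrow_to_wide(const std::string& in) {
    // A multibyte string never yields more wide characters than it has bytes.
    std::wstring out(in.size(), L'\0');
    const std::size_t n = std::mbstowcs(&out[0], in.c_str(), out.size());
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    out.resize(n);  // n is the number of wide characters actually written
    return out;
}
```

Both the source encoding and the resulting wchar_t values depend on the locale selected with std::setlocale, which is why this route does not give a portable conversion to a specific named encoding.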


I wondered if I should add a code example, but since there’s no interest in this answer (to the question’s “I am curious whether it can be done using only the C++ standard library”) I think it would be wasted effort.



Source: https://stackoverflow.com/questions/24563521/does-c-support-converting-between-character-encodings-other-than-utf-8-utf-16
