Internal and external encoding vs. Unicode

Submitted by 有些话、适合烂在心里 on 2019-12-23 17:35:06

Question


Since there was a lot of misinformation spread by several posters in the comments on this question: C++ ABI issues list

I have created this one to clarify.

  1. What are the encodings used for C style strings?
  2. Is Linux using UTF-8 to encode strings?
  3. How does external encoding relate to the encoding used by narrow and wide strings?

Answer 1:


  1. Implementation defined. Or even application defined; the standard doesn't really put any restrictions on what an application does with them, and expects a lot of the behavior to depend on the locale. All that is really implementation defined is the encoding used in string literals.

  2. In what sense? Most of the OS ignores most of the encodings; you'll have problems if '\0' isn't a nul byte, but even EBCDIC meets that requirement. Otherwise, depending on the context, there will be a few additional characters which may be significant (a '/' in path names, for example); all of these use the first 128 code points of Unicode, so have a single-byte encoding in UTF-8. As an example, I've used both UTF-8 and ISO 8859-1 for filenames under Linux. The only real issue is displaying them: if you do ls in an xterm, for example, ls and the xterm will assume that the filenames are in the same encoding as the display font.

  3. That mainly depends on the locale. Depending on the locale, it's quite possible for the internal encoding of a narrow character string not to correspond to that used for string literals. (But how could it be otherwise, since the encoding of a string literal must be determined at compile time, whereas the internal encoding of a narrow character string depends on the locale used to read it, and can vary from one string to the next.)

If you're developing a new application in Linux, I would strongly recommend using Unicode for everything, with UTF-32 for wide character strings, and UTF-8 for narrow character strings. But don't count on anything outside the first 128 encoding points working in string literals.




Answer 2:


  1. This depends on the architecture. Most Unix architectures use UTF-32 for wide strings (wchar_t) and ASCII for narrow strings (char). Note that ASCII is only a 7-bit encoding. Windows used UCS-2 until Windows 2000; later versions use the variable-width encoding UTF-16 (for wchar_t).
  2. No. Most system calls on Linux are encoding-agnostic (they don't care what the encoding is, since they are not interpreting it in any way). The external encoding is actually defined by your current locale.
  3. The internal encoding used by narrow and wide strings is fixed; it does not change with the locale. By changing the locale you are changing the translation functions that encode and decode the data entering and leaving your program (assuming you stick with the standard C text functions).


Source: https://stackoverflow.com/questions/7500902/internal-and-external-encoding-vs-unicode
