
wchar_t for UTF-16 on Linux?

╄→гoц情女王★ Submitted on 2019-12-05 11:28:50
Does it make any sense to store UTF-16-encoded text in a wchar_t* on Linux? The obvious problem is that wchar_t is four bytes on Linux, while UTF-16 usually takes two bytes (or sometimes two pairs of two bytes) per character. I'm trying to use a third-party library that does exactly that, and it seems very confusing. It looks like things got mixed up because wchar_t is two bytes on Windows, but I just want to double-check, since it's a pretty expensive commercial library and maybe I just don't understand something. While it is possible to store UTF-16 in wchar_t, such wchar_t values (or arrays of them…
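The width mismatch at the heart of this question can be checked directly. A minimal Python sketch (not from the thread; ctypes' c_wchar mirrors the platform's wchar_t):

```python
import ctypes

# sizeof(wchar_t) is implementation-defined: typically 4 on Linux/glibc, 2 on Windows.
width = ctypes.sizeof(ctypes.c_wchar)
print(width)

# Storing UTF-16 code units in a 4-byte wchar_t wastes half of each slot and,
# worse, a non-BMP character still occupies two slots (a surrogate pair),
# even though one 4-byte slot could have held the whole code point as UTF-32.
```

On a typical Linux box this prints 4, which is exactly why a library shipping UTF-16 in wchar_t arrays there looks so odd.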

What is the difference between “UTF-16” and “std::wstring”?

隐身守侯 提交于 2019-12-05 01:24:04
Is there any difference between these two string storage formats? JoeG: std::wstring is a container of wchar_t. The size of wchar_t is not specified: Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type. UTF-16 is a way of encoding sequences of Unicode code points as sequences of 16-bit integers. With Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside the BMP, you'll end up with UTF-16, but mostly the two concepts are unrelated. If you use characters outside the BMP, std::wstring will not translate surrogate…
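The code-point vs. code-unit distinction the answer draws can be made concrete. A small illustration (mine, not the answerer's), using Python to count UTF-16 code units for one non-BMP code point:

```python
# One code point outside the BMP (U+1F600, an emoji) is a single character
# at the code-point level, but needs two UTF-16 code units (a surrogate pair).
s = "\U0001F600"
assert len(s) == 1                 # one Unicode code point

units = s.encode("utf-16-be")      # big-endian so the unit values read naturally
assert len(units) // 2 == 2        # two 16-bit code units
print(units.hex())                 # d83dde00
```

A 16-bit wstring stores exactly those two units; a 32-bit wstring would store the single value 0x1F600 instead.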

Are UTF-16 characters (as used, for example, by wide WinAPI functions) always 2 bytes long?

倖福魔咒の Submitted on 2019-12-05 01:23:17
Question: Please clarify for me how UTF-16 works. I am a little confused, considering these points: There is a type in C++, WCHAR, which is 2 bytes long (always 2 bytes long, obviously). Most of MSDN and some other documentation seem to assume that the characters are always 2 bytes long. This may just be my imagination; I can't come up with any particular examples, but it just seems that way. There are no "extra wide" functions or character types widely used in C++ or Windows…
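The answer to the title question is no: a 2-byte WCHAR holds one UTF-16 code unit, and code points above U+FFFF need two of them. The split is pure arithmetic, sketched here (my example, not from the question):

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    v = cp - 0x10000                  # 20 significant bits remain
    high = 0xD800 + (v >> 10)         # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits -> low (trail) surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

So MSDN's "one WCHAR per character" assumption only holds for the BMP; an emoji occupies two WCHARs.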

Python Character Encoding

╄→尐↘猪︶ㄣ Submitted on 2019-12-04 18:27:22
1. In Python 2 the default encoding is ASCII; in Python 3 it is Unicode. 2. Unicode encodings include UTF-32 (4 bytes per character), UTF-16 (2 bytes for BMP characters), and UTF-8 (1-4 bytes); UTF-16 is the most commonly used in-memory form, but files are usually stored as UTF-8, because UTF-8 saves space. 3. In Python 3, encode converts a str into bytes while transcoding, and decode converts the bytes back into a str.    Source: https://www.cnblogs.com/yang-ck/p/11877452.html
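The encode/decode and byte-count claims above are easy to verify in a few lines (my sketch, using a one-ASCII-plus-one-CJK string):

```python
s = "a中"  # one ASCII char, one CJK char (U+4E2D)

utf8 = s.encode("utf-8")        # encode: str -> bytes
utf16 = s.encode("utf-16-le")

assert len(utf8) == 1 + 3       # ASCII: 1 byte; a CJK char: 3 bytes in UTF-8
assert len(utf16) == 2 + 2      # every BMP character: 2 bytes in UTF-16

assert utf8.decode("utf-8") == s    # decode: bytes -> str round-trips
```

For mostly-ASCII text UTF-8 is smaller, which is the space argument made above; for CJK-heavy text the two come out closer (3 vs. 2 bytes per character).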

MD5 of a UTF-16LE string (without BOM and zero-byte terminator) in C#

谁都会走 Submitted on 2019-12-04 17:55:40
I've got the following problem: I need to create a method that generates an MD5 hash of a string. The string is, for example, "1234567z-äbc" (yes, with the umlaut). The ordinary MD5 hash of this string is: 935fe44e659beb5a3bb7a4564fba0513 The MD5 hash that I need is (100% sure): 9e224a41eeefa284df7bb0f26c2913e2 My documentation says it has to be a UTF-16LE conversion, without BOM and without a zero-byte terminator. The problem is the conversion itself. I have a working example in JavaScript, but when it comes to pushing bytes I am still a bit too inexperienced. /* * A JavaScript implementation of the RSA Data…
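The requirement "UTF-16LE without BOM and without a zero-byte terminator" maps cleanly onto Python's "utf-16-le" codec, which emits neither a BOM nor a trailing NUL. A minimal sketch of the described hashing (mine, not the asker's C#/JS code); if the documentation's conversion spec is right, this should reproduce the expected digest:

```python
import hashlib

def md5_utf16le(text: str) -> str:
    # "utf-16-le" emits no BOM and no terminator, matching the spec
    # "UTF-16LE without BOM and 0-byte end".
    return hashlib.md5(text.encode("utf-16-le")).hexdigest()

digest = md5_utf16le("1234567z-äbc")
print(digest)
```

In C# the equivalent byte sequence comes from Encoding.Unicode.GetBytes(text), which is likewise UTF-16LE without BOM.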

Unicode string normalization in C/C++

余生长醉 Submitted on 2019-12-04 16:48:50
Question: I am wondering how to normalize strings (containing UTF-8/UTF-16) in C/C++. In .NET there is a function, String.Normalize. I used UTF8-CPP in the past, but it does not provide such a function. ICU and Qt provide string normalization, but I prefer lightweight solutions. Is there any "lightweight" solution for this? Answer 1: As I wrote in another question, utf8proc is a very nice, lightweight library for basic Unicode functionality, including Unicode string normalization. Answer 2: For Windows, there is…
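For readers unfamiliar with what normalization actually does, here is a short illustration (mine, using Python's stdlib rather than the C libraries discussed) of the NFC/NFD forms that any of those libraries would produce:

```python
import unicodedata

composed = "\u00e9"        # 'é' as a single precomposed code point
decomposed = "e\u0301"     # 'e' + U+0301 combining acute accent

assert composed != decomposed   # different code point sequences, same visible text

# NFC composes, NFD decomposes; both yield a canonical form for comparison.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

This is why string comparison without normalization can report two visually identical strings as unequal.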

Some Thoughts on Mojibake (Garbled-Text) Problems

依然范特西╮ Submitted on 2019-12-04 15:41:42
Preface: After quitting my job in Changsha and moving to Shenzhen, I have been extremely busy finding an apartment, finding a job, and adapting to a new work environment. I haven't had time to write blog posts lately, so today, with some rare free time, I'm writing something! It's 2019 already, and I didn't expect there would still be so many mojibake problems. In my previous job, keeping the front end and back end on the same encoding was basically all it took; anyway, now that I've run into it, I might as well figure it out properly! Overview of encoding and decoding: We all know that a computer cannot directly store letters, digits, pictures, or symbols; the only unit a computer can process is the bit, and a bit can only be 0 or 1. To represent letters, digits, pictures, symbols, and so on as bit sequences, we need a storage rule under which different bit sequences stand for different characters; this is what we call "encoding". Conversely, parsing the bit (binary) sequences stored in the computer and displaying them as the corresponding letters, digits, pictures, and symbols is called "decoding", much like encryption and decryption in cryptography. Below are some terms involved in the encoding/decoding process: Character set: the collection of all characters and symbols, including the scripts of various countries, punctuation marks, graphic symbols, digits, and so on; simply put, a repertoire, independent of computers and of any encoding. Coded character set: a set of characters together with their assigned codes (i.e. numbers), giving every character in the character set a number; for example, Unicode assigns each character a unique code point in one-to-one correspondence. Character encoding: simply put, a mapping that turns the code points of a character set into binary sequences, so that the computer can store and process…
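Mojibake is exactly what happens when the decoding step above uses a different rule than the encoding step. A minimal reproduction (my example) of garbling and ungarbling:

```python
data = "深圳".encode("utf-8")        # encode: str -> bytes under one rule

garbled = data.decode("latin-1")    # decode with the WRONG rule -> mojibake
print(repr(garbled))                # 'æ·±å\x9c³' : classic UTF-8-read-as-Latin-1

# As long as no bytes were lost in between, re-encoding with the wrong
# charset and decoding with the right one recovers the original text.
assert garbled.encode("latin-1").decode("utf-8") == "深圳"
```

Latin-1 is recoverable here because it maps every byte value to a character; charsets that drop or replace bytes (e.g. decoding with errors="replace") destroy the information for good.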

Difference between composite characters and surrogate pairs

烂漫一生 Submitted on 2019-12-04 14:20:39
Question: In Unicode, what is the difference between composite characters and surrogate pairs? To me they sound like similar things: two characters representing one character. What differentiates these two concepts? Answer 1: Surrogate pairs are a weird wart in Unicode. Unicode itself is nothing other than an abstract assignment of meaning to numbers; that's what an encoding is. Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. Currently, numbers up to about 2^21 are…
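The distinction lives at two different layers, which a short illustration (mine) makes concrete: composite characters exist at the code-point level, surrogate pairs only inside the UTF-16 encoding:

```python
import unicodedata

# Composite character: ONE code point that NFD splits into base + combining
# mark. This is purely an abstract, code-point-level relationship.
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"

# Surrogate pair: TWO 16-bit code units that together encode ONE code point
# above U+FFFF. This exists only at the UTF-16 encoding level; the string
# itself is still a single code point.
assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
assert len("\U0001F600") == 1
```

So "two things standing for one character" is true of both, but at different layers: combining sequences survive in every encoding, while surrogates vanish the moment you re-encode as UTF-8 or UTF-32.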

C++ unicode UTF-16 encoding

五迷三道 Submitted on 2019-12-04 12:44:28
I have a wide-char string, L"hao123--我的上网主页", and it must be encoded to "hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". I was told that the encoded string uses a special "%uNNNN" format for encoding Unicode UTF-16 code points. This website tells me they are JavaScript escapes. But I don't know how to produce that encoding in C++. Is there any library to get this to work? Or give me some tips. Thanks, my friends! Embedding Unicode in string literals is generally not a good idea and is not portable; there is no guarantee that wchar_t will be 16 bits and that the encoding will be UTF-16. While this may…
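The transformation the asker wants is mechanical: walk the string's UTF-16 code units and escape any non-ASCII unit as \uNNNN. A reference sketch of that logic (in Python for brevity; the same loop ports directly to C++ over a char16_t buffer):

```python
def js_escape(text: str) -> str:
    # Walk UTF-16 code units, so non-BMP characters naturally become two
    # escapes, one per surrogate, exactly as JavaScript's escapes work.
    units = text.encode("utf-16-be")
    out = []
    for i in range(0, len(units), 2):
        u = (units[i] << 8) | units[i + 1]       # reassemble one 16-bit unit
        out.append(chr(u) if u < 0x80 else "\\u%04X" % u)
    return "".join(out)

print(js_escape("hao123--我的上网主页"))
# hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875
```

In C++, convert the input to UTF-16 explicitly (e.g. into std::u16string) first, rather than relying on wchar_t being 16 bits, for exactly the portability reason the answer gives.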

Unicode in Python - just UTF-16?

元气小坏坏 Submitted on 2019-12-04 12:34:01
Question: I was happy in my Python world, knowing that I was doing everything in Unicode and encoding as UTF-8 when I needed to output something to a user. Then one of my colleagues sent me this article on UTF-8, and it confused me. The author of the article indicates a number of times that UCS-2, the Unicode representation that Python uses, is synonymous with UTF-16. He even goes as far as directly saying that Python uses UTF-16 for its internal string representation. The author also admits to being a Windows…
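The confusion the article stirred up concerns old "narrow" Python builds, where strings held UTF-16-ish code units and a non-BMP character had length 2. On Python 3.3+ (PEP 393 flexible string representation) strings hold real code points, which a quick check confirms (my sketch, assuming a modern interpreter):

```python
import sys

# Full Unicode range is addressable: no narrow/UCS-2 build ambiguity.
assert sys.maxunicode == 0x10FFFF

# A non-BMP character is ONE code point, not a surrogate pair, so indexing
# and len() behave at the code-point level.
assert len("\U0001F600") == 1
```

On a Python 2 narrow build both checks would have failed (maxunicode was 0xFFFF and the emoji had length 2), which is the behavior the article conflates with "Python uses UTF-16".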