
wchar_t for UTF-16 on Linux?

╄→гoц情女王★ Submitted on 2019-12-05 11:28:50
Does it make any sense to store UTF-16-encoded text in a wchar_t* on Linux? The obvious problem is that wchar_t is four bytes on Linux, while UTF-16 usually takes two bytes (or sometimes two pairs of two bytes) per character. I'm trying to use a third-party library that does exactly that, and it seems very confusing. It looks like things got mixed up because wchar_t is two bytes on Windows, but I just want to double-check, since it's a pretty expensive commercial library and maybe I just don't understand something. While it is possible to store UTF-16 in wchar_t, such wchar_t values (or arrays of them…
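The width mismatch at the heart of this question can be checked directly. A minimal Python sketch (not from the thread; ctypes' c_wchar mirrors the platform's wchar_t):

```python
import ctypes

# sizeof(wchar_t) is implementation-defined: typically 4 on Linux/glibc, 2 on Windows.
width = ctypes.sizeof(ctypes.c_wchar)
print(width)

# Storing UTF-16 code units in a 4-byte wchar_t wastes half of each slot and,
# worse, a non-BMP character still occupies two slots (a surrogate pair),
# even though one 4-byte slot could have held the whole code point as UTF-32.
```

On a typical Linux box this prints 4, which is exactly why a library shipping UTF-16 in wchar_t arrays there looks so odd.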

What is the difference between “UTF-16” and “std::wstring”?

隐身守侯 提交于 2019-12-05 01:24:04
Is there any difference between these two string storage formats? JoeG: std::wstring is a container of wchar_t. The size of wchar_t is not specified: Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type. UTF-16 is a way of encoding sequences of Unicode code points as sequences of 16-bit integers. With Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside the BMP, you'll end up with UTF-16, but mostly the two concepts are unrelated. If you use characters outside the BMP, std::wstring will not translate surrogate…
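The code-point vs. code-unit distinction the answer draws can be made concrete. A small illustration (mine, not the answerer's), using Python to count UTF-16 code units for one non-BMP code point:

```python
# One code point outside the BMP (U+1F600, an emoji) is a single character
# at the code-point level, but needs two UTF-16 code units (a surrogate pair).
s = "\U0001F600"
assert len(s) == 1                 # one Unicode code point

units = s.encode("utf-16-be")      # big-endian so the unit values read naturally
assert len(units) // 2 == 2        # two 16-bit code units
print(units.hex())                 # d83dde00
```

A 16-bit wstring stores exactly those two units; a 32-bit wstring would store the single value 0x1F600 instead.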

Are UTF-16 characters (as used, for example, by wide WinAPI functions) always 2 bytes long?

倖福魔咒の Submitted on 2019-12-05 01:23:17
Question: Please clarify for me how UTF-16 works. I am a little confused, considering these points: There is a type in C++, WCHAR, which is 2 bytes long (always 2 bytes long, obviously). Most of MSDN and some other documentation seem to assume that the characters are always 2 bytes long. This may just be my imagination; I can't come up with any particular examples, but it just seems that way. There are no "extra wide" functions or character types widely used in C++ or Windows…
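The answer to the title question is no: a 2-byte WCHAR holds one UTF-16 code unit, and code points above U+FFFF need two of them. The split is pure arithmetic, sketched here (my example, not from the question):

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    v = cp - 0x10000                  # 20 significant bits remain
    high = 0xD800 + (v >> 10)         # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits -> low (trail) surrogate
    return high, low

print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

So MSDN's "one WCHAR per character" assumption only holds for the BMP; an emoji occupies two WCHARs.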

Python Character Encoding

╄→尐↘猪︶ㄣ Submitted on 2019-12-04 18:27:22
1. In Python 2 the default encoding is ASCII; in Python 3 it is Unicode. 2. Unicode encodings include UTF-32 (4 bytes per character), UTF-16 (2 bytes for BMP characters), and UTF-8 (1-4 bytes); UTF-16 is the most commonly used in-memory form, but files are usually stored as UTF-8, because UTF-8 saves space. 3. In Python 3, encode converts a str into bytes while transcoding, and decode converts the bytes back into a str.    Source: https://www.cnblogs.com/yang-ck/p/11877452.html
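The encode/decode and byte-count claims above are easy to verify in a few lines (my sketch, using a one-ASCII-plus-one-CJK string):

```python
s = "a中"  # one ASCII char, one CJK char (U+4E2D)

utf8 = s.encode("utf-8")        # encode: str -> bytes
utf16 = s.encode("utf-16-le")

assert len(utf8) == 1 + 3       # ASCII: 1 byte; a CJK char: 3 bytes in UTF-8
assert len(utf16) == 2 + 2      # every BMP character: 2 bytes in UTF-16

assert utf8.decode("utf-8") == s    # decode: bytes -> str round-trips
```

For mostly-ASCII text UTF-8 is smaller, which is the space argument made above; for CJK-heavy text the two come out closer (3 vs. 2 bytes per character).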

MD5 of a UTF-16LE string (without BOM and zero-byte terminator) in C#

谁都会走 Submitted on 2019-12-04 17:55:40
I've got the following problem: I need to create a method that generates an MD5 hash of a string. The string is, for example, "1234567z-äbc" (yes, with the umlaut). The ordinary MD5 hash of this string is: 935fe44e659beb5a3bb7a4564fba0513 The MD5 hash that I need is (100% sure): 9e224a41eeefa284df7bb0f26c2913e2 My documentation says it has to be a UTF-16LE conversion, without BOM and without a zero-byte terminator. The problem is the conversion itself. I have a working example in JavaScript, but when it comes to pushing bytes I am still a bit too inexperienced. /* * A JavaScript implementation of the RSA Data…
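The requirement "UTF-16LE without BOM and without a zero-byte terminator" maps cleanly onto Python's "utf-16-le" codec, which emits neither a BOM nor a trailing NUL. A minimal sketch of the described hashing (mine, not the asker's C#/JS code); if the documentation's conversion spec is right, this should reproduce the expected digest:

```python
import hashlib

def md5_utf16le(text: str) -> str:
    # "utf-16-le" emits no BOM and no terminator, matching the spec
    # "UTF-16LE without BOM and 0-byte end".
    return hashlib.md5(text.encode("utf-16-le")).hexdigest()

digest = md5_utf16le("1234567z-äbc")
print(digest)
```

In C# the equivalent byte sequence comes from Encoding.Unicode.GetBytes(text), which is likewise UTF-16LE without BOM.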

Unicode string normalization in C/C++

余生长醉 Submitted on 2019-12-04 16:48:50
Question: I am wondering how to normalize strings (containing UTF-8/UTF-16) in C/C++. In .NET there is a function, String.Normalize. I used UTF8-CPP in the past, but it does not provide such a function. ICU and Qt provide string normalization, but I prefer lightweight solutions. Is there any "lightweight" solution for this? Answer 1: As I wrote in another question, utf8proc is a very nice, lightweight library for basic Unicode functionality, including Unicode string normalization. Answer 2: For Windows, there is…
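For readers unfamiliar with what normalization actually does, here is a short illustration (mine, using Python's stdlib rather than the C libraries discussed) of the NFC/NFD forms that any of those libraries would produce:

```python
import unicodedata

composed = "\u00e9"        # 'é' as a single precomposed code point
decomposed = "e\u0301"     # 'e' + U+0301 combining acute accent

assert composed != decomposed   # different code point sequences, same visible text

# NFC composes, NFD decomposes; both yield a canonical form for comparison.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

This is why string comparison without normalization can report two visually identical strings as unequal.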

Some Thoughts on Mojibake (Garbled-Text) Problems

依然范特西╮ Submitted on 2019-12-04 15:41:42
Preface: After quitting my job in Changsha and moving to Shenzhen, I have been extremely busy finding an apartment, finding a job, and adapting to a new work environment. I haven't had time to write blog posts lately, so today, with some rare free time, I'm writing something! It's 2019 already, and I didn't expect there would still be so many mojibake problems. In my previous job, keeping the front end and back end on the same encoding was basically all it took; anyway, now that I've run into it, I might as well figure it out properly! Overview of encoding and decoding: We all know that a computer cannot directly store letters, digits, pictures, or symbols; the only unit a computer can process is the bit, and a bit can only be 0 or 1. To represent letters, digits, pictures, symbols, and so on as bit sequences, we need a storage rule under which different bit sequences stand for different characters; this is what we call "encoding". Conversely, parsing the bit (binary) sequences stored in the computer and displaying them as the corresponding letters, digits, pictures, and symbols is called "decoding", much like encryption and decryption in cryptography. Below are some terms involved in the encoding/decoding process: Character set: the collection of all characters and symbols, including the scripts of various countries, punctuation marks, graphic symbols, digits, and so on; simply put, a repertoire, independent of computers and of any encoding. Coded character set: a set of characters together with their assigned codes (i.e. numbers), giving every character in the character set a number; for example, Unicode assigns each character a unique code point in one-to-one correspondence. Character encoding: simply put, a mapping that turns the code points of a character set into binary sequences, so that the computer can store and process…
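Mojibake is exactly what happens when the decoding step above uses a different rule than the encoding step. A minimal reproduction (my example) of garbling and ungarbling:

```python
data = "深圳".encode("utf-8")        # encode: str -> bytes under one rule

garbled = data.decode("latin-1")    # decode with the WRONG rule -> mojibake
print(repr(garbled))                # 'æ·±å\x9c³' : classic UTF-8-read-as-Latin-1

# As long as no bytes were lost in between, re-encoding with the wrong
# charset and decoding with the right one recovers the original text.
assert garbled.encode("latin-1").decode("utf-8") == "深圳"
```

Latin-1 is recoverable here because it maps every byte value to a character; charsets that drop or replace bytes (e.g. decoding with errors="replace") destroy the information for good.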

Difference between composite characters and surrogate pairs

烂漫一生 Submitted on 2019-12-04 14:20:39
Question: In Unicode, what is the difference between composite characters and surrogate pairs? To me they sound like similar things: two characters representing one character. What differentiates these two concepts? Answer 1: Surrogate pairs are a weird wart in Unicode. Unicode itself is nothing other than an abstract assignment of meaning to numbers; that's what an encoding is. Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. Currently, numbers up to about 2^21 are…
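The distinction lives at two different layers, which a short illustration (mine) makes concrete: composite characters exist at the code-point level, surrogate pairs only inside the UTF-16 encoding:

```python
import unicodedata

# Composite character: ONE code point that NFD splits into base + combining
# mark. This is purely an abstract, code-point-level relationship.
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"

# Surrogate pair: TWO 16-bit code units that together encode ONE code point
# above U+FFFF. This exists only at the UTF-16 encoding level; the string
# itself is still a single code point.
assert "\U0001F600".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
assert len("\U0001F600") == 1
```

So "two things standing for one character" is true of both, but at different layers: combining sequences survive in every encoding, while surrogates vanish the moment you re-encode as UTF-8 or UTF-32.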

C++ unicode UTF-16 encoding

五迷三道 Submitted on 2019-12-04 12:44:28
I have a wide-char string, L"hao123--我的上网主页", and it must be encoded to "hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875". I was told that the encoded string uses a special "%uNNNN" format for encoding Unicode UTF-16 code points. This website tells me they are JavaScript escapes. But I don't know how to produce that encoding in C++. Is there any library to get this to work? Or give me some tips. Thanks, my friends! Embedding Unicode in string literals is generally not a good idea and is not portable; there is no guarantee that wchar_t will be 16 bits and that the encoding will be UTF-16. While this may…
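The transformation the asker wants is mechanical: walk the string's UTF-16 code units and escape any non-ASCII unit as \uNNNN. A reference sketch of that logic (in Python for brevity; the same loop ports directly to C++ over a char16_t buffer):

```python
def js_escape(text: str) -> str:
    # Walk UTF-16 code units, so non-BMP characters naturally become two
    # escapes, one per surrogate, exactly as JavaScript's escapes work.
    units = text.encode("utf-16-be")
    out = []
    for i in range(0, len(units), 2):
        u = (units[i] << 8) | units[i + 1]       # reassemble one 16-bit unit
        out.append(chr(u) if u < 0x80 else "\\u%04X" % u)
    return "".join(out)

print(js_escape("hao123--我的上网主页"))
# hao123--\u6211\u7684\u4E0A\u7F51\u4E3B\u9875
```

In C++, convert the input to UTF-16 explicitly (e.g. into std::u16string) first, rather than relying on wchar_t being 16 bits, for exactly the portability reason the answer gives.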

Unicode in Python - just UTF-16?

元气小坏坏 Submitted on 2019-12-04 12:34:01
Question: I was happy in my Python world, knowing that I was doing everything in Unicode and encoding as UTF-8 when I needed to output something to a user. Then one of my colleagues sent me this article on UTF-8, and it confused me. The author of the article indicates a number of times that UCS-2, the Unicode representation that Python uses, is synonymous with UTF-16. He even goes as far as directly saying that Python uses UTF-16 for its internal string representation. The author also admits to being a Windows…
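The confusion the article stirred up concerns old "narrow" Python builds, where strings held UTF-16-ish code units and a non-BMP character had length 2. On Python 3.3+ (PEP 393 flexible string representation) strings hold real code points, which a quick check confirms (my sketch, assuming a modern interpreter):

```python
import sys

# Full Unicode range is addressable: no narrow/UCS-2 build ambiguity.
assert sys.maxunicode == 0x10FFFF

# A non-BMP character is ONE code point, not a surrogate pair, so indexing
# and len() behave at the code-point level.
assert len("\U0001F600") == 1
```

On a Python 2 narrow build both checks would have failed (maxunicode was 0xFFFF and the emoji had length 2), which is the behavior the article conflates with "Python uses UTF-16".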