utf-16

Let's Talk About Unicode Encoding

Submitted by 送分小仙女 on 2020-01-21 00:13:53
This is a fun read written by a programmer for programmers. By "fun" I mean that you can fairly painlessly pick up some concepts you were previously unclear about and grow your knowledge, a bit like leveling up in an RPG. The motivation for putting this article together was two questions:

Question 1: Using "Save As" in Windows Notepad, you can convert a file between the GBK, Unicode, Unicode big endian and UTF-8 encodings. They are all still txt files, so how does Windows recognize which encoding was used? I noticed long ago that txt files saved as Unicode, Unicode big endian and UTF-8 start with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). What standard are these markers based on?

Question 2: I recently came across ConvertUTF.c online, which implements conversion among the UTF-32, UTF-16 and UTF-8 encodings. I was already familiar with encodings such as Unicode (UCS2), GBK and UTF-8, but this program left me a bit confused; I could not remember how UTF-16 relates to UCS2.

After digging through some references I finally sorted these questions out, and picked up a few Unicode details along the way. I wrote it all up as an article for anyone who has had similar questions. I have tried to keep it easy to follow, but readers are expected to know what a byte is and what hexadecimal is.

0. big endian and little endian
big endian and little
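The excerpt above cuts off just as the byte-order discussion begins. As a small illustration of the markers it lists (my own sketch, not part of the article), Python's codecs module exposes the same BOM constants, so the FF FE / FE FF / EF BB BF prefixes Notepad writes can be reproduced directly:

# Minimal sketch: build the three "signed" encodings of the letter "A" and dump
# their bytes; the leading bytes match the BOMs described in the article.
import codecs

text = "A"
samples = {
    "Unicode (UTF-16 LE)": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode big endian (UTF-16 BE)": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8 with BOM": codecs.BOM_UTF8 + text.encode("utf-8"),
}
for name, data in samples.items():
    print(name + ":", data.hex(" ").upper())
# Unicode (UTF-16 LE): FF FE 41 00
# Unicode big endian (UTF-16 BE): FE FF 00 41
# UTF-8 with BOM: EF BB BF 41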

Python can't encode with surrogateescape

Submitted by 谁说我不能喝 on 2020-01-16 01:36:47
Question: I have a problem with Unicode surrogates encoding in Python (3.4):

>>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed

If I'm not mistaken, according to the Python documentation: 'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF
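The excerpt stops mid-quote of the documentation. As a hedged sketch (my addition, not from the question or its answers), the same byte round-trips cleanly through the UTF-8 codec with surrogateescape, while the UTF-16 codec refuses to encode the lone surrogate, which is exactly the error shown above:

# Contrast the surrogateescape round trip under UTF-8 and UTF-16-BE.
raw = b'\xCC'

# UTF-8: the stray byte decodes to U+DCCC and is restored on encode.
s = raw.decode('utf-8', 'surrogateescape')
assert s == '\udccc'
assert s.encode('utf-8', 'surrogateescape') == raw

# UTF-16-BE: decoding also yields U+DCCC, but encoding rejects lone surrogates.
s = raw.decode('utf-16-be', 'surrogateescape')
try:
    s.encode('utf-16-be', 'surrogateescape')
except UnicodeEncodeError as exc:
    print("round trip failed:", exc)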

Bug with Python UTF-16 output and Windows line endings?

Submitted by 老子叫甜甜 on 2020-01-13 19:44:29
Question: With this code:

test.py:

import sys
import codecs
sys.stdout = codecs.getwriter('utf-16')(sys.stdout)
print "test1"
print "test2"

Then I run it as: test.py > test.txt

In Python 2.6 on Windows 2000, I'm finding that the newline characters are being output as the byte sequence \x0D\x0A\x00, which of course is wrong for UTF-16. Am I missing something, or is this a bug?

Answer 1: Try this:

import sys
import codecs
if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O
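The answer is cut off above. Independent of it, here is a short sketch (my addition) of why the corrupted sequence appears: the UTF-16-LE encoding of "\n" is the two bytes 0A 00, and the Windows text-mode layer rewrites the lone 0A byte to 0D 0A, which yields the observed 0D 0A 00:

# Reproduce the byte mangling described in the question.
encoded = "\n".encode("utf-16-le")            # b'\x0a\x00'
translated = encoded.replace(b"\n", b"\r\n")  # what text-mode newline translation effectively does
print(encoded.hex(" ").upper())               # 0A 00
print(translated.hex(" ").upper())            # 0D 0A 00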

How to best deal with Windows' 16-bit wchar_t ugliness?

Submitted by 拜拜、爱过 on 2020-01-10 04:28:06
Question: I'm writing a wrapper layer to be used with mingw which provides the application with a virtual UTF-8 environment. Functions which deal with filenames are wrappers which convert from UTF-8 and call the corresponding "_w" functions, and so on. The big problem I've run into is that Windows' wchar_t is 16-bit. For filesystem operations, it's not a big deal. I can just convert back and forth between UTF-8 and UTF-16, and everything will work. But the standard C multibyte/wide character conversion
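As an aside (my own illustration, not part of the question), the reason a 16-bit wchar_t complicates things is that code points outside the BMP need two UTF-16 code units; a quick Python round trip makes the unit counts visible:

# One non-BMP code point: 4 bytes in UTF-8, two 16-bit code units (a surrogate pair) in UTF-16.
s = "\U0001F600"
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")
print(len(s), len(utf8), len(utf16) // 2)     # 1 code point, 4 UTF-8 bytes, 2 UTF-16 units
assert utf16.decode("utf-16-le") == utf8.decode("utf-8") == s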

The Meituan interviewer asked me what String.length() is for a single character; I said 1, and the interviewer told me to go home and study up

Submitted by 牧云@^-^@ on 2020-01-06 18:17:56
This article was first published on the WeChat public account 程序员乔戈里.

public class testT {
    public static void main(String[] args) {
        String A = "hi你是乔戈里";
        System.out.println(A.length());
    }
}

The code above prints 7. While explaining, the junior selected the String.length() function in IDEA on Windows and used the Ctrl+B shortcut to jump to the definition of String.length().

/**
 * Returns the length of this string.
 * The length is equal to the number of <a href="Character.html#unicode">Unicode
 * code units</a> in the string.
 *
 * @return the length of the sequence of characters represented by this
 * object.
 */
public int length() {
    return value.length;
}

Then this English was run through Google Translate, giving roughly: it returns the length of the string, and this length is equal to the string's
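The translated Javadoc is cut off above, but its key phrase is "Unicode code units". As an illustration (my addition; the article's code is Java, this sketch is Python), "hi你是乔戈里" contains only BMP characters, so its 7 code points are also 7 UTF-16 code units, whereas a character outside the BMP counts as 2:

# Code points versus UTF-16 code units (the latter is what Java's String.length() counts).
s = "hi你是乔戈里"
emoji = "\U0001F600"
print(len(s), len(s.encode("utf-16-le")) // 2)          # 7 code points, 7 code units
print(len(emoji), len(emoji.encode("utf-16-le")) // 2)  # 1 code point, 2 code units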

Efficient binary-to-string formatting (like base64, but for UTF8/UTF16)?

Submitted by 末鹿安然 on 2020-01-03 09:05:14
Question: I have many bunches of binary data, ranging from 16 to 4096 bytes, which need to be stored to a database and which should be easily comparable as a unit (e.g. two bunches of data match only if the lengths match and all bytes match). Strings are nice for that, but converting binary data blindly to a string is apt to cause problems due to character encoding/reinterpretation issues. Base64 was a common method for storing strings in an era when 7-bit ASCII was the norm; its 33% space penalty was
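The excerpt breaks off while weighing base64's overhead. For scale (my own sketch, not from the question), the standard library's base64 and base85 codecs show the roughly 33% versus 25% expansion on the same blob:

# Compare the expansion of base64 and base85 on 64 random bytes.
import base64, os

blob = os.urandom(64)
b64 = base64.b64encode(blob).decode("ascii")
b85 = base64.b85encode(blob).decode("ascii")
print(len(blob), len(b64), len(b85))          # 64 bytes -> 88 chars -> 80 chars
assert base64.b64decode(b64) == base64.b85decode(b85) == blob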

How to save Excel file as csv with UTF-16 formatting

Submitted by 自作多情 on 2020-01-02 07:49:10
Question: I am having an issue with Excel not saving my files properly. I have a list of data organized into three columns: String, String, Int. I want to read this file into a Java program to perform some calculations. Exporting from Excel as a .csv file causes me to lose significant data as a result of the native UTF-8 encoding. I can save it as a UTF-16 .txt file; however, I then get another annoying result: if I insert columns of commas between the fields, it saves the commas with quotes around them
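One detail worth noting (my addition, and only a sketch under assumptions: the asker's program is Java, and "data.txt" and the column names here are hypothetical): the "Unicode Text" file Excel writes is UTF-16 with a BOM and tab-delimited, so it can be read directly with the tab as the delimiter instead of inserting comma columns by hand:

# Read an Excel "Unicode Text" export: UTF-16 with BOM, tab-separated columns.
import csv

with open("data.txt", encoding="utf-16", newline="") as fh:
    for first, second, count in csv.reader(fh, delimiter="\t"):
        print(first, second, int(count))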

Does std::wstring support UTF-16 and UTF-32 on Windows?

Submitted by 旧城冷巷雨未停 on 2020-01-01 12:16:13
Question: I'm learning about Unicode and have a few questions that I'm hoping to get answered.

1) I've read that on Linux a std::wstring is 4 bytes, while on Windows it's 2 bytes. Does this mean that Linux uses UTF-32 internally while Windows uses UTF-16?
2) Is the std::wstring interface very similar to the std::string interface?
3) Does VC++ offer support for a 4-byte std::wstring?
4) Do you have to change compiler options if you use std::wstring?

As a side note, I came across a string
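On point 1 (my own check, not from the question or its answers): the underlying difference is the size of wchar_t, which Python's ctypes can report for whatever platform the snippet runs on:

# wchar_t is 2 bytes on Windows (UTF-16 code units) and 4 bytes on typical Linux (UTF-32).
import ctypes, sys

print(sys.platform, "wchar_t is", ctypes.sizeof(ctypes.c_wchar), "bytes")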

std::wstring length

Submitted by ╄→尐↘猪︶ㄣ on 2020-01-01 09:43:30
Question: What is the result of the std::wstring.length() function: the length in wchar_t(s) or the length in symbols? And why?

TCHAR r2[3];
r2[0] = 0xD834; // D834, DD1E - musical G clef
r2[1] = 0xDD1E;
r2[2] = 0x0000; // '\0'
std::wstring r = r2;
std::cout << "capacity: " << r.capacity() << std::endl;
std::cout << "length: " << r.length() << std::endl;
std::cout << "size: " << r.size() << std::endl;
std::cout << "max_size: " << r.max_size() << std::endl;

Output:
capacity: 351
length: 2
size: 2
max
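The output above is truncated, but the length/size of 2 is the interesting part. As a cross-check (my addition, in Python rather than C++), U+1D11E really is the surrogate pair D834 DD1E, so any string type that counts 16-bit code units reports 2 for this single symbol:

# U+1D11E (musical G clef) as UTF-16 code units.
clef = "\U0001D11E"
units = clef.encode("utf-16-be")
print(units.hex(" ").upper())       # D8 34 DD 1E
print(len(clef), len(units) // 2)   # 1 code point, 2 code units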

How can I convert wstring to u16string?

Submitted by ≡放荡痞女 on 2020-01-01 06:04:07
Question: I want to convert a wstring to a u16string in C++. I can convert wstring to string, or the reverse, but I don't know how to convert to u16string.

u16string CTextConverter::convertWstring2U16(wstring str)
{
    int iSize;
    u16string szDest[256] = {};
    memset(szDest, 0, 256);
    iSize = WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, NULL, 0, 0, 0);
    WideCharToMultiByte(CP_UTF8, NULL, str.c_str(), -1, szDest, iSize, 0, 0);
    u16string s16 = szDest;
    return s16;
}

Error in WideCharToMultiByte(CP_UTF8, NULL, str.c_str(
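The quoted code is what the asker is trying to fix, so it is left as-is above. Conceptually (my own sketch, in Python rather than C++), going from a 4-byte wide string to a 16-bit one is a UTF-32 to UTF-16 transcode rather than a byte copy, because non-BMP code points become surrogate pairs; on Windows, where wchar_t is already 16-bit, the two representations coincide:

# UTF-32 -> UTF-16: unit counts differ once a non-BMP code point is involved.
s = "Ω\U0001F600"
utf32 = s.encode("utf-32-le")   # what a 4-byte wchar_t string would hold
utf16 = s.encode("utf-16-le")   # what a char16_t / u16string holds
print(len(utf32) // 4, "UTF-32 units ->", len(utf16) // 2, "UTF-16 units")  # 2 -> 3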