UTF-16

Is there a standard technique for packing binary data into a UTF-16 string?

爷,独闯天下 submitted on 2019-11-30 08:20:35
Question: (In .NET) I have arbitrary binary data stored in a byte[] (an image, for example). Now I need to store that data in a string (a "Comment" field of a legacy API). Is there a standard technique for packing this binary data into a string? By "packing" I mean that for any reasonably large and random data set, bytes.Length/2 is about the same as packed.Length, because two bytes are more or less a single character. The two "obvious" answers don't meet all the criteria: string base64 = System
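
The excerpt cuts off before the "obvious" answers, but the density criterion above (roughly two bytes per character) can be illustrated independently of .NET. Below is a minimal Python sketch, assuming that any BMP character outside the control and surrogate ranges is acceptable to the legacy "Comment" field; pack15 and unpack15 are made-up names, not a standard API.

import math

BASE = 0x8000      # 15 payload bits per character
OFFSET = 0x0800    # first code point used: U+0800..U+87FF avoids ASCII
                   # controls and the surrogate range, so the string
                   # survives a UTF-16 round trip

def pack15(data: bytes) -> str:
    nchars = math.ceil(len(data) * 8 / 15)
    pad = nchars * 15 - len(data) * 8              # 0..14 padding bits
    n = int.from_bytes(data, "big")
    digits = []
    for _ in range(nchars):                        # base-32768 digits, least significant first
        digits.append(chr(OFFSET + (n % BASE)))
        n //= BASE
    # The first character records the padding so unpack15 can recover the exact length.
    return chr(OFFSET + pad) + "".join(reversed(digits))

def unpack15(packed: str) -> bytes:
    pad = ord(packed[0]) - OFFSET
    n = 0
    for ch in packed[1:]:
        n = n * BASE + (ord(ch) - OFFSET)
    nbytes = ((len(packed) - 1) * 15 - pad) // 8
    return n.to_bytes(nbytes, "big")

data = bytes(range(256)) * 4                       # 1024 arbitrary bytes
packed = pack15(data)
assert unpack15(packed) == data
print(len(data), len(packed))                      # 1024 bytes -> 548 characters

Base64, for comparison, costs about 4 characters for every 3 bytes, which is why it fails the bytes.Length/2 criterion.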

R write.csv with UTF-16 encoding

℡╲_俬逩灬. submitted on 2019-11-30 06:55:39
I'm having trouble writing out a data.frame with write.csv using UTF-16 character encoding. Background: I am trying to write a CSV file from a data.frame for use in Excel. Excel Mac 2011 seems to dislike UTF-8 (if I specify UTF-8 during text import, non-ASCII characters show up as underscores). I've been led to believe that Excel will be happy with UTF-16LE encoding. Here's the example data.frame:

> foo
  a  b
1 á 羽
> Encoding(levels(foo$a))
[1] "UTF-8"
> Encoding(levels(foo$b))
[1] "UTF-8"

So I tried to output the data.frame by doing:

f <- file("foo.csv", encoding="UTF-16LE")
write.csv(foo,
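
The write.csv call is cut off above. Purely as a point of comparison (Python rather than R, and with no claim about what Excel Mac 2011 will accept), here is a minimal sketch of the kind of file the question is aiming for: a CSV encoded as UTF-16 with a leading byte-order mark. The file name and rows simply mirror the example data.

import csv

rows = [["a", "b"], ["á", "羽"]]

with open("foo.csv", "w", newline="", encoding="utf-16") as f:
    # Python's "utf-16" codec writes a BOM and uses the platform's native
    # byte order (little-endian on x86); "utf-16-le" would omit the BOM.
    csv.writer(f).writerows(rows)

with open("foo.csv", "rb") as f:
    print(f.read(2))    # b'\xff\xfe' on a little-endian machine: the UTF-16LE BOM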

Converting UTF-16 to UTF-8

自闭症网瘾萝莉.ら submitted on 2019-11-30 06:06:28
Question: I'm loading a string from a file. When I print out the string with:

print my_string
print binascii.hexlify(my_string)

I get:

2DF5
0032004400460035

meaning this string is UTF-16. I would like to convert this string to UTF-8, so that the above code produces this output:

2DF5
32444635

I've tried:

my_string.decode('utf-8')

which outputs:

32004400460035

EDIT: Here's a quick sample:

hello = 'hello'.encode('utf-16')
print hello
print binascii.hexlify(hello)
hello = hello[2:].decode('utf-8')
print
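
The hex dump 0032004400460035 is "2DF5" encoded as UTF-16 big-endian with no BOM, so the missing step is to decode the raw bytes with a UTF-16 codec first and only then encode to UTF-8. A short sketch of that round trip (written in Python 3 syntax, whereas the excerpt uses Python 2):

import binascii

my_bytes = b'\x00\x32\x00\x44\x00\x46\x00\x35'   # the bytes read from the file
text = my_bytes.decode('utf-16-be')              # -> '2DF5'
utf8_bytes = text.encode('utf-8')

print(binascii.hexlify(my_bytes))                # b'0032004400460035'
print(binascii.hexlify(utf8_bytes))              # b'32444635'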

Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a Windows GUI

时间秒杀一切 submitted on 2019-11-30 03:23:47
I'm working on an English-only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really has much of an understanding beyond that. I have already read the question titled "std::wstring VS std::string". It was very helpful, but I still don't quite understand how to apply all of that information to my problem. The program I'm working on displays data in a Windows GUI. That data is persisted as XML. We often transform that XML using XSLT into HTML or XSL:FO for reporting purposes. My feeling based on what I have read is that the HTML should be

Is there a drastic difference between UTF-8 and UTF-16?

一曲冷凌霜 submitted on 2019-11-30 02:27:41
I call a web service that gives me back a response XML with UTF-8 encoding. I checked that in Java using the getAllHeaders() method. Now, in my Java code, I take that response and do some processing on it, and later pass it on to a different service. Now, I googled a bit and found out that by default the encoding for strings in Java is UTF-16. In my response XML, one of the elements had the character É. This got mangled in the post-processing request that I make to the other service: instead of sending É, it sent gibberish. Now I wanted to know, will there be really a lot
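
The gibberish is most likely mojibake: the UTF-8 bytes of É being re-read somewhere downstream under a single-byte charset. A small illustration (in Python rather than Java, with cp1252 standing in for whichever wrong charset is applied):

utf8_bytes = "É".encode("utf-8")       # É is U+00C9
print(utf8_bytes)                      # b'\xc3\x89' -- two bytes in UTF-8
print(utf8_bytes.decode("cp1252"))     # Ã‰  <- garbage of the kind described
print(utf8_bytes.decode("utf-8"))      # É   <- decoding with the right charset is lossless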

findstr or grep that autodetects character encoding (UTF-16)

不想你离开。 submitted on 2019-11-30 01:41:03
Question: I want to do this:

findstr /s /c:some-symbol *

or the grep equivalent:

grep -R some-symbol *

but I need the utility to autodetect files encoded in UTF-16 (and friends) and search them appropriately. My files even have the byte-order mark FFFE in them, so I'm not even looking for heroic autodetection. Any suggestions? I'm referring to Windows Vista and XP.

Answer 1: Thanks for the suggestions. I was referring to Windows Vista and XP. I also discovered this workaround, using free Sysinternals
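
The workaround mentioned above is cut off. As an illustration of the autodetection being asked for (a sketch, not a replacement for findstr or grep), here is a small Python script that sniffs each file's BOM, decodes accordingly, and then searches; the pattern and the recursive glob are placeholders.

import codecs
import glob

def sniff_decode(raw: bytes) -> str:
    # Files starting with FF FE or FE FF are UTF-16; the "utf-16" codec
    # consumes the BOM and picks the right byte order automatically.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    return raw.decode("utf-8", errors="replace")    # fallback guess

pattern = "some-symbol"
for path in glob.glob("**/*", recursive=True):
    try:
        with open(path, "rb") as f:
            text = sniff_decode(f.read())
    except OSError:
        continue                                    # directories, unreadable files
    for lineno, line in enumerate(text.splitlines(), 1):
        if pattern in line:
            print(f"{path}:{lineno}:{line}")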

Is there any reason to prefer UTF-16 over UTF-8?

北城以北 submitted on 2019-11-30 01:31:27
Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16. However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking that it might be for historic reasons, or perhaps for performance reasons, but couldn't find any information. Does anyone know why these languages chose UTF-16? And is there any valid reason for me to do the same? EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links. East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of
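
The storage claim in that linked answer is easy to check: BMP CJK characters take 3 bytes each in UTF-8 but only 2 bytes in UTF-16, while ASCII goes the other way. A quick Python check with made-up sample strings:

samples = {"English": "Hello, world", "CJK": "统一码字符编码"}
for label, s in samples.items():
    print(label, len(s), "chars:",
          len(s.encode("utf-8")), "bytes as UTF-8,",
          len(s.encode("utf-16-le")), "bytes as UTF-16")
# English 12 chars: 12 bytes as UTF-8, 24 bytes as UTF-16
# CJK 7 chars: 21 bytes as UTF-8, 14 bytes as UTF-16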

Unicode and UTF-8, UTF-16, UTF-32

假装没事ソ submitted on 2019-11-29 21:51:30
Unicode

Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a unified and unique binary code to every character of every language, so that text can be converted and processed across languages and platforms.

Origin: Computers can only handle numbers, so to process text it first has to be turned into numbers. The earliest ASCII table was a unified encoding for upper- and lower-case English letters, digits and a few symbols. But to represent Chinese, Japanese, Korean and so on, one byte is clearly not enough; at least two bytes are needed, and they must not clash with ASCII. A single encoding covering all scripts was therefore needed, and Unicode was born. Unicode usually represents a character with two bytes; the existing single-byte English codes simply become two bytes by filling the high byte with zeros. A Unicode character is normally written as "U+" followed by a group of hexadecimal digits.

Purpose: Unicode lets computers convert and process text across languages and platforms.

How: Unicode is a character encoding scheme, defined by an international body, that can hold all of the world's characters and symbols. Its characters are currently arranged in 17 planes, covering 0x0000 to 0x10FFFF. UTF-8, UTF-16 and UTF-32 are all encoding schemes that turn these numbers into program data.

UTF-8

Definition: UTF-8 encodes Unicode in units of bytes. The encoding rules from Unicode to UTF-8 are as follows:
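
The excerpt is cut off before its encoding table. As a stand-in illustration of the variable-length layout it is leading up to (not the blog's own table), a few lines of Python showing how many bytes some code points take in UTF-8 and what their bit patterns look like:

for ch in ["A", "é", "马", "😀"]:
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(b)} byte(s):", " ".join(f"{x:08b}" for x in b))
# U+0041 -> 1 byte(s): 01000001
# U+00E9 -> 2 byte(s): 11000011 10101001
# U+9A6C -> 3 byte(s): 11101001 10101001 10101100
# U+1F600 -> 4 byte(s): 11110000 10011111 10011000 10000000

The leading bits of the first byte (0, 110, 1110, 11110) tell a decoder how many bytes the character occupies, and every continuation byte starts with 10.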

Unicode, UTF-8, UTF-16, UTF-32

為{幸葍}努か submitted on 2019-11-29 21:30:51
What are Unicode, UTF-8, UTF-16 and UTF-32? To understand what they are and how they relate to each other, we have to start from the ASCII code we already know.

ASCII

In a computer, every eight binary digits form one byte, so originally people used 8-bit binary codes (with the first bit set to 0) to encode the English letters; for example, "01000001" stands for the capital letter A. That only gives 128 combinations, which was enough in the United States but not for the world's other languages, so many countries set the first bit to 1 and created additional characters. This brought a new problem: different countries assigned different meanings to the extra 128 values, so different countries had different encodings, and if you didn't know the other side's encoding you got garbled text.

Unicode

Unicode appeared to fix this. Unicode assigns every character in the world a unique number, ranging from 0x000000 to 0x10FFFF, more than 1.1 million values. The number is usually written in hexadecimal with a "U+" prefix; for example, the Unicode of "马" is U+9A6C. Unicode itself only specifies each character's number; it says nothing about how that number is stored. For storage there are several schemes, chiefly UTF-8, UTF-16 and UTF-32.

UTF-32

UTF-32 simply stores the character's number as a binary integer in four bytes. It is the most direct conversion: "马" (U+9A6C) becomes the binary 1001101001101100, zero-padded to 32 bits. Note
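
The excerpt breaks off at the note above. As a quick check of the "马" example in the UTF-32 paragraph, a few lines of Python:

ch = "马"
print(hex(ord(ch)))                          # 0x9a6c
be = ch.encode("utf-32-be")                  # big-endian, no BOM
print(be.hex())                              # 00009a6c
print(" ".join(f"{b:08b}" for b in be))      # 00000000 00000000 10011010 01101100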

Utf8_general_ci or utf8mb4 or…?

假装没事ソ submitted on 2019-11-29 20:41:21
utf16 or utf32? I'm trying to store content in a lot of languages. Some of the languages use double-wide fonts (for example, Japanese fonts are frequently twice as wide as English fonts). I'm not sure which kind of database I should be using. Any information about the differences between these four charsets...

Answer (Ignacio Vazquez-Abrams): MySQL's utf32 and utf8mb4 (as well as standard UTF-8) can directly store any character specified by Unicode; the former is fixed-size at 4 bytes per character, whereas the latter is between 1 and 4 bytes per character. utf8mb3 and the original utf8 can only store
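
The answer is cut off above. Independent of MySQL itself, the distinction it is drawing can be checked in a few lines of Python with arbitrary sample characters: utf8mb3 stores at most 3 bytes per character, so characters whose UTF-8 form needs 4 bytes (anything outside the BMP) do not fit, while utf8mb4 and utf32 hold them fine.

for ch in ["é", "羽", "😀"]:                  # the emoji is outside the BMP
    print(f"U+{ord(ch):04X}:",
          len(ch.encode("utf-8")), "bytes in UTF-8,",
          len(ch.encode("utf-16-le")) // 2, "UTF-16 code unit(s)")
# U+00E9: 2 bytes in UTF-8, 1 UTF-16 code unit(s)
# U+7FBD: 3 bytes in UTF-8, 1 UTF-16 code unit(s)
# U+1F600: 4 bytes in UTF-8, 2 UTF-16 code unit(s)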