UTF-16

Is there a standard technique for packing binary data into a UTF-16 string?

爷,独闯天下 submitted on 2019-11-30 08:20:35
Question: (In .NET) I have arbitrary binary data stored in a byte[] (an image, for example). Now I need to store that data in a string (a "Comment" field of a legacy API). Is there a standard technique for packing this binary data into a string? By "packing" I mean that for any reasonably large and random data set, bytes.Length/2 is about the same as packed.Length, because two bytes are more or less a single character. The two "obvious" answers don't meet all the criteria: string base64 = System
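
The excerpt cuts off before the "obvious" answers, but the density criterion above (roughly two bytes per character) can be illustrated independently of .NET. Below is a minimal Python sketch, assuming that any BMP character outside the control and surrogate ranges is acceptable to the legacy "Comment" field; pack15 and unpack15 are made-up names, not a standard API.

import math

BASE = 0x8000      # 15 payload bits per character
OFFSET = 0x0800    # first code point used: U+0800..U+87FF avoids ASCII
                   # controls and the surrogate range, so the string
                   # survives a UTF-16 round trip

def pack15(data: bytes) -> str:
    nchars = math.ceil(len(data) * 8 / 15)
    pad = nchars * 15 - len(data) * 8              # 0..14 padding bits
    n = int.from_bytes(data, "big")
    digits = []
    for _ in range(nchars):                        # base-32768 digits, least significant first
        digits.append(chr(OFFSET + (n % BASE)))
        n //= BASE
    # The first character records the padding so unpack15 can recover the exact length.
    return chr(OFFSET + pad) + "".join(reversed(digits))

def unpack15(packed: str) -> bytes:
    pad = ord(packed[0]) - OFFSET
    n = 0
    for ch in packed[1:]:
        n = n * BASE + (ord(ch) - OFFSET)
    nbytes = ((len(packed) - 1) * 15 - pad) // 8
    return n.to_bytes(nbytes, "big")

data = bytes(range(256)) * 4                       # 1024 arbitrary bytes
packed = pack15(data)
assert unpack15(packed) == data
print(len(data), len(packed))                      # 1024 bytes -> 548 characters

Base64, for comparison, costs about 4 characters for every 3 bytes, which is why it fails the bytes.Length/2 criterion.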

R write.csv with UTF-16 encoding

℡╲_俬逩灬. submitted on 2019-11-30 06:55:39
I'm having trouble writing out a data.frame with write.csv using UTF-16 character encoding. Background: I am trying to write a CSV file from a data.frame for use in Excel. Excel Mac 2011 seems to dislike UTF-8 (if I specify UTF-8 during text import, non-ASCII characters show up as underscores). I've been led to believe that Excel will be happy with UTF-16LE encoding. Here's the example data.frame:

> foo
  a  b
1 á 羽
> Encoding(levels(foo$a))
[1] "UTF-8"
> Encoding(levels(foo$b))
[1] "UTF-8"

So I tried to output the data.frame by doing:

f <- file("foo.csv", encoding="UTF-16LE")
write.csv(foo,
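
The write.csv call is cut off above. Purely as a point of comparison (Python rather than R, and with no claim about what Excel Mac 2011 will accept), here is a minimal sketch of the kind of file the question is aiming for: a CSV encoded as UTF-16 with a leading byte-order mark. The file name and rows simply mirror the example data.

import csv

rows = [["a", "b"], ["á", "羽"]]

with open("foo.csv", "w", newline="", encoding="utf-16") as f:
    # Python's "utf-16" codec writes a BOM and uses the platform's native
    # byte order (little-endian on x86); "utf-16-le" would omit the BOM.
    csv.writer(f).writerows(rows)

with open("foo.csv", "rb") as f:
    print(f.read(2))    # b'\xff\xfe' on a little-endian machine: the UTF-16LE BOM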

Converting UTF-16 to UTF-8

自闭症网瘾萝莉.ら submitted on 2019-11-30 06:06:28
Question: I'm loading a string from a file. When I print out the string with:

print my_string
print binascii.hexlify(my_string)

I get:

2DF5
0032004400460035

meaning this string is UTF-16. I would like to convert this string to UTF-8, so that the above code produces this output:

2DF5
32444635

I've tried:

my_string.decode('utf-8')

which outputs:

32004400460035

EDIT: Here's a quick sample:

hello = 'hello'.encode('utf-16')
print hello
print binascii.hexlify(hello)
hello = hello[2:].decode('utf-8')
print
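
The hex dump 0032004400460035 is "2DF5" encoded as UTF-16 big-endian with no BOM, so the missing step is to decode the raw bytes with a UTF-16 codec first and only then encode to UTF-8. A short sketch of that round trip (written in Python 3 syntax, whereas the excerpt uses Python 2):

import binascii

my_bytes = b'\x00\x32\x00\x44\x00\x46\x00\x35'   # the bytes read from the file
text = my_bytes.decode('utf-16-be')              # -> '2DF5'
utf8_bytes = text.encode('utf-8')

print(binascii.hexlify(my_bytes))                # b'0032004400460035'
print(binascii.hexlify(utf8_bytes))              # b'32444635'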

Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a Windows GUI

时间秒杀一切 submitted on 2019-11-30 03:23:47
I'm working on an English-only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really has much of an understanding beyond that. I have already read the question titled "std::wstring VS std::string". It was very helpful, but I still don't quite understand how to apply all of that information to my problem. The program I'm working on displays data in a Windows GUI. That data is persisted as XML. We often transform that XML using XSLT into HTML or XSL:FO for reporting purposes. My feeling based on what I have read is that the HTML should be

Is there a drastic difference between UTF-8 and UTF-16?

一曲冷凌霜 submitted on 2019-11-30 02:27:41
I call a web service that gives me back a response XML with UTF-8 encoding. I checked that in Java using the getAllHeaders() method. Now, in my Java code, I take that response and do some processing on it, and later pass it on to a different service. Now, I googled a bit and found out that by default the encoding for strings in Java is UTF-16. In my response XML, one of the elements had the character É. This got mangled in the post-processing request that I make to the other service: instead of sending É, it sent gibberish. Now I wanted to know, will there be really a lot
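
The gibberish is most likely mojibake: the UTF-8 bytes of É being re-read somewhere downstream under a single-byte charset. A small illustration (in Python rather than Java, with cp1252 standing in for whichever wrong charset is applied):

utf8_bytes = "É".encode("utf-8")       # É is U+00C9
print(utf8_bytes)                      # b'\xc3\x89' -- two bytes in UTF-8
print(utf8_bytes.decode("cp1252"))     # Ã‰  <- garbage of the kind described
print(utf8_bytes.decode("utf-8"))      # É   <- decoding with the right charset is lossless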

findstr or grep that autodetects character encoding (UTF-16)

不想你离开。 submitted on 2019-11-30 01:41:03
Question: I want to do this:

findstr /s /c:some-symbol *

or the grep equivalent:

grep -R some-symbol *

but I need the utility to autodetect files encoded in UTF-16 (and friends) and search them appropriately. My files even have the byte-order mark FFFE in them, so I'm not even looking for heroic autodetection. Any suggestions? I'm referring to Windows Vista and XP.

Answer 1: Thanks for the suggestions. I was referring to Windows Vista and XP. I also discovered this workaround, using free Sysinternals
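
The workaround mentioned above is cut off. As an illustration of the autodetection being asked for (a sketch, not a replacement for findstr or grep), here is a small Python script that sniffs each file's BOM, decodes accordingly, and then searches; the pattern and the recursive glob are placeholders.

import codecs
import glob

def sniff_decode(raw: bytes) -> str:
    # Files starting with FF FE or FE FF are UTF-16; the "utf-16" codec
    # consumes the BOM and picks the right byte order automatically.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    return raw.decode("utf-8", errors="replace")    # fallback guess

pattern = "some-symbol"
for path in glob.glob("**/*", recursive=True):
    try:
        with open(path, "rb") as f:
            text = sniff_decode(f.read())
    except OSError:
        continue                                    # directories, unreadable files
    for lineno, line in enumerate(text.splitlines(), 1):
        if pattern in line:
            print(f"{path}:{lineno}:{line}")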

Is there any reason to prefer UTF-16 over UTF-8?

北城以北 submitted on 2019-11-30 01:31:27
Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16. However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking that it might be for historic reasons, or perhaps for performance reasons, but couldn't find any information. Does anyone know why these languages chose UTF-16? And is there any valid reason for me to do the same? EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links. East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of
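
The storage claim in that linked answer is easy to check: BMP CJK characters take 3 bytes each in UTF-8 but only 2 bytes in UTF-16, while ASCII goes the other way. A quick Python check with made-up sample strings:

samples = {"English": "Hello, world", "CJK": "统一码字符编码"}
for label, s in samples.items():
    print(label, len(s), "chars:",
          len(s.encode("utf-8")), "bytes as UTF-8,",
          len(s.encode("utf-16-le")), "bytes as UTF-16")
# English 12 chars: 12 bytes as UTF-8, 24 bytes as UTF-16
# CJK 7 chars: 21 bytes as UTF-8, 14 bytes as UTF-16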

Unicode and UTF-8, UTF-16, UTF-32

假装没事ソ submitted on 2019-11-29 21:51:30
Unicode

Unicode was created to overcome the limitations of traditional character encoding schemes. It assigns a unified and unique binary code to every character of every language, so that text can be converted and processed across languages and platforms.

Origin: Computers can only handle numbers, so to process text it first has to be turned into numbers. The earliest ASCII table was a unified encoding for upper- and lower-case English letters, digits and a few symbols. But to represent Chinese, Japanese, Korean and so on, one byte is clearly not enough; at least two bytes are needed, and they must not clash with ASCII. A single encoding covering all scripts was therefore needed, and Unicode was born. Unicode usually represents a character with two bytes; the existing single-byte English codes simply become two bytes by filling the high byte with zeros. A Unicode character is normally written as "U+" followed by a group of hexadecimal digits.

Purpose: Unicode lets computers convert and process text across languages and platforms.

How: Unicode is a character encoding scheme, defined by an international body, that can hold all of the world's characters and symbols. Its characters are currently arranged in 17 planes, covering 0x0000 to 0x10FFFF. UTF-8, UTF-16 and UTF-32 are all encoding schemes that turn these numbers into program data.

UTF-8

Definition: UTF-8 encodes Unicode in units of bytes. The encoding rules from Unicode to UTF-8 are as follows:
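
The excerpt is cut off before its encoding table. As a stand-in illustration of the variable-length layout it is leading up to (not the blog's own table), a few lines of Python showing how many bytes some code points take in UTF-8 and what their bit patterns look like:

for ch in ["A", "é", "马", "😀"]:
    b = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(b)} byte(s):", " ".join(f"{x:08b}" for x in b))
# U+0041 -> 1 byte(s): 01000001
# U+00E9 -> 2 byte(s): 11000011 10101001
# U+9A6C -> 3 byte(s): 11101001 10101001 10101100
# U+1F600 -> 4 byte(s): 11110000 10011111 10011000 10000000

The leading bits of the first byte (0, 110, 1110, 11110) tell a decoder how many bytes the character occupies, and every continuation byte starts with 10.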

Unicode, UTF-8, UTF-16, UTF-32

為{幸葍}努か submitted on 2019-11-29 21:30:51
What are Unicode, UTF-8, UTF-16 and UTF-32? To understand what they are and how they relate to each other, we have to start from the ASCII code we already know.

ASCII

In a computer, every eight binary digits form one byte, so originally people used 8-bit binary codes (with the first bit set to 0) to encode the English letters; for example, "01000001" stands for the capital letter A. That only gives 128 combinations, which was enough in the United States but not for the world's other languages, so many countries set the first bit to 1 and created additional characters. This brought a new problem: different countries assigned different meanings to the extra 128 values, so different countries had different encodings, and if you didn't know the other side's encoding you got garbled text.

Unicode

Unicode appeared to fix this. Unicode assigns every character in the world a unique number, ranging from 0x000000 to 0x10FFFF, more than 1.1 million values. The number is usually written in hexadecimal with a "U+" prefix; for example, the Unicode of "马" is U+9A6C. Unicode itself only specifies each character's number; it says nothing about how that number is stored. For storage there are several schemes, chiefly UTF-8, UTF-16 and UTF-32.

UTF-32

UTF-32 simply stores the character's number as a binary integer in four bytes. It is the most direct conversion: "马" (U+9A6C) becomes the binary 1001101001101100, zero-padded to 32 bits. Note
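
The excerpt breaks off at the note above. As a quick check of the "马" example in the UTF-32 paragraph, a few lines of Python:

ch = "马"
print(hex(ord(ch)))                          # 0x9a6c
be = ch.encode("utf-32-be")                  # big-endian, no BOM
print(be.hex())                              # 00009a6c
print(" ".join(f"{b:08b}" for b in be))      # 00000000 00000000 10011010 01101100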

Utf8_general_ci or utf8mb4 or…?

假装没事ソ submitted on 2019-11-29 20:41:21
utf16 or utf32? I'm trying to store content in a lot of languages. Some of the languages use double-wide fonts (for example, Japanese fonts are frequently twice as wide as English fonts). I'm not sure which kind of database I should be using. Any information about the differences between these four charsets...

Answer (Ignacio Vazquez-Abrams): MySQL's utf32 and utf8mb4 (as well as standard UTF-8) can directly store any character specified by Unicode; the former is fixed-size at 4 bytes per character, whereas the latter is between 1 and 4 bytes per character. utf8mb3 and the original utf8 can only store
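
The answer is cut off above. Independent of MySQL itself, the distinction it is drawing can be checked in a few lines of Python with arbitrary sample characters: utf8mb3 stores at most 3 bytes per character, so characters whose UTF-8 form needs 4 bytes (anything outside the BMP) do not fit, while utf8mb4 and utf32 hold them fine.

for ch in ["é", "羽", "😀"]:                  # the emoji is outside the BMP
    print(f"U+{ord(ch):04X}:",
          len(ch.encode("utf-8")), "bytes in UTF-8,",
          len(ch.encode("utf-16-le")) // 2, "UTF-16 code unit(s)")
# U+00E9: 2 bytes in UTF-8, 1 UTF-16 code unit(s)
# U+7FBD: 3 bytes in UTF-8, 1 UTF-16 code unit(s)
# U+1F600: 4 bytes in UTF-8, 2 UTF-16 code unit(s)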