UTF-16

Among UTF-16, UTF-16BE, and UTF-16LE, does plain UTF-16 follow the computer's endianness?

回眸只為那壹抹淺笑 submitted on 2019-12-01 03:48:26
Question: UTF-16 encodes text in 16-bit code units. Swapping the order of the two bytes in each unit yields UTF-16BE and UTF-16LE. But I find that the encoding name UTF-16 exists in Ubuntu's gedit text editor alongside UTF-16BE and UTF-16LE. With a C test program I found that my computer is little-endian, and that UTF-16 behaves the same as UTF-16LE there. Also: a value (such as an integer) has two possible byte orders on little/big-endian computers. Little endian computers will produce little endian values in
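
The distinction the question is circling can be demonstrated in a few lines of Python (a sketch; the byte values are standard, but the script itself is not from the question):

```python
import sys

s = "A"  # U+0041

# The explicit variants fix the byte order of each 16-bit code unit:
assert s.encode("utf-16le") == b"\x41\x00"  # low byte first
assert s.encode("utf-16be") == b"\x00\x41"  # high byte first

# Plain "utf-16" writes a BOM and then uses the platform's native order,
# which is why a little-endian machine sees it behave like UTF-16LE.
encoded = s.encode("utf-16")
if sys.byteorder == "little":
    assert encoded == b"\xff\xfe" + b"\x41\x00"
else:
    assert encoded == b"\xfe\xff" + b"\x00\x41"
```

A decoder for plain "utf-16" uses the BOM, when present, to pick the byte order, so data written this way round-trips across architectures.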

How do I convert a string in UTF-16 to UTF-8 in C++

我与影子孤独终老i submitted on 2019-12-01 01:52:58
Consider:

    STDMETHODIMP CFileSystemAPI::setRRConfig( BSTR config_str, VARIANT* ret )
    {
      mReportReaderFactory.reset( new sbis::report_reader::ReportReaderFactory() );
      USES_CONVERSION;
      std::string configuration_str = W2A( config_str );

But in config_str I get a string in UTF-16. How can I convert it to UTF-8 in this piece of code? beardedN5rd answered: if you are using C++11, check out http://www.cplusplus.com/reference/codecvt/codecvt_utf8_utf16/. You can do something like this:

    std::string WstrToUtf8Str(const std::wstring& wstr)
    {
      std::string retStr;
      if (!wstr.empty())
      {
        int sizeRequired =
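
As a language-neutral illustration of what the suggested approach does (decode the UTF-16 input, re-encode the same code points as UTF-8), here is a Python sketch; the function name is invented for the example:

```python
def utf16_to_utf8(data: bytes) -> bytes:
    # A BSTR carries UTF-16LE code units; decode them to code points,
    # then serialize the same code points as UTF-8.
    return data.decode("utf-16le").encode("utf-8")

payload = "héllo".encode("utf-16le")
assert utf16_to_utf8(payload) == b"h\xc3\xa9llo"  # é is C3 A9 in UTF-8
```

In C++11 terms this is what std::wstring_convert with std::codecvt_utf8_utf16 performs in one call (deprecated since C++17); on Windows, WideCharToMultiByte with CP_UTF8 is the usual alternative.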

How to convert Rust strings to UTF-16?

让人想犯罪 __ submitted on 2019-12-01 01:21:31
Question: Editor's note: this code example is from a version of Rust prior to 1.0 and is not valid Rust 1.0 code, but the answers still contain valuable information. I want to pass a string literal to a Windows API. Many Windows functions use UTF-16 as the string encoding, while Rust's native strings are UTF-8. I know Rust has utf16_units() to produce a UTF-16 code-unit iterator, but I don't know how to use that function to produce a UTF-16 string terminated by a zero code unit. I'm producing the UTF-16
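
In Rust 1.0 and later the iterator is str::encode_utf16; chaining a zero onto it gives the NUL-terminated buffer Windows expects. The shape of the resulting data can be sketched in Python (helper name invented for the example):

```python
def to_wide_null_terminated(s: str) -> list[int]:
    # UTF-16 code units of s followed by a terminating 0,
    # the layout a Windows LPCWSTR argument expects.
    raw = s.encode("utf-16le")
    units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]
    return units + [0]

assert to_wide_null_terminated("Hi") == [0x48, 0x69, 0]
# A supplementary character becomes a surrogate pair:
assert to_wide_null_terminated("\U0001d10c") == [0xD834, 0xDD0C, 0]
```

In current Rust the equivalent one-liner is s.encode_utf16().chain(std::iter::once(0)).collect::<Vec<u16>>().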

How to get a reliable unicode character count in Python?

大兔子大兔子 submitted on 2019-11-30 19:26:05
Google App Engine uses Python 2.5.2, apparently with UCS-4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, when you retrieve it you get '\U0001d10c' (length 1). I'm trying to count the number of Unicode characters in the string in a way that gives the same result before and after storing it. So I'm trying to normalize the string (from u'\ud834\udd0c' to '\U0001d10c') as soon as I receive it, before calculating its length and putting it in the datastore. I know I can just encode it to UTF-8 and then decode again, but is there
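
On a narrow (UCS-2 style) build, a supplementary character surfaces as a surrogate pair of length 2; counting code points rather than code units gives a stable answer either way. A Python sketch (function name invented):

```python
def code_point_count(s: str) -> int:
    # Count code points, treating a high surrogate followed by a low
    # surrogate as one character, the way a narrow build splits it.
    count, i = 0, 0
    while i < len(s):
        if ("\ud800" <= s[i] <= "\udbff" and i + 1 < len(s)
                and "\udc00" <= s[i + 1] <= "\udfff"):
            i += 1  # skip the low half of the pair
        count += 1
        i += 1
    return count

assert code_point_count("\ud834\udd0c") == 1  # surrogate pair, one character
assert code_point_count("\U0001d10c") == 1    # same character, already joined
assert code_point_count("abc") == 3
```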

Why does Powershell file concatenation convert UTF8 to UTF16?

妖精的绣舞 submitted on 2019-11-30 16:53:23
Question: I am running the following PowerShell script to concatenate a series of output files into a single CSV file. The files are named whidataXX.htm (where XX is a two-digit sequential number), and the number of files created varies from run to run.

    $metadataPath = "\\ServerPath\foo"
    function concatenateMetadata {
      $cFile = $metadataPath + "whiconcat.csv"
      Clear-Content $cFile
      $metadataFiles = gci $metadataPath
      $iterations = $metadataFiles.Count
      for ($i=0; $i -le $iterations-1; $i++) {
        $iFile = "whidata"+$i+".htm"
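
The likely culprit is not the concatenation itself: in Windows PowerShell 5.1 and earlier, Out-File and the > redirection operator default to UTF-16LE ("Unicode"), so mixing them with UTF-8 inputs yields a mixed-encoding file; Out-File -Encoding utf8 avoids it. Sniffing each file's BOM before appending makes such a mismatch visible; a Python sketch (helper name invented):

```python
def sniff_bom(raw: bytes) -> str:
    # Report the encoding announced by a leading byte-order mark, if any.
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if raw.startswith(b"\xff\xfe"):
        return "utf-16le"
    if raw.startswith(b"\xfe\xff"):
        return "utf-16be"
    return "no BOM"

assert sniff_bom("x".encode("utf-8-sig")) == "utf-8-sig"
assert sniff_bom("x".encode("utf-16")) in ("utf-16le", "utf-16be")
assert sniff_bom(b"plain ascii") == "no BOM"
```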

How to force UTF-16 while reading/writing in Java?

喜你入骨 submitted on 2019-11-30 15:06:55
I see that you can specify UTF-16 as the charset via Charset.forName("UTF-16"), and that you can create a new UTF-16 decoder via Charset.forName("UTF-16").newDecoder(), but I only see the ability to specify a CharsetDecoder on InputStreamReader's constructor. So how do you specify UTF-16 while reading any stream in Java? Input streams deal with raw bytes. When you read directly from an input stream, all you get is raw bytes, for which character sets are irrelevant. The interpretation of raw bytes as characters, by definition, requires some sort of translation: how do I translate
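
The Java answer is to wrap the stream: new InputStreamReader(inputStream, StandardCharsets.UTF_16) performs exactly that byte-to-character translation. The same layering, sketched in Python:

```python
import io

# A raw byte stream; the leading BOM lets the decoder pick the byte order.
raw = io.BytesIO("héllo".encode("utf-16"))

# Layer an incremental UTF-16 decoder over the byte stream, the analogue
# of Java's InputStreamReader(inputStream, charset).
reader = io.TextIOWrapper(raw, encoding="utf-16")
assert reader.read() == "héllo"
```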

A brief analysis of Java character encodings

ⅰ亾dé卋堺 submitted on 2019-11-30 14:31:48
This article actually started from a question: can Java's char type store a Chinese character? UTF-8 encoding: UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters take two or four bytes) and UTF-32 (characters take four bytes), but they are rarely used on the Internet. To repeat, the relationship is: UTF-8 is one implementation of Unicode. UTF-8's most notable feature is that it is a variable-length encoding; it uses 1 to 4 bytes per symbol, with the length varying by symbol.

UTF-8's encoding rules are simple; there are only two:
1. For a single-byte symbol, the first bit is 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, UTF-8 is therefore identical to ASCII.
2. For an n-byte symbol (n > 1), the first n bits of the first byte are all 1, bit n+1 is 0, and the first two bits of every following byte are 10. The remaining, unmentioned bits hold the symbol's Unicode code point.

The table below summarizes the rules; the letter x marks the bits available for the code point.

    Unicode range (hex)   | UTF-8 encoding (binary)
    0000 0000 - 0000 007F | 0xxxxxxx
    0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx
    0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
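
The two rules above are complete enough to implement directly; this Python sketch encodes one code point by the table and checks it against the built-in encoder:

```python
def utf8_encode(cp: int) -> bytes:
    # Pick the width from the code-point range, then pack the bits:
    # the first byte starts with n ones and a zero, continuations with 10.
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

for ch in "A\u00e9\u4e2d\U0001d10c":  # 1-, 2-, 3-, and 4-byte cases
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```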

R write.csv with UTF-16 encoding

泪湿孤枕 submitted on 2019-11-30 13:15:45
Question: I'm having trouble outputting a data.frame using write.csv with UTF-16 character encoding. Background: I am trying to write out a CSV file from a data.frame for use in Excel. Excel Mac 2011 seems to dislike UTF-8 (if I specify UTF-8 during text import, non-ASCII characters show up as underscores). I've been led to believe that Excel will be happy with UTF-16LE encoding. Here's the example data.frame:

    > foo
      a  b
    1 á 羽
    > Encoding(levels(foo$a))
    [1] "UTF-8"
    > Encoding(levels(foo$b))
    [1] "UTF-8"
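
The usual fix is to emit UTF-16LE with a leading BOM, which Excel detects during import. The idea in Python terms (a sketch of the concept, not the R answer itself):

```python
import csv
import io

rows = [["a", "b"], ["á", "羽"]]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Encoding with "utf-16" prepends the BOM Excel uses to detect byte order.
data = buf.getvalue().encode("utf-16")
assert data[:2] in (b"\xff\xfe", b"\xfe\xff")
```

In R, the same effect is typically achieved via write.csv's fileEncoding argument or by opening a connection with file(..., encoding = "UTF-16LE") and writing the BOM first.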

Storing UTF-16/Unicode data in SQL Server

亡梦爱人 submitted on 2019-11-30 08:41:56
Question: According to this, SQL Server 2K5 uses UCS-2 internally. It can store UTF-16 data in UCS-2 (with appropriate data types, nchar etc.); however, if there is a supplementary character, it is stored as two UCS-2 characters. This brings obvious issues with the string functions, namely that what is one character is treated as two by SQL Server. I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some
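
The mismatch the question describes is easy to make concrete: a supplementary character is one code point but two UCS-2/UTF-16 code units, and the second number is what the question reports SQL Server's string functions counting. In Python:

```python
s = "\U0001d10c"  # a supplementary character (outside the BMP)

assert len(s) == 1                          # one code point
assert len(s.encode("utf-16le")) // 2 == 2  # but two 16-bit code units
assert s.encode("utf-16le") == b"\x34\xd8\x0c\xdd"  # surrogate pair D834 DD0C
```

SQL Server 2012 later added supplementary-character (_SC) collations under which the string functions count such a character as one.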