utf-16 | 易学教程

Deprecated header <codecvt> replacement

Question: A bit of background: my task required converting a UTF-8 XML file to UTF-16 (with a proper header, of course). So I searched for the usual ways of converting UTF-8 to UTF-16, and found that one should use the templates from <codecvt>. But now that it is deprecated, I wonder what the new common way of doing the same task is. (I don't mind using Boost at all, but other than that I prefer to stay as close to the standard library as possible.)
Answer 1: std::codecvt template from <locale> itself isn't

JavaScript strings outside of the BMP

Question: (BMP being the Basic Multilingual Plane.) According to JavaScript: The Good Parts: "JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide." This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF. Further investigation confirms this: > String.fromCharCode(0x20001); The fromCharCode method seems to use only the lowest 16 bits when returning the Unicode character. Trying to get U
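The truncation described in the question can be imitated outside JavaScript. The following is a minimal Python sketch, assuming that fromCharCode keeps only the low 16 bits of each argument and treats the results as UTF-16 code units; `from_char_code` is a hypothetical helper written for illustration, not a JavaScript or Python API.

```python
import struct

def from_char_code(*values):
    """Imitate String.fromCharCode: keep the low 16 bits of each value,
    then interpret the results as a sequence of UTF-16 code units."""
    units = [v & 0xFFFF for v in values]
    raw = struct.pack(f">{len(units)}H", *units)
    # surrogatepass lets an unpaired surrogate through instead of raising
    return raw.decode("utf-16-be", errors="surrogatepass")

# 0x20001 is cut down to 0x0001, so the result is U+0001, not U+20001.
truncated = from_char_code(0x20001)

# Passing the two surrogate code units explicitly does yield U+20001.
paired = from_char_code(0xD840, 0xDC01)
```

Under these assumptions, a character outside the BMP is only reachable by supplying both halves of its surrogate pair, which is consistent with what the question observes.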

UTF-8, UTF-16, and UTF-32

Question: What are the differences between UTF-8, UTF-16, and UTF-32? I understand that they all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?
Answer 1: UTF-8 has an advantage when ASCII characters make up the majority of a block of text, because UTF-8 encodes those characters in 8 bits each (like ASCII). It is also advantageous in that a UTF-8 file containing only ASCII characters
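To make the size difference concrete, here is a small Python sketch that counts the bytes each encoding needs for a given string. `encoded_sizes` is a hypothetical helper; the little-endian codec variants are used only so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Return the number of bytes needed for text in each encoding (no BOM)."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}

ascii_letter = encoded_sizes("A")            # U+0041, ASCII
euro_sign = encoded_sizes("\u20ac")          # U+20AC, inside the BMP
emoji = encoded_sizes("\U0001F600")          # U+1F600, outside the BMP
```

An ASCII letter takes 1, 2, and 4 bytes respectively; the euro sign takes 3, 2, and 4; a character outside the BMP takes 4 bytes in all three.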

Can I make git recognize a UTF-16 file as text?

Question: I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after I made a change, git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16. Can git be taught to recognize that this file is text and handle it appropriately? I'm using git under Cygwin, with core.autocrlf set to false. I could use mSysGit or git under UNIX, if necessary.
Answer 1: I've been struggling with this problem for a while, and just discovered what is (for me) a perfect solution:
    $ git config --global diff.tool vimdiff # or merge.tool to get merging too!
    $ git difftool commit1

What is Java's internal representation for String? Modified UTF-8? UTF-16?

Question: I searched for Java's internal representation of String, but I found two sources that look reliable yet are inconsistent. One is http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451 and it says: "Java uses UTF-16 for the internal text representation and supports a non-standard modification of UTF-8 for string serialization." The other is http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 and it says: Tcl also uses the same modified UTF-8[25] as Java for internal
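For readers unfamiliar with the "modified UTF-8" that both sources mention, the following Python sketch illustrates the two ways it is generally described as differing from standard UTF-8: U+0000 is written as the two bytes C0 80, and characters above U+FFFF are written as two 3-byte sequences, one per UTF-16 surrogate. `modified_utf8` is a hypothetical helper for illustration only; it is not Java's implementation and does not settle which representation Java uses internally.

```python
def modified_utf8(text):
    """Encode text using the two modified-UTF-8 rules described above."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL gets an overlong 2-byte form, so no zero byte appears in the output
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into UTF-16 surrogates and encode each one as 3 bytes
            v = cp - 0x10000
            for unit in (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)

standard = "\U00020001".encode("utf-8")   # 4 bytes in standard UTF-8
modified = modified_utf8("\U00020001")    # 6 bytes under the rules above
```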

What is Unicode, UTF-8, UTF-16?

Question: What's the basis for Unicode, and why is there a need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it's not clear to me. In VSS, when doing a file comparison, there is sometimes a message saying the two files have differing UTFs. Why would this be the case? Please explain in simple terms.
Answer 1: Why do we need Unicode? In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters,

What is a “surrogate pair” in Java?

Question: I was reading the documentation for StringBuffer, in particular the reverse() method. That documentation mentions something about surrogate pairs. What is a surrogate pair in this context? And what are low and high surrogates?
Answer 1: The term "surrogate pair" refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme. In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF. Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can
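The arithmetic behind a surrogate pair can be sketched in a few lines. This is a Python illustration of the standard UTF-16 mapping rather than Java code; `to_surrogate_pair` and `from_surrogate_pair` are hypothetical helper names.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    v = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (v >> 10)          # top 10 bits, range D800..DBFF
    low = 0xDC00 + (v & 0x3FF)         # bottom 10 bits, range DC00..DFFF
    return high, low

def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = to_surrogate_pair(0x1F600)      # (0xD83D, 0xDE00)
```

The high surrogate carries the upper ten bits and the low surrogate the lower ten, which is why code that reverses a string one 16-bit unit at a time has to treat such pairs as a single unit.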