utf-16

Does the Unicode Consortium intend to make UTF-16 run out of characters? [closed]

Submitted by 拥有回忆 on 2019-12-10 14:49:14
Question: The current version of UTF-16 is only capable of encoding 1,112,064 different numbers (code points): 0x0-0x10FFFF. Does the Unicode Consortium intend to make UTF-16 run out of characters, i.e. make a code point > 0x10FFFF? If not, why would anyone write the code for a UTF-8 parser to be able to accept 5 or 6 byte sequences?
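For context, the 1,112,064 ceiling exists precisely because of UTF-16: 17 planes of 0x10000 code points each, minus the 2,048 code points reserved as UTF-16 surrogates. A minimal sketch of the arithmetic in Python (purely illustrative, not from the question):

```python
planes = 17            # U+0000 .. U+10FFFF covers 17 planes
per_plane = 0x10000    # 65,536 code points per plane
surrogates = 0x800     # U+D800 .. U+DFFF, reserved for UTF-16 surrogate pairs

print(planes * per_plane - surrogates)   # 1112064 encodable code points
```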

Any way to convert a regular string in ActionScript 3 to a ByteArray of Latin-1 Character Codes?

Submitted by 醉酒当歌 on 2019-12-10 13:12:15
Question: I am having no problem converting a string to a ByteArray of UTF-16 encoded characters, but the application I am trying to communicate with (written in Erlang) only understands Latin-1 encoding. Is there any way of producing a ByteArray full of Latin-1 character codes from a string within ActionScript 3?
Answer 1: byteArray.writeMultiByte(string, "iso-8859-1"); (see http://livedocs.adobe.com/flash/9.0/ActionScriptLangRefV3/flash/utils/ByteArray.html#writeMultiByte())
Source: https://stackoverflow.com
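For readers who just want to see what the Latin-1 conversion amounts to, here is a hedged sketch in Python rather than ActionScript (the sample string is mine): ISO-8859-1 emits one byte per character, whereas UTF-16 emits two-byte code units, which is why a Latin-1 peer sees garbage when handed UTF-16 bytes.

```python
s = "café"                      # illustrative string with one non-ASCII character
print(s.encode("iso-8859-1"))   # b'caf\xe9'                  - one byte per character (Latin-1)
print(s.encode("utf-16-le"))    # b'c\x00a\x00f\x00\xe9\x00'  - two bytes per character (UTF-16)
```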

URL encode ASCII/UTF16 characters

Submitted by 房东的猫 on 2019-12-10 11:26:14
Question: I'm trying to URL-encode some strings, but I am having problems with the methods provided by the .NET framework. For instance, I'm trying to encode strings that contain the 'â' character. According to w3schools, for instance, I would expect this character to be encoded as '%E2' (and a PHP system I must call expects this too...). I tried using these methods: System.Web.HttpUtility.UrlEncode("â"); System.Web.HttpUtility.UrlPathEncode("â"); Uri.EscapeUriString("â"); Uri.EscapeDataString("â"); However,
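The excerpt stops before the answer, but the root of the mismatch is which byte encoding is applied before percent-encoding: %E2 is 'â' as a single Latin-1 byte, while the .NET helpers percent-encode its UTF-8 bytes as %C3%A2. A hedged illustration of the two behaviours in Python (not the .NET API itself):

```python
from urllib.parse import quote

print(quote("â"))                         # '%C3%A2' - the UTF-8 bytes percent-encoded
print(quote("â", encoding="iso-8859-1"))  # '%E2'    - the single Latin-1 byte the PHP side expects
```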

adding backslash to fix character encoding in ruby string

Submitted by 心已入冬 on 2019-12-10 11:09:39
Question: I'm sure this is very easy but I'm getting tied in a knot with all these backslashes. I have some data that I'm scraping (politely) from a website. Occasionally a sentence comes to me looking something like this: u00a362 000? you must be joking. Which should of course be '£2 000? you must be joking'. A short test in irb deciphered it:
ruby-1.9.2-p180 :001 > string = "u00a3" => "u00a3"
ruby-1.9.2-p180 :002 > string = "\u00a3" => "£"
Of course: add a backslash and it will be decoded. I created
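The truncated question boils down to turning the literal characters u00a3 back into the code point they name. A hedged sketch of the same repair in Python (the regex and variable names are mine, and the pattern is deliberately naive):

```python
import re

raw = "u00a362 000? you must be joking"    # scraped text with the backslash stripped

# Replace each bare uXXXX sequence with the character whose code point it names.
fixed = re.sub(r"u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), raw)
print(fixed)    # the u00a3 prefix becomes '£'
```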

Using iconv to convert from UTF-16BE to UTF-8 without BOM

Submitted by ∥☆過路亽.° on 2019-12-10 02:15:52
Question: I'm trying to convert a UTF-16BE encoded file (byte order mark: 0xFE 0xFF) to UTF-8 using iconv, like so: iconv -f UTF-16BE -t UTF-8 myfile.txt The resulting output, however, has the UTF-8 byte order mark (0xEF 0xBB 0xBF), and that is not what I need. Is there a way to tell iconv (or is there an equivalent encoding) not to put a BOM in the UTF-8 result?
Answer 1: Experiment shows that specifying UTF-16 rather than UTF-16BE does what you want: iconv -f UTF-16 -t UTF-8 myfile.txt
Source: https:/
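If iconv's behaviour ever varies, the same conversion can be done in a couple of lines of Python as a hedged fallback (the output file name is mine): the 'utf-16' codec uses the BOM to pick the byte order and then discards it, and encoding back to UTF-8 never adds one.

```python
with open("myfile.txt", "rb") as f:
    text = f.read().decode("utf-16")    # the BOM selects the byte order, then is dropped
with open("myfile-utf8.txt", "wb") as f:
    f.write(text.encode("utf-8"))       # plain UTF-8, no BOM
```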

The relationship between Unicode, UTF-8, and UTF-16

Submitted by 放肆的年华 on 2019-12-09 18:08:45
1. Why Unicode is needed
A long time ago the computer world had only ASCII; later a handful of control characters, punctuation marks and so on were added. Today a single document can contain many languages, for example: English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ, and characters from still more languages may appear in the future; computers need to be able to display all of them. A single character set that covers the characters of every language is therefore necessary, and that is why Unicode was created.
2. A brief introduction to Unicode
Unicode is a character set that contains the characters of every language in the world. It assigns each character a unique number, officially called a code point. One big advantage of Unicode is that its first 256 code points are identical to ISO-8859-1 (and therefore to ASCII), so most commonly used Western characters can be represented in one or two bytes.
3. Why encodings such as UTF-8 or UTF-16 are needed
Although Unicode covers every character, code points themselves are abstract numbers; what is actually stored and transmitted are bytes, so we need rules that map the characters we read and write to concrete byte sequences. That process is called encoding. UTF-8, GBK, UTF-16 and the other common encodings are different sets of such rules for turning the text we see into bytes and back.
4. The difference between UTF-8 and UTF-16
(1) Comparison in terms of memory: UTF-8:
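Since the excerpt breaks off exactly at the memory comparison, here is a small hedged illustration (Python, with sample characters of my choosing) of the trade-off it was about to make: UTF-8 spends 1 byte on ASCII but usually 3 bytes on a Chinese character, while UTF-16 spends 2 bytes on both; characters outside the BMP cost 4 bytes in either encoding.

```python
for ch in ["A", "汉", "€", "😀"]:
    print(ch,
          "UTF-8:",  len(ch.encode("utf-8")),     "bytes;",
          "UTF-16:", len(ch.encode("utf-16-le")), "bytes")
# A: 1 vs 2, 汉: 3 vs 2, €: 3 vs 2, 😀: 4 vs 4
```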

How does Microsoft handle the fact that UTF-16 is a variable length encoding in their C++ standard library implementation

Submitted by 扶醉桌前 on 2019-12-09 15:32:40
Question: Having a variable-length encoding is indirectly forbidden in the standard. So I have several questions: how is the following part of the standard handled?
17.3.2.1.3.3 Wide-character sequences: A wide-character sequence is an array object (8.3.4) A that can be declared as T A[N], where T is type wchar_t (3.9.1), optionally qualified by any combination of const or volatile. The initial elements of the array have defined contents up to and including an element determined by some predicate. A
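The practical point behind the question is that a Windows wchar_t is 16 bits, so a single character outside the Basic Multilingual Plane occupies two wchar_t elements (a surrogate pair), and anything that counts array elements is counting UTF-16 code units rather than characters. A hedged illustration of that effect in Python (not MSVC's actual implementation):

```python
ch = "\U0001F600"                             # one character outside the BMP
units = len(ch.encode("utf-16-le")) // 2      # number of 16-bit code units
print(units)                                  # 2 -> one character, two code units
```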

C++11: string to wstring conversion (UTF-8 / UTF-16 / UTF-32)

Submitted by 落爺英雄遲暮 on 2019-12-09 13:56:29
#include <locale>
#include <codecvt>
#include <string>

// std::wstring_convert and std::codecvt_utf8_utf16 are deprecated since C++17;
// this pragma silences the corresponding MSVC warning (C4996).
#pragma warning(disable:4996)

// UTF-8 string to wstring (UTF-16 on Windows)
std::wstring utf8_to_wstring(const std::string& str) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> strCnv;
    return strCnv.from_bytes(str);
}

// wstring to UTF-8 string
std::string wstring_to_utf8(const std::wstring& str) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> strCnv;
    return strCnv.to_bytes(str);
}

// wstring to string (same UTF-8 conversion as above)
std::string wstring_to_string(const std::wstring& str) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> strCnv;
    return strCnv.to_bytes(str);
}

The difference between UTF-8 and UTF-8 with BOM

Submitted by 房东的猫 on 2019-12-09 11:01:12
One has the marker, the other does not. BOM stands for Byte Order Mark; it exists because transmitted data can use either of two byte orders, big-endian or little-endian. For compatibility reasons, UTF-8 with a BOM is displayed as garbled text in some browsers.
Some information found online about the Byte Order Mark: the UCS encoding includes a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF, while FFFE is not a valid UCS character and so should never appear in an actual stream. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream. If the receiver then sees FEFF, the stream is big-endian; if it sees FFFE, the stream is little-endian. This is why "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so if a receiver sees a byte stream that starts with EF BB BF, it knows the stream is UTF-8. Windows uses the BOM to mark the encoding of its text files.
PHP does not recognize UTF-8 with a BOM; it simply outputs the EF BB BF bytes, which show up as blank space on a page declared with charset="utf-8".
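The byte values quoted above are easy to verify; here is a short hedged sketch in Python showing how U+FEFF appears in each encoding, and one way a UTF-8 BOM is typically stripped (the sample byte string is mine):

```python
bom = "\ufeff"                    # ZERO WIDTH NO-BREAK SPACE, a.k.a. the BOM
print(bom.encode("utf-16-be"))    # b'\xfe\xff'     - big-endian stream
print(bom.encode("utf-16-le"))    # b'\xff\xfe'     - little-endian stream
print(bom.encode("utf-8"))        # b'\xef\xbb\xbf'

# The 'utf-8-sig' codec drops a leading UTF-8 BOM if one is present.
print(b"\xef\xbb\xbfhello".decode("utf-8-sig"))   # 'hello'
```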

Should I change from UTF-8 to UTF-16 to accommodate Chinese characters in my HTML?

Submitted by 爷，独闯天下 on 2019-12-09 09:46:05
Question: I am using ASP.NET MVC, MS SQL and IIS. I have a few users that have used Chinese characters in their profile info. However, when I display this information it shows up as æŽå¼·è¯, even though the values are correct in my database. Currently the charset for my HTML pages is set to UTF-8. Should I change it to UTF-16? I understand there are a few problems that can come from this, but what are my choices? Thank you, Aaron
Answer 1: UTF-8 and UTF-16 encode exactly the same set of characters. It's not that UTF-8 doesn't
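The æŽå¼·è¯ pattern is the classic signature of UTF-8 bytes being decoded as a one-byte encoding (Latin-1 / Windows-1252) somewhere between the database and the page, so switching the page to UTF-16 will not help. A hedged reproduction of the effect in Python (the Chinese sample text is mine, not the user's actual data):

```python
name = "李強"                                       # illustrative Chinese text
garbled = name.encode("utf-8").decode("latin-1")    # bytes are UTF-8, but read back as Latin-1
print(garbled)   # 'æ..å¼·' plus invisible control characters - the same kind of damage as in the question
```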