utf-16

UTF-16 safe substring in C# .NET

Anonymous (unverified), submitted 2019-12-03 01:33:01
Question: I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string in the middle of a Unicode character. E.g. see the following code: here substr is an invalid string, since the smiley character is cut in half. Instead I want a function that behaves as follows: where substr For reference, here is how I would do it in Objective-C, using rangeOfComposedCharacterSequencesForRange. What is the equivalent code in C#?

Answer 1: This should return the maximal substring starting at index startIndex and with length up to
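The question's code samples are elided in this copy. As a sketch of the idea only, written in Java (the language of the code elsewhere on this page) rather than C#, and with a hypothetical helper name safeSubstring: back the cut point up by one code unit whenever it would land between the two halves of a surrogate pair.

    public final class SafeTruncate {
        // Return at most maxLength UTF-16 code units without splitting a surrogate pair.
        static String safeSubstring(String s, int maxLength) {
            if (s.length() <= maxLength) {
                return s;
            }
            int end = maxLength;
            // If the cut would land on a low surrogate, the preceding char is its
            // high surrogate; move the cut back so the pair stays intact.
            if (Character.isLowSurrogate(s.charAt(end)) && Character.isHighSurrogate(s.charAt(end - 1))) {
                end--;
            }
            return s.substring(0, end);
        }

        public static void main(String[] args) {
            String smiley = "ab\uD83D\uDE00"; // "ab" followed by U+1F600, stored as a surrogate pair
            System.out.println(safeSubstring(smiley, 3)); // prints "ab", not "ab" plus a lone high surrogate
        }
    }

The C# version has the same shape, using char.IsLowSurrogate and char.IsHighSurrogate.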

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

六月ゝ 毕业季﹏ submitted 2019-12-03 01:15:48
Question: I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .Net and C/C++, and I need this answer for both Unix and Windows)?

Answer 1: Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How default is the default encoding (UTF-8) in the XML Declaration?

Anonymous (unverified), submitted 2019-12-03 01:07:01
Question: I know that the default encoding of XML is UTF-8. All XML consumers MUST, and so on and so forth. So this is not just a question of whether or not XML has a default encoding. I also know that the XML declaration at the beginning of the document is itself optional, and that specifying the encoding therein is optional as well. So I ask myself whether the following two XML declarations are two expressions of the exact same thing: From my own current understanding I would say they are equivalent, but I do not know. Has the equivalence of
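The two declarations being compared are elided in this copy; assuming they are <?xml version="1.0" encoding="UTF-8"?> and <?xml version="1.0"?> (my assumption), a small Java StAX sketch shows what a parser reports for each:

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamReader;
    import java.io.StringReader;

    public class XmlDeclDemo {
        public static void main(String[] args) throws Exception {
            // The two declarations presumably being contrasted (my assumption):
            String withEncoding    = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>";
            String withoutEncoding = "<?xml version=\"1.0\"?><root/>";

            XMLInputFactory factory = XMLInputFactory.newInstance();
            for (String xml : new String[] { withEncoding, withoutEncoding }) {
                XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
                // getCharacterEncodingScheme() reports only what the declaration says;
                // null means the document relies on the default UTF-8/UTF-16 detection rules.
                System.out.println("declared encoding = " + reader.getCharacterEncodingScheme());
                reader.close();
            }
        }
    }

Whether the two forms are equivalent for every conforming consumer is exactly what the question is asking; the sketch only makes the difference observable.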

The differences between Unicode, UTF-8, UTF-16, ASCII, GBK, and GB2312

Anonymous (unverified), submitted 2019-12-03 00:39:02
A long, long time ago, a group of people decided to use 8 transistors that could be switched on and off, combined into different states, to represent everything in the world. They saw that 8 switch states were good, so they called this a "byte". Later they built machines that could process these bytes; once the machines were running, bytes could be combined into many states, and the states kept changing. They saw that this was good, so they called these machines "computers".

At first, computers were only used in the United States. An 8-bit byte can form 256 (2 to the 8th power) different states. They assigned special purposes to the 32 states numbered from 0 upward: whenever a terminal or printer received one of these agreed-upon bytes, it had to perform an agreed-upon action. On 0x0A the terminal starts a new line; on 0x07 the terminal beeps at people; on 0x1B the printer prints inverted text, or the terminal displays letters in color. They saw that this was good, so they called the byte states below 0x20 "control codes". They then represented all the spaces, punctuation marks, digits, and upper- and lower-case letters with consecutive byte states, numbering them up to 127, so that computers could store English text using different bytes. Everyone saw this and felt it was good, so everyone called this scheme ANSI's "ASCII" encoding (American Standard Code for Information Interchange). At that time, every computer in the world used the same ASCII scheme to store English text. Later, just like the building of the Tower of Babel

Which encoding does Java use: UTF-8 or UTF-16?

ⅰ亾dé卋堺 submitted 2019-12-03 00:37:45
I've already read the following posts: "What is Java's internal representation for String? Modified UTF-8? UTF-16?" and https://docs.oracle.com/javase/8/docs/api/java/lang/String.html. Now consider the code given below:

    public static void main(String[] args) {
        printCharacterDetails("最");
    }

    public static void printCharacterDetails(String character) {
        // Code point of the first character, printed in hex
        System.out.println("Unicode Value for " + character + "="
                + Integer.toHexString(character.codePointAt(0)));
        // getBytes() with no argument uses the platform default charset
        byte[] bytes = character.getBytes();
        System.out.println("The UTF-8 Character=" + character
                + " | Default: Number of Bytes=" + bytes.length);
        String
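The snippet above is cut off in this copy. As a small companion sketch (mine, not from the original post), the same character can be measured with explicit charsets, which separates the platform-default result of getBytes() from the UTF-16 code units a Java String exposes:

    import java.nio.charset.StandardCharsets;

    public class EncodingProbe {
        public static void main(String[] args) {
            String character = "最"; // U+6700, a BMP character
            // At the API level, a Java String is a sequence of UTF-16 code units:
            System.out.println("UTF-16 code units: " + character.length());                              // 1
            // Explicit charsets avoid depending on the platform default used by getBytes():
            System.out.println("UTF-8 bytes:  " + character.getBytes(StandardCharsets.UTF_8).length);    // 3
            System.out.println("UTF-16 bytes: " + character.getBytes(StandardCharsets.UTF_16BE).length); // 2 (BE variant, no BOM)
        }
    }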

Unicode, UTF-32, UTF-16, UTF-8

Anonymous (unverified), submitted 2019-12-03 00:34:01
What we usually call Unicode is a character set in which every character has a corresponding unique hexadecimal value. The Unicode character set contains all the characters in the world, so it is fairly large and is therefore divided into 17 planes. The first of the 17 planes is the basic plane (BMP); the remaining 16 are supplementary planes (SMP). Characters in the basic plane correspond to hexadecimal values in the range 0x0000~0xFFFF, while characters in the supplementary planes correspond to values in the range 0x010000~0x10FFFF. (See the full Unicode character set: link 1, link 2.)

UTF-32: UTF-32 is essentially a re-encoding scheme that is attached to the Unicode character set. Taking the Unicode character set as its reference, it recomputes the hexadecimal value of each character to obtain a new value. If we UTF-32-encoded every character in the Unicode character set, the resulting values taken together could be said to form a UTF-32 character set. UTF-32's encoding rule is simply to store the hexadecimal value of each Unicode character in 4 bytes. For example:

0x0000 => 0x00000000
0x1EC0 => 0x00001EC0
0xFFFF => 0x0000FFFF
0x010000 => 0x00010000
0x10FFFF => 0x0010FFFF

Note: the hexadecimal value in the Unicode character set does not by itself make it clear how many bytes a character needs, but after UTF-32 encoding
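As a rough illustration of the 4-bytes-per-character rule described above, a Java sketch (assuming the JDK's built-in "UTF-32BE" charset) can dump the UTF-32 bytes of a BMP character and of U+10000, matching two of the table rows:

    import java.nio.charset.Charset;

    public class Utf32Demo {
        public static void main(String[] args) {
            Charset utf32 = Charset.forName("UTF-32BE"); // big-endian, no BOM
            printHex("\u1EC0", utf32);       // U+1EC0 (BMP)            -> 00 00 1E C0
            printHex("\uD800\uDC00", utf32); // U+10000 (surrogate pair) -> 00 01 00 00
        }

        static void printHex(String s, Charset cs) {
            StringBuilder sb = new StringBuilder();
            for (byte b : s.getBytes(cs)) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(sb.toString().trim());
        }
    }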

What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?

China☆狼群 submitted 2019-12-03 00:03:42
It seems that SQL Server uses Unicode UCS-2, a 2-byte fixed-length character encoding, for nchar/nvarchar fields. Meanwhile, C# uses the Unicode UTF-16 encoding for its strings. (Note: some people don't consider UCS-2 to be Unicode, but it encodes all the same code points as UTF-16 in the Unicode subset 0-0xFFFF, and as far as SQL Server is concerned, that's the closest thing to "Unicode" it natively supports in terms of character strings.) While UCS-2 encodes the same basic code points as UTF-16 in the Basic Multilingual Plane (BMP), it doesn't reserve certain bit patterns that UTF-16 does to
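To make the practical consequence concrete, here is a Java sketch (an analogy only, not SQL Server code): a supplementary character occupies two UTF-16 code units, which is roughly what a UCS-2-style counter counts, while the code-point count is what a user would call the number of characters.

    public class SurrogateCount {
        public static void main(String[] args) {
            // "I " + 😀 (U+1F600, a surrogate pair) + " Unicode"
            String s = "I " + new String(Character.toChars(0x1F600)) + " Unicode";
            // UTF-16 code units -- what a UCS-2-style length function effectively counts:
            System.out.println("code units:  " + s.length());                        // 12
            // Actual Unicode code points -- what a user would call "characters":
            System.out.println("code points: " + s.codePointCount(0, s.length()));   // 11
        }
    }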

What version of Unicode is supported by which .NET platform and on which version of Windows in regards to character classes?

前提是你 submitted 2019-12-02 23:43:40
Updated question ¹

With regards to character classes, comparison, sorting, normalization and collations, what Unicode version or versions are supported by which .NET platforms?

Original question

I remember somewhat vaguely having read that .NET supported Unicode version 3.0 and that the internal UTF-16 encoding is not really UTF-16 but actually uses UCS-2, which is not the same. It seems, for instance, that characters above U+FFFF are not possible, i.e. consider:

    string s = "\u1D7D9"; // ("Mathematical double-struck digit one")

and it stores the string "ᵽ9". I'm basically looking for
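The snippet above is C#, but the same pitfall reproduces in Java, where \u escapes likewise take exactly four hex digits, so the literal is read as U+1D7D followed by the character '9'. A sketch of the pitfall and two ways around it:

    public class EscapePitfall {
        public static void main(String[] args) {
            String wrong = "\u1D7D9";                               // parsed as U+1D7D ('ᵽ') followed by '9'
            String right = new String(Character.toChars(0x1D7D9));  // MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
            String alsoRight = "\uD835\uDFD9";                      // the same character as an explicit surrogate pair

            System.out.println(wrong.codePointCount(0, wrong.length()));  // 2
            System.out.println(right.codePointCount(0, right.length()));  // 1
            System.out.println(right.equals(alsoRight));                  // true
        }
    }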

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

偶尔善良 submitted 2019-12-02 14:30:26
I've managed to mostly ignore all this multi-byte character stuff, but now I need to do some UI work and I know my ignorance in this area is going to catch up with me! Can anyone explain in a few paragraphs or less just what I need to know so that I can localize my applications? What types should I be using (I use both .Net and C/C++, and I need this answer for both Unix and Windows).

Check out Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) EDIT 20140523: Also, watch Characters, Symbols and the

Encoding operations in JS

五迷三道 submitted 2019-12-02 11:49:11
Common encodings in web page design: Unicode's UCS-2, UCS-4, UTF-8, UTF-16 and UTF-32, as well as ASCII and ANSI. You should know that when JS was first standardized, the encoding it used was UCS-2 (because UTF-16 did not exist yet; in essence, UTF-16 is an extension of UCS-2: the first 65536 characters are UCS-2 itself, and the characters after that, up to number 10FFFF, are the character set that UTF-16 added). The later ES6 standard added support for UTF-16.

A side note: there is no separate UCS encoding anymore. The organization behind UCS was ISO (the International Organization for Standardization), while Unicode is an organization formed by a number of multilingual software companies; in the end they agreed to merge their respective encoding rules so that the world's character-encoding rules would converge. The name of the final project to unify the world's encodings was Unicode. UCS-2 was the old solution for uniformly encoding 65536 characters, while UCS-4 was the solution for characters numbered up to 10FFFF; UTF-32 is simply UCS-4 (always storing characters in 4 bytes) under a new name. UTF-8 and UTF-16 each have their own conversion rules, which I won't go into here. As for GBK, it maps an encoding number to each Chinese character (a one-to-one lookup table, so the corresponding character cannot be obtained by calculation; it is a regional-language encoding rule used only in mainland China), and no corresponding conversion rule is available.

Let me also explain what a UTF-16 surrogate pair means (a sketch of the calculation follows below). In UTF-16, in order to represent the characters of the remaining 16 planes (UTF-16's range goes up to 10FFFF, for a total of 17 planes), the code points D800~DFFF in the basic plane are reserved as surrogate
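The surrogate-pair mechanism mentioned at the end boils down to a small calculation. Here is a Java sketch of it (mine; the same arithmetic applies to JS strings, which are also sequences of UTF-16 code units):

    public class SurrogateMath {
        // Map a supplementary code point (U+10000..U+10FFFF) to its UTF-16 surrogate pair.
        static char[] toSurrogatePair(int codePoint) {
            int offset = codePoint - 0x10000;               // 20 bits remain
            char high = (char) (0xD800 + (offset >>> 10));  // top 10 bits -> D800..DBFF
            char low  = (char) (0xDC00 + (offset & 0x3FF)); // low 10 bits -> DC00..DFFF
            return new char[] { high, low };
        }

        public static void main(String[] args) {
            char[] pair = toSurrogatePair(0x1F600); // 😀
            System.out.printf("U+1F600 -> %04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE00
            // Sanity check against the library implementation:
            System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x1F600))); // true
        }
    }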