utf-16

What character encoding is best for multinational companies?

Submitted by 元气小坏坏 on 2019-12-03 15:50:49
Question: If you had a website that was to be translated into every language in the world, and therefore had a database with all these translations, what character encoding would be best? UTF-128? If so, do all browsers understand the chosen encoding? Is character encoding straightforward to implement, or are there hidden factors? Thanks in advance.

Answer 1: If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this…

Are UTF-16 characters (as used, for example, by wide WinAPI functions) always 2 bytes long?

Submitted by 爱⌒轻易说出口 on 2019-12-03 15:14:47
Please clarify for me how UTF-16 works. I am a little confused, considering these points: There is a static type in C++, WCHAR, which is 2 bytes long (always 2 bytes long, obviously). Most of MSDN and some other documentation seem to assume that the characters are always 2 bytes long. This may just be my imagination; I can't come up with any particular examples, but it just seems that way. There are no "extra wide" functions or character types widely used in C++ or Windows, so I would assume that UTF-16 is all that is ever needed. To my uncertain knowledge, Unicode has a lot…
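For a concrete illustration (a small sketch, not part of the original question): a code point outside the Basic Multilingual Plane occupies two 16-bit code units in UTF-16, so on Windows, where wchar_t is 16 bits, it takes two WCHARs:

    #include <cstdio>
    #include <cwchar>

    int main() {
        // U+1F600 lies outside the BMP, so in UTF-16 it is stored as the
        // surrogate pair 0xD83D 0xDE00 -- two 16-bit code units.
        const wchar_t s[] = L"\U0001F600";
        // Prints 2 on Windows (16-bit wchar_t, UTF-16); prints 1 on platforms
        // where wchar_t is 32 bits (UTF-32), e.g. most Linux systems.
        std::printf("code units: %zu\n", std::wcslen(s));
        return 0;
    }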

Unicode string normalization in C/C++

Submitted by 試著忘記壹切 on 2019-12-03 11:42:06
I am wondering how to normalize strings (containing UTF-8/UTF-16) in C/C++. In .NET there is a function, String.Normalize. I used UTF8-CPP in the past, but it does not provide such a function. ICU and Qt provide string normalization, but I prefer lightweight solutions. Is there any "lightweight" solution for this?

Avi: As I wrote in another question, utf8proc is a very nice, lightweight library for basic Unicode functionality, including Unicode string normalization. For Windows, there is the NormalizeString() function (unfortunately for Vista and later only, as far as I can see on MSDN): http://msdn…
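As a rough sketch of how the utf8proc route might look (assuming its utf8proc_NFC convenience function, which returns a newly allocated NFC-normalized copy of a UTF-8 string):

    #include <stdio.h>
    #include <stdlib.h>
    #include <utf8proc.h>

    int main(void) {
        /* "é" in decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT */
        const utf8proc_uint8_t decomposed[] = { 'e', 0xCC, 0x81, 0 };

        /* Returns a freshly malloc'd, NUL-terminated NFC string (NULL on error). */
        utf8proc_uint8_t *composed = utf8proc_NFC(decomposed);
        if (composed != NULL) {
            printf("%s\n", (const char *)composed); /* precomposed U+00E9 */
            free(composed);
        }
        return 0;
    }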

How to read a UTF-16 text file into a string in Go?

Submitted by 假如想象 on 2019-12-03 11:32:51
Question: I can read the file into a byte array, but when I convert it to a string it treats the UTF-16 bytes as ASCII. How do I convert it correctly?

    package main

    import (
        "fmt"
        "os"
        "bufio"
    )

    func main() {
        // read the whole file
        f, err := os.Open("test.txt")
        if err != nil {
            fmt.Printf("error opening file: %v\n", err)
            os.Exit(1)
        }
        r := bufio.NewReader(f)
        var s, b, e = r.ReadLine()
        if e == nil {
            fmt.Println(b)
            fmt.Println(s)
            fmt.Println(string(s))
        }
    }

Output:

    false
    [255 254 91 0 83 0 99 0 114 0 105 0 112 0 116 0 32 0 73…
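One common way to handle this (a sketch, not taken from the original post) is the golang.org/x/text module, which can wrap the file in a transforming reader that decodes UTF-16, honouring the BOM, into UTF-8:

    package main

    import (
        "bufio"
        "fmt"
        "os"

        "golang.org/x/text/encoding/unicode"
        "golang.org/x/text/transform"
    )

    func main() {
        f, err := os.Open("test.txt")
        if err != nil {
            fmt.Printf("error opening file: %v\n", err)
            os.Exit(1)
        }
        defer f.Close()

        // UTF-16 decoder: the BOM, if present, selects the byte order;
        // LittleEndian is only the fallback when no BOM is found.
        dec := unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder()

        scanner := bufio.NewScanner(transform.NewReader(f, dec))
        for scanner.Scan() {
            fmt.Println(scanner.Text()) // each line is now plain UTF-8
        }
    }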

Which encoding does Java use, UTF-8 or UTF-16?

Submitted by 故事扮演 on 2019-12-03 10:59:00
Question: I've already read the following posts:

What is the Java's internal representation for String? Modified UTF-8? UTF-16?
https://docs.oracle.com/javase/8/docs/api/java/lang/String.html

Now consider the code given below:

    public static void main(String[] args) {
        printCharacterDetails("最");
    }

    public static void printCharacterDetails(String character) {
        System.out.println("Unicode Value for " + character + "=" + Integer.toHexString(character.codePointAt(0)));
        byte[] bytes = character.getBytes();
        System.out…
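To see where each encoding actually shows up (a small sketch, not part of the question above): the no-argument String.getBytes() uses the platform's default charset, while a String's char values are UTF-16 code units; asking for specific charsets makes the difference visible:

    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            String s = "最"; // U+6700, a BMP code point: one char, one UTF-16 code unit
            System.out.println("chars:        " + s.length());                                  // 1
            System.out.println("UTF-8 bytes:  " + s.getBytes(StandardCharsets.UTF_8).length);   // 3
            System.out.println("UTF-16 bytes: " + s.getBytes(StandardCharsets.UTF_16).length);  // 4 (2 + BOM)
            // getBytes() with no argument depends on the platform default charset,
            // so its result can differ from machine to machine.
            System.out.println("default:      " + s.getBytes().length);
        }
    }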

What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?

Submitted by て烟熏妆下的殇ゞ on 2019-12-03 09:44:59
Question: It seems that SQL Server uses Unicode UCS-2, a 2-byte fixed-length character encoding, for nchar/nvarchar fields. Meanwhile, C# uses the Unicode UTF-16 encoding for its strings. (Note: some people don't consider UCS-2 to be Unicode, but it encodes all the same code points as UTF-16 in the Unicode subset 0–0xFFFF, and as far as SQL Server is concerned, that's the closest thing to "Unicode" it natively supports in terms of character strings.) While UCS-2 encodes the same basic code points as UTF-16…

Python - Decode UTF-16 file with BOM

Submitted by 二次信任 on 2019-12-03 09:34:43
Question: I have a UTF-16 LE file with a BOM. I'd like to convert this file to UTF-8 without a BOM so I can parse it using Python. The usual code that I use didn't do the trick; it returned unknown characters instead of the actual file contents.

    f = open('dbo.chrRaces.Table.sql').read()
    f = str(f).decode('utf-16le', errors='ignore').encode('utf8')
    print f

What would be the proper way to decode this file so I can parse through it with f.readlines()?

Answer 1: Firstly, you should read in binary mode, otherwise…
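A minimal sketch of the usual fix (not the accepted answer verbatim, and the output filename is just illustrative): let the codec layer handle the BOM by opening the file with encoding='utf-16', then re-encode as UTF-8:

    import io

    # 'utf-16' (without an explicit LE/BE suffix) consumes the BOM and picks
    # the byte order from it, so the BOM never leaks into the decoded text.
    with io.open('dbo.chrRaces.Table.sql', 'r', encoding='utf-16') as src:
        text = src.read()

    # Write back out as UTF-8 without a BOM.
    with io.open('dbo.chrRaces.Table.utf8.sql', 'w', encoding='utf-8') as dst:
        dst.write(text)

    # Or simply iterate line by line, already decoded:
    with io.open('dbo.chrRaces.Table.sql', 'r', encoding='utf-16') as src:
        for line in src:
            print(line.rstrip())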

Getting to the Bottom of Character Encodings, Part 14: How exactly does UTF-16 encode? (The "Surrogate Zone", covering 0xD800–0xDFFF (decimal 55296–57343), leaves 2048 code points undefined; UTF-8 and UTF-32 do not have this issue.)

Submitted by [亡魂溺海] on 2019-12-03 08:56:13
1. The first thing to note is that surrogates are a mechanism specific to the UTF-16 encoding form; UTF-8 and UTF-32 do not use surrogates. As mentioned earlier, the UTF-16 encoding form was extended so that it could also encode the code points of the supplementary planes that follow the Basic Multilingual Plane (BMP). The concrete extension is the surrogate mechanism: a supplementary-plane code point is represented by two 16-bit code units whose values correspond to BMP code points (namely, code points in the BMP's surrogate zone); these two special 16-bit code units that jointly represent one supplementary-plane code point are called a "surrogate pair". Summed up in a single sentence: any code point value greater than 0xFFFF (i.e., a supplementary-plane code point, in the range 0x10000–0x10FFFF, or 65536–1114111 in decimal; note that 0xFFFF is the hexadecimal notation of the largest 16-bit binary value) must be represented with the surrogate mechanism, i.e., as a surrogate pair, when encoded in UTF-16.

2. In the UTF-16 encoding form, either of the two 16-bit code units that together are called a "surrogate pair" would, taken by itself, correspond directly to some code point in the BMP (every code point value in the BMP necessarily corresponds to a 16-bit code-unit value, because the BMP contains 2^16 = 65536 code points, and a 16-bit code unit can likewise represent 2^16 = 65536 distinct values). This creates a conflict: is a given UTF-16 code unit an ordinary code unit representing a BMP character, or a surrogate code unit belonging to a surrogate pair that represents a supplementary-plane character? Therefore, to avoid this conflict, the code points that are used as "surrogate…
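A worked example of the encoding rule described above (an illustrative sketch in Python, not part of the original article): subtract 0x10000 from the supplementary-plane code point, put the high 10 bits of the resulting 20-bit value into the lead surrogate (0xD800–0xDBFF) and the low 10 bits into the trail surrogate (0xDC00–0xDFFF):

    def to_surrogate_pair(code_point):
        """Encode one supplementary-plane code point (U+10000..U+10FFFF)
        as a UTF-16 surrogate pair (lead, trail)."""
        assert 0x10000 <= code_point <= 0x10FFFF
        v = code_point - 0x10000           # 20-bit value, 0x00000..0xFFFFF
        lead = 0xD800 + (v >> 10)          # high 10 bits -> 0xD800..0xDBFF
        trail = 0xDC00 + (v & 0x3FF)       # low 10 bits  -> 0xDC00..0xDFFF
        return lead, trail

    # U+1F600 -> (0xD83D, 0xDE00); compare "\U0001F600".encode("utf-16-be").hex()
    print([hex(u) for u in to_surrogate_pair(0x1F600)])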

UTF-16 Perl input/output

Submitted by Anonymous (unverified) on 2019-12-03 08:54:24
Question: I am writing a script that takes a UTF-16 encoded text file as input and outputs a UTF-16 encoded text file.

    use open "encoding(UTF-16)";
    open INPUT, "< input.txt" or die "cannot open > input.txt: $!\n";
    open(OUTPUT, "> output.txt");
    while(<INPUT>) {
        print OUTPUT "$_\n"
    }

Let's just say that my program writes everything from input.txt into output.txt. This works perfectly fine in my Cygwin environment, which is using "This is perl 5, version 14, subversion 2 (v5.14.2) built for cygwin-thread-multi-64int". But in my Windows environment, which…

Convert between string, u16string & u32string

Submitted by Anonymous (unverified) on 2019-12-03 08:52:47
Question: I've been looking for a way to convert between the Unicode string types and came across this method. Not only do I not completely understand the method (there are no comments), but the article also implies that better methods will be available in future. If this is the best method, could you please point out what makes it work; if not, I would like to hear suggestions for better methods.

Answer 1: mbstowcs() and wcstombs() don't necessarily convert to UTF-16 or UTF-32; they convert to wchar_t and whatever the locale's wchar_t encoding is. All…
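For illustration only (not the method from the linked article): one standard-library way to round-trip between a UTF-8 std::string and a std::u16string is std::wstring_convert with std::codecvt_utf8_utf16. Note that these facilities were deprecated in C++17, so treat this as a sketch rather than a recommendation:

    #include <codecvt>
    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

        std::string utf8 = "caf\xC3\xA9";               // "café" as raw UTF-8 bytes
        std::u16string utf16 = conv.from_bytes(utf8);   // UTF-8  -> UTF-16
        std::string back = conv.to_bytes(utf16);        // UTF-16 -> UTF-8

        std::cout << std::boolalpha << (utf8 == back) << '\n';  // true
    }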