utf-16

Unicode in Python - just UTF-16?

荒凉一梦 submitted on 2019-12-03 08:17:35
I was happy in my Python world knowing that I was doing everything in Unicode and encoding as UTF-8 when I needed to output something to a user. Then one of my colleagues sent me this article on UTF-8 and it confused me. The author of the article indicates a number of times that UCS-2, the Unicode representation that Python uses, is synonymous with UTF-16. He even goes as far as directly saying Python uses UTF-16 for internal string representation. The author also admits to being a Windows lover and developer and states that the way MS has handled character encodings over the years has led to…
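A quick way to see which representation a given interpreter actually uses is to check sys.maxunicode; a minimal sketch (the sample code point is illustrative):

    import sys

    # On a "narrow" build (pre-3.3 CPython compiled with 2-byte storage),
    # sys.maxunicode is 0xffff and non-BMP characters are stored as
    # surrogate pairs, i.e. UTF-16-like behaviour, not true UCS-2.
    print(hex(sys.maxunicode))   # 0x10ffff on wide builds and Python 3.3+

    s = u'\U0001F600'            # one code point outside the BMP
    print(len(s))                # 1 on wide builds, 2 on narrow builds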

Difference between composite characters and surrogate pairs

末鹿安然 submitted on 2019-12-03 08:14:27
In Unicode, what is the difference between composite characters and surrogate pairs? To me they sound like similar things - two characters to represent one character. What differentiates these two concepts? Kerrek SB: Surrogate pairs are a weird wart in Unicode. Unicode itself is nothing other than an abstract assignment of meaning to numbers. That's what an encoding is. Capital-letter-A, Greek-alternate-terminal-sigma, Klingon-closing-bracket-2, etc. Currently, numbers up to about 2^21 are available, though not all are in use. In the context of Unicode, each number is known as a code point.
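The distinction is easy to demonstrate in Python; a minimal sketch (the sample characters are illustrative). A surrogate pair is an artifact of the UTF-16 encoding of a single code point, while a composite character is several real code points combined at the text level:

    # -*- coding: utf-8 -*-
    import unicodedata

    emoji = u'\U0001F600'                        # one code point above U+FFFF
    print(len(emoji.encode('utf-16-be')) // 2)   # 2 UTF-16 code units: a surrogate pair

    e_acute = u'e\u0301'                         # 'e' plus combining acute: 2 code points
    composed = unicodedata.normalize('NFC', e_acute)
    print(len(e_acute), len(composed))           # 2 1: NFC composes them into U+00E9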

Python Character Encoding

家住魔仙堡 submitted on 2019-12-03 07:16:16
1.1. ASCII
ASCII (American Standard Code for Information Interchange) is a single-byte encoding. In the beginning the computing world had only English, and a single byte can represent 256 distinct characters - enough for every English character and many control symbols. ASCII, however, only uses the lower half of that range (below \x80), and this is the foundation on which MBCS was built.

1.2. MBCS
Other languages soon arrived in the computing world, and single-byte ASCII could no longer meet the demand. Each language then defined an encoding of its own. Because a single byte can represent too few characters, and compatibility with ASCII had to be kept, these encodings all use multiple bytes to represent characters, e.g. GBxxx, BIGxxx, and so on. Their rule: if the first byte is below \x80, it still represents an ASCII character; if it is \x80 or above, it combines with the following byte (two bytes in total) to represent one character, the following byte is skipped, and scanning continues. IBM invented the concept of the Code Page, collected these encodings, and assigned each a page number; GBK is page 936, i.e. CP936, so CP936 can also be used to refer to GBK. MBCS (Multi-Byte Character Set) is the umbrella term for these encodings. Since everyone so far has used two bytes, it is sometimes also called DBCS (Double-Byte Character Set). It must be made clear that MBCS is not one specific encoding…
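A short sketch of the lead-byte rule described above (the sample string is illustrative); note that GBK and CP936 name the same codec in Python:

    # -*- coding: utf-8 -*-
    s = u'中a'                             # one Chinese character, then ASCII 'a'
    data = s.encode('gbk')
    print(data == s.encode('cp936'))       # True: two names for the same code page
    print([hex(b) for b in bytearray(data)])
    # ['0xd6', '0xd0', '0x61']: the lead byte >= 0x80 opens a two-byte
    # character, while 0x61 ('a'), below 0x80, is a plain ASCII byte.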

What Character Encoding is best for multinational companies

拈花ヽ惹草 submitted on 2019-12-03 06:08:26
If you had a website that was to be translated into every language in the world, and therefore had a database with all these translations, what character encoding would be best? UTF-128? If so, do all browsers understand the chosen encoding? Is character encoding straightforward to implement, or are there hidden factors? Thanks in advance. If you want to support a variety of languages for web content, you should use an encoding that covers the entire Unicode range. The best choice for this purpose is UTF-8. UTF-8 is the preferred encoding for the web; from the HTML5 draft standard: Authors are…
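As a quick illustration that one UTF-8 byte stream can carry every script at once (the sample strings are illustrative):

    # -*- coding: utf-8 -*-
    samples = [u'English', u'Ελληνικά', u'中文', u'العربية', u'हिन्दी']
    blob = u'\n'.join(samples).encode('utf-8')          # one encoding for all scripts
    print(blob.decode('utf-8') == u'\n'.join(samples))  # True: round-trips losslessly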

What's the point of UTF-16?

十年热恋 submitted on 2019-12-03 04:02:21
Question: I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32, since UTF-16 is still variable length. If you don't need this, then UTF-16 seems like a colossal waste of space compared to UTF-8. What are the advantages of UTF-16 over UTF-8 and UTF-32, and why do Windows and Java use it as their native encoding? Answer 1: When Windows NT was designed, UTF-16 didn't exist (NT 3.51 was born…
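The trade-off in the question can be made concrete by counting code units for one string under each encoding; a minimal sketch (the sample text is illustrative):

    # -*- coding: utf-8 -*-
    s = u'hello \U0001F600'   # 7 code points: six ASCII plus one astral character
    for enc, unit in [('utf-8', 1), ('utf-16-le', 2), ('utf-32-le', 4)]:
        data = s.encode(enc)
        # Only UTF-32 yields exactly 7 code units, one per code point,
        # so only it offers true random access by index.
        print(enc, len(data), 'bytes =', len(data) // unit, 'code units')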

At all times, text encoded in UTF-8 will never give us more than a +50% file size of the same text encoded in UTF-16. True / false?

Anonymous (unverified) submitted on 2019-12-03 02:50:02
Question: Somewhere I read (rephrased): If we compare a UTF-8 encoded file vs a UTF-16 encoded file, at some times the UTF-8 file may give a 50% to 100% larger file size. Am I right to say that the article is wrong, because at all times, text encoded in UTF-8 will never give us more than a +50% file size over the same text encoded in UTF-16? Answer 1: The answer is that in UTF-8, ASCII is just 1 byte, but that in general, most Western languages including English use a few characters here and there that require 2 bytes, so actual percentages vary. The Greek…
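A worked comparison shows where the numbers come from (a minimal sketch; the sample strings are illustrative). ASCII halves in size under UTF-8, Greek roughly ties, and CJK text is where UTF-8 reaches the +50% ceiling:

    # -*- coding: utf-8 -*-
    samples = {'ascii': u'hello world', 'greek': u'καλημέρα κόσμε', 'cjk': u'你好世界'}
    for name, s in samples.items():
        u8, u16 = len(s.encode('utf-8')), len(s.encode('utf-16-le'))
        print(name, u8, u16, round(float(u8) / u16, 2))
    # ascii: 11 vs 22 bytes (0.5); greek: 27 vs 28 (about 1.0);
    # cjk: 12 vs 8 (1.5): three UTF-8 bytes against two UTF-16 bytes per
    # BMP character is the worst case, hence never more than +50%.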

What does “The .NET framework uses the UTF-16 encoding standard by default” mean?

Anonymous (unverified) submitted on 2019-12-03 02:47:02
Question: My study guide (for the 70-536 exam) says this twice in the text and encoding chapter, which is right after the IO chapter. All the examples so far have to do with simple file access using FileStream and StreamWriter. It also says stuff like "If you don't know what encoding to use when you create a file, don't specify one and .NET will use UTF16" and "Specify different encodings using Stream constructor overloads". Never mind the fact that the actual overloads are on the StreamWriter class, but hey, whatever. I am looking at StreamWriter…

UTF-16 file seeking in Python. How?

Anonymous (unverified) submitted on 2019-12-03 02:02:01
Question: For some reason I cannot seek in my UTF-16 file. It produces 'UnicodeException: UTF-16 stream does not start with BOM'. My code:

    f = codecs.open(ai_file, 'r', 'utf-16')
    seek = self.ai_map[self._cbClass.Text]  # seek is a valid int
    f.seek(seek)
    while True:
        ln = f.readline().strip()

I tried random stuff like first reading something from the stream; it didn't help. I checked the offset being seeked to using a hex editor: the string starts at a character, not a null byte (I guess that's a good sign, right?). So how do I seek in a UTF-16 file in Python? Answer 1: Well, the error message is…
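In brief, the bare 'utf-16' codec expects a BOM at the start of whatever it decodes, and after f.seek() it is handed the middle of the file instead. A minimal sketch of one workaround, assuming a little-endian file (the file name and byte offset are illustrative):

    import codecs

    # 'utf-16-le' (or 'utf-16-be') never looks for a BOM, so seeking works.
    with codecs.open('test.txt', 'r', 'utf-16-le') as f:
        f.seek(34)   # a byte offset: keep it even, and past the 2-byte BOM
        print(f.readline().strip())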

How to read utf16 text file to string in golang?

扶醉桌前 submitted on 2019-12-03 01:56:22
I can read the file into a byte array, but when I convert it to a string it treats the UTF-16 bytes as ASCII. How do I convert it correctly?

    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        // open the file and read its first line
        f, err := os.Open("test.txt")
        if err != nil {
            fmt.Printf("error opening file: %v\n", err)
            os.Exit(1)
        }
        r := bufio.NewReader(f)
        s, b, e := r.ReadLine()
        if e == nil {
            fmt.Println(b)         // isPrefix flag
            fmt.Println(s)         // raw bytes, including the BOM and NUL bytes
            fmt.Println(string(s)) // naive conversion: bytes taken as-is
        }
    }

output:

    false
    [255 254 91 0 83 0 99 0 114 0 105 0 112 0 116 0 32 0 73 0 110 0 102 0 111 0 93 0 13 0]
    S c r i p t I n f o ]

Update: After I tested the two examples, I have…

converting utf-16 -> utf-8 AND remove BOM

Anonymous (unverified) submitted on 2019-12-03 01:52:01
Question: We have a data entry person who encoded in UTF-16 on Windows and would like to have UTF-8 and remove the BOM. The UTF-8 conversion works, but the BOM is still there. How would I remove it? This is what I currently have:

    batch_3 = {'src': '/Users/jt/src', 'dest': '/Users/jt/dest/'}
    batches = [batch_3]

    for b in batches:
        s_files = os.listdir(b['src'])
        for file_name in s_files:
            ff_name = os.path.join(b['src'], file_name)
            if os.path.isfile(ff_name) and ff_name.endswith('.json'):
                print ff_name
                target_file_name = os.path.join(b['dest'], file_name)
                BLOCKSIZE =…
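A minimal sketch of the conversion step the loop is building toward (the function name, paths, and BLOCKSIZE are illustrative): the 'utf-16' decoder consumes the BOM on read, so the fix is simply to write with plain 'utf-8', which never emits one ('utf-8-sig' would add it back):

    import codecs

    BLOCKSIZE = 1048576   # illustrative: copy in 1 MiB chunks

    def utf16_to_utf8(src_path, dest_path):
        with codecs.open(src_path, 'r', 'utf-16') as src, \
             codecs.open(dest_path, 'w', 'utf-8') as dest:
            while True:
                chunk = src.read(BLOCKSIZE)
                if not chunk:
                    break
                dest.write(chunk)   # decoded text re-encodes without a BOM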