utf-16

Is the XML declaration tag case sensitive?

柔情痞子 submitted on 2019-12-30 18:28:10
Question: I have what is probably a really simple, stupid question, but I can't find an answer to it anywhere and I need to be pretty sure about this. I have various XML files from various vendors. One of the vendors provides me an XML file with Japanese characters in it. Originally I was having trouble processing the XML file (I'm using the MSXML SDK): the characters would come out wrong. I found that if the following was added to the XML file, everything worked great. <?xml version="1.0" encoding
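To the title question: yes, it is case-sensitive. The declaration must begin with lowercase <?xml (<?XML is rejected), and its encoding pseudo-attribute, together with any BOM, tells the parser how to decode the bytes. Below is a minimal MSXML loading sketch, not the asker's code; the file name vendor.xml and the Shift_JIS value in the comment are hypothetical, since the asker's actual encoding value is truncated above.

```cpp
#include <windows.h>
#include <msxml6.h>
#include <comdef.h>

// MSXML honors a declaration such as <?xml version="1.0"
// encoding="Shift_JIS"?> (hypothetical value) when decoding the bytes,
// which is why adding one fixed the Japanese characters.
int main() {
    CoInitialize(nullptr);
    IXMLDOMDocument* doc = nullptr;
    HRESULT hr = CoCreateInstance(__uuidof(DOMDocument60), nullptr,
                                  CLSCTX_INPROC_SERVER,
                                  __uuidof(IXMLDOMDocument),
                                  reinterpret_cast<void**>(&doc));
    if (SUCCEEDED(hr)) {
        VARIANT_BOOL ok = VARIANT_FALSE;
        doc->load(_variant_t(L"vendor.xml"), &ok);  // honors the declaration
        // ... walk the DOM here ...
        doc->Release();
    }
    CoUninitialize();
}
```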

Read Unicode files C++

夙愿已清 submitted on 2019-12-30 06:23:08
Question: I have a simple question to ask. I have a UTF-16 text file to read which starts with FFFE. What are the C++ tools for dealing with this kind of file? I just want to read it, filter some lines, and display the result. It looks simple, but I only have experience working with plain ASCII files and I'm in a hurry. I'm using VS C++, but I don't want to work with managed C++. Regards. Here I put a very simple example: wifstream file; file.open("C:\\appLog.txt", ios::in); wchar_t buffer[2048]; file.seekg
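A minimal sketch of one standard-library route (C++11 <codecvt>, deprecated since C++17 but still available; the ERROR filter is a hypothetical stand-in for whatever lines need keeping): imbue the wide stream with a UTF-16 facet that consumes the FF FE BOM, then use getline as with a narrow file.

```cpp
#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main() {
    // Binary mode so the CRT does not mangle the UTF-16 byte stream.
    std::wifstream file("C:\\appLog.txt", std::ios::binary);
    // std::consume_header eats the FF FE BOM and picks the byte order.
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    std::wstring line;
    while (std::getline(file, line)) {
        if (line.find(L"ERROR") != std::wstring::npos)  // filter some lines
            std::wcout << line << L'\n';
    }
}
```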

The differences between UNICODE, GBK, and UTF-8

流过昼夜 submitted on 2019-12-29 13:53:48
The differences between UNICODE, GBK, and UTF-8. Put simply, Unicode, GBK, and Big5 are encoded values, while UTF-8, UTF-16, and the like are representations of those values. The former three encodings are mutually incompatible: the same Chinese character has completely different code values in each. For example, the Unicode value of "汉" is not the same as its GBK value; suppose its Unicode value is a040 and its GBK value is b030. A UTF-8 code is then just a way of representing such a value, and UTF-8 is organized entirely around Unicode, so to convert GBK to UTF-8 you must first convert to Unicode and then to UTF-8. For details, see the reposted article below. On Unicode: a brief explanation of UCS, UTF, BMP, BOM, and other terms. This is a light read written by a programmer for programmers; "light" meaning you can fairly easily pick up some concepts that were unclear before and level up your knowledge, a bit like in an RPG. Two questions motivated this write-up. Question 1: using Windows Notepad's "Save As", you can convert among the GBK, Unicode, Unicode big endian, and UTF-8 encodings. They are all txt files, so how does Windows recognize which encoding a file uses? I noticed long ago that txt files encoded as Unicode, Unicode big endian, and UTF-8 have a few extra bytes at the beginning: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard are these markers based on? Question 2
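The Notepad behavior in Question 1 is plain BOM sniffing. A short sketch (the helper name is hypothetical) that reproduces the detection described above:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: guess an encoding from the first bytes of a file,
// the way Notepad distinguishes its "Save As" formats.
std::string sniff_bom(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    unsigned char b[3] = {0, 0, 0};
    f.read(reinterpret_cast<char*>(b), 3);
    if (b[0] == 0xFF && b[1] == 0xFE) return "UTF-16LE (Notepad: Unicode)";
    if (b[0] == 0xFE && b[1] == 0xFF) return "UTF-16BE (Unicode big endian)";
    if (b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8 with BOM";
    return "no BOM (ANSI/GBK assumed)";
}
```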

Firefox and UTF-16 encoding

痞子三分冷 submitted on 2019-12-29 08:51:43
Question: I'm building a website with the encoding UTF-16. That means every file (html, jsp) is encoded in UTF-16, and I set in the head of every HTML page: <meta http-equiv="content-type" content="text/html; charset=UTF-16"> My index page is correctly displayed by Chrome and IE. However, Firefox doesn't render the index. It displays 2 strange characters and the full index page code: ��<!DOCTYPE html> <html> <head> <meta http-equiv="content-type" content="text/html; charset=UTF-16"> ... Do you know
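Chrome and IE are forgiving here; the HTML spec does not allow declaring UTF-16 via an in-page <meta> (the prescan only reads ASCII-compatible bytes), so the BOM is the reliable signal, and the two strange characters are typically a BOM decoded under the wrong charset. A sketch, with placeholder file name and markup, of writing the page as genuine UTF-16LE with a leading FF FE BOM:

```cpp
#include <fstream>
#include <string>

// Sketch: write the page as real UTF-16LE bytes with a leading BOM.
// File name and markup are placeholders; assumes a little-endian host
// so that char16_t units are already in LE byte order.
int main() {
    std::u16string html =
        u"<!DOCTYPE html><html><head></head><body>index</body></html>";
    std::ofstream out("index.html", std::ios::binary);
    out.put('\xFF').put('\xFE');  // UTF-16LE BOM: the signal browsers trust
    out.write(reinterpret_cast<const char*>(html.data()),
              static_cast<std::streamsize>(html.size() * sizeof(char16_t)));
}
```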

Platform-specific Unicode semantics in Python 2.7

萝らか妹 submitted on 2019-12-29 08:09:11
Question: Ubuntu 11.10: $ python Python 2.7.2+ (default, Oct 4 2011, 20:03:08) [GCC 4.6.1] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> x = u'\U0001f44d' >>> len(x) 1 >>> ord(x[0]) 128077 Windows 7: Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information. >>> x = u'\U0001f44d' >>> len(x) 2 >>> ord(x[0]) 55357 My Ubuntu experience is with the default interpreter in the
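The difference is the internal code unit: a narrow (UCS-2) build such as the Windows one stores U+1F44D as two UTF-16 surrogates, while a wide (UCS-4) build such as Ubuntu's stores it as one code point. A sketch of the surrogate arithmetic, which reproduces the 55357 printed on Windows:

```cpp
#include <cstdint>
#include <cstdio>

// Split a supplementary code point into a UTF-16 surrogate pair; this is
// what a narrow Python 2 build stores, so len() is 2 and ord(x[0]) is the
// high surrogate.
int main() {
    std::uint32_t cp = 0x1F44D;                // the 👍 code point
    std::uint32_t v  = cp - 0x10000;
    std::uint16_t hi = 0xD800 + (v >> 10);     // high surrogate
    std::uint16_t lo = 0xDC00 + (v & 0x3FF);   // low surrogate
    std::printf("%u %u\n", hi, lo);            // prints: 55357 56397
}
```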

HTML5 UTF-8 garbled Chinese text

≡放荡痞女 submitted on 2019-12-28 00:44:23
<!DOCTYPE html> <html> <head> <meta charset="UTF-8"> <title>HTML5的标题</title> </head> <body> <p>HTML5的内容!Hello</p> </body> </html> I wrote this in Notepad, and after saving it, opening it in the browser actually produced garbled text. Switching to GB2312 displayed the Chinese correctly: <!DOCTYPE html> <html> <head> <meta charset="GB2312"> <title>HTML5的标题</title> </head> <body> <p>HTML5的内容!Hello</p> </body> </html> But the standards do differ, and UTF-8 is still the one to use. In the end I found there was nothing wrong with the code at all; the problem was Notepad itself. <meta charset="utf-8"> only tells the browser to interpret the page as UTF-8, while the document's actual encoding is decided by the choice you make when saving. If you save as ANSI and the browser then interprets it as UTF-8, you will certainly get garbled text. Notepad's default save format is ANSI, so change it to UTF-8 when saving; anyone writing pages in Notepad should take note. Done! Background: the differences and relationships among UTF-8, GBK, UTF8, and GB2312. UTF-8: Unicode Transformation Format-8bit
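This ties back to the rule from the earlier entry: conversion between the GBK family and UTF-8 always goes through Unicode, and resaving an ANSI file as UTF-8 is exactly that round trip. A Win32 sketch of it (the helper name is hypothetical):

```cpp
#include <windows.h>
#include <string>

// Hypothetical helper: re-encode a string from the ANSI code page (GBK on
// a Chinese-locale Windows) to UTF-8 by way of UTF-16, i.e. what "save as
// UTF-8" does under the hood.
std::string ansi_to_utf8(const std::string& ansi) {
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi.data(),
                                   (int)ansi.size(), nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.data(), (int)ansi.size(),
                        &wide[0], wlen);
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                                   nullptr, 0, nullptr, nullptr);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                        &utf8[0], ulen, nullptr, nullptr);
    return utf8;
}
```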

json string with utf16 char cannot convert from 'const char [566]' to 'std::basic_string<_Elem,_Traits,_Ax>'

被刻印的时光 ゝ submitted on 2019-12-25 09:00:34
Question: I have JSON that needs to test a string with a UTF-16 wide char in it, but I get the following error message: \..\test\TestClass.cpp(617): error C2440: 'initializing' : cannot convert from 'const char [566]' to 'std::basic_string<_Elem,_Traits,_Ax>' with [ _Elem=wchar_t, _Traits=std::char_traits<wchar_t>, _Ax=std::allocator<wchar_t> ] No constructor could take the source type, or constructor overload resolution was ambiguous. This is my json: static std::wstring& BAD_JSON5
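The usual cause, sketched below: a string literal without the L prefix has type const char[N], which no std::wstring constructor accepts. Prefixing it with L, and escaping the UTF-16 character rather than pasting raw bytes, resolves the C2440 (the payload here is a hypothetical stand-in for the original test data):

```cpp
#include <string>

// const char[566] -> std::wstring fails to compile; a wide literal works.
// The JSON body below is a hypothetical stand-in for the original data.
static const std::wstring BAD_JSON5 =
    L"{ \"name\": \"\u00FC\" }";  // \u00FC: escaped non-ASCII character
```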

Stream-process UTF-16 file with BOM and Unix line endings in Windows perl

你。 submitted on 2019-12-25 01:16:17
Question: I need to stream-process, using Perl, a 1 GB text file encoded in UTF-16 little-endian with Unix-style line endings (i.e., 0x000A only, without 0x000D in the stream) and an LE BOM at the beginning. The file is processed on Windows (Unix solutions are needed as well). By stream-process I mean using while (<>), line-by-line reading and writing. It would be nice to have a command-line one-liner like: perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt Hex dump of input for testing (two lines:
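The Perl answer is cut off above; as a hedged C++ analogue under the same assumptions (UTF-16LE with an LE BOM, LF-only endings, and SRC/DST as placeholder patterns), using the same <codecvt> facets as in the earlier entry:

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Sketch: consume the FF FE BOM on input, regenerate it on output, and
// rewrite line by line; L'\n' round-trips as a bare 0x000A, preserving
// the Unix-style endings.
int main() {
    std::wifstream in("infile.txt", std::ios::binary);
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            static_cast<std::codecvt_mode>(
                std::consume_header | std::little_endian)>));
    std::wofstream out("outfile.txt", std::ios::binary);
    out.imbue(std::locale(out.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            static_cast<std::codecvt_mode>(
                std::generate_header | std::little_endian)>));
    std::wstring line;
    while (std::getline(in, line)) {
        std::size_t pos;
        while ((pos = line.find(L"SRC")) != std::wstring::npos)
            line.replace(pos, 3, L"DST");  // the s/SRC/DST/g of the one-liner
        out << line << L'\n';
    }
}
```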

Why does mbstowcs return “invalid multibyte character”

隐身守侯 submitted on 2019-12-24 11:36:14
问题 "קמ"ד חיר!" is the input string copy pasted from a print of the variable in gdb. Calling mbstowcs returns -1 with the other input as NULL. Any ideas on what's wrong/how to fix this? "\327\247\327\236"\327\223 \327\227\327\231\327\250!\000\000\000" is the string with non ascii characters in octal The programs locale is C. 回答1: The mbtowcs function doesn't handle UTF-8 encoding, there isn't a locale you can set to have it translate UTF-8 to wchar_t. Therefore, I'll use Windows examples but the

UTF-16 to ASCII ignoring characters with decimal value greater than 127

烈酒焚心 submitted on 2019-12-24 11:28:12
Question: I know there are quite a few solutions to this problem, but mine is peculiar in the sense that I might get truncated UTF-16 data and yet have to make a best effort at conversions where decode and encode would fail with UnicodeDecodeError. So I came up with the following code in Python. Please let me know your comments on how I can improve it for faster processing. try: # conversion to ascii if utf16 data is formatted correctly input = open(filename).read().decode('UTF16')
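The Python is cut off above; as a hedged C++ sketch of the same best-effort idea (the function name is hypothetical): decode byte pairs as UTF-16LE, skip the BOM, and drop non-ASCII units and any dangling final byte instead of raising an error. Read the file with std::ios::binary and pass the raw bytes in.

```cpp
#include <cstdint>
#include <string>

// Best-effort UTF-16LE -> ASCII: tolerant of truncated input. Surrogates
// and all other units above 127 are simply discarded, per the title.
std::string utf16le_to_ascii(const std::string& raw) {
    std::string out;
    for (std::size_t i = 0; i + 1 < raw.size(); i += 2) {  // skips odd tail
        std::uint16_t u = std::uint8_t(raw[i]) |
                          (std::uint16_t(std::uint8_t(raw[i + 1])) << 8);
        if (u == 0xFEFF) continue;          // BOM
        if (u < 128) out += char(u);        // keep ASCII, drop the rest
    }
    return out;
}
```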