utf-16

u16string and char16_t in Android NDK

Submitted by 旧时模样 on 2019-12-23 03:52:06
Question: I wish to create (std::getline()) and manipulate UTF-16 strings in the Android NDK, so that I can pass them (relatively) painlessly back to Java for display. Currently I'm using C++0x, via the LOCAL_CPPFLAGS := -std=c++0x switch, which works (I'm using some other 0x functions). The compiler can't seem to find u16string. I've included <string> and get no other errors. I wish to do something such as: ifstream file(fileName); if(!file.is_open()) { return false; } while(!file.eof()) {
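
For reference, here is a minimal sketch (not from the question) of one way to end up with std::u16string data on a C++11 toolchain that does ship char16_t support: read the file as UTF-8 bytes and convert each line with std::wstring_convert. The helper name readUtf8FileAsUtf16 and the assumption that the input file is UTF-8 are illustrative only, and codecvt_utf8_utf16 may itself be missing from the old NDK toolchain described above.

    #include <codecvt>   // std::codecvt_utf8_utf16 (C++11; deprecated in C++17)
    #include <fstream>
    #include <locale>    // std::wstring_convert
    #include <string>
    #include <vector>

    // Hypothetical helper: read a UTF-8 text file line by line and convert
    // each line to a std::u16string (UTF-16 code units), e.g. to hand back
    // to Java through JNI's NewString().
    std::vector<std::u16string> readUtf8FileAsUtf16(const char* fileName) {
        std::vector<std::u16string> lines;
        std::ifstream file(fileName);
        if (!file.is_open()) {
            return lines;
        }
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
        std::string line;
        while (std::getline(file, line)) {   // test the stream state, not eof()
            lines.push_back(conv.from_bytes(line));
        }
        return lines;
    }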

How to convert from utf-16 to utf-32 on Linux with std library?

Submitted by 时光毁灭记忆、已成空白 on 2019-12-22 11:28:49
Question: On MSVC, converting UTF-16 to UTF-32 is easy with C++11's codecvt_utf16 locale facet. But in GCC (gcc (Debian 4.7.2-5) 4.7.2) this new feature apparently hasn't been implemented yet. Is there a way to perform such a conversion on Linux without iconv (preferably using the standard library's conversion tools)? Answer 1: Decoding UTF-16 into UTF-32 is extremely easy. You may want to detect at compile time the libc version you're using, and deploy your conversion routine if you detect a broken libc (without the
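
A sketch of the kind of hand-rolled conversion routine the answer is alluding to (the routine itself is cut off in this excerpt): combine surrogate pairs into single code points and pass everything else through. It assumes well-formed UTF-16; real code should decide how to report unpaired surrogates instead of copying them through.

    #include <cstddef>
    #include <string>

    // Convert UTF-16 code units to UTF-32 code points.
    // Assumes well-formed input; unpaired surrogates are copied through as-is.
    std::u32string utf16_to_utf32(const std::u16string& in) {
        std::u32string out;
        out.reserve(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {
            char32_t c = in[i];
            // A high surrogate (0xD800-0xDBFF) followed by a low surrogate
            // (0xDC00-0xDFFF) encodes one code point above U+FFFF.
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size()) {
                char32_t low = in[i + 1];
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                    ++i;
                }
            }
            out.push_back(c);
        }
        return out;
    }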

fatal error: high- and low-surrogate code points are not valid Unicode scalar values [duplicate]

Submitted by 五迷三道 on 2019-12-22 07:06:48
Question: This question already has answers here: How can I generate a random unicode character in Swift? (2 answers) Closed 4 years ago. Sometimes, initializing a UnicodeScalar with a value like 57292 yields the following error: fatal error: high- and low-surrogate code points are not valid Unicode scalar values. What is this error, why does it occur, and how can I prevent it in the future? Answer 1: Background: UTF-16 represents a sequence of Unicode characters ("code points") as a sequence of 16-bit
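
The value in the error message is easy to check by hand: 57292 is 0xDFCC, which falls in the low-surrogate range 0xDC00-0xDFFF, and surrogate code points exist only as halves of UTF-16 pairs, so they are not Unicode scalar values. A small C++ sketch of that range check (C++ rather than Swift, purely to make the ranges from the answer concrete):

    #include <cstdint>
    #include <cstdio>

    // A Unicode scalar value is any code point from U+0000 to U+10FFFF
    // excluding the surrogate range U+D800..U+DFFF.
    bool is_unicode_scalar(std::uint32_t cp) {
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }

    int main() {
        std::printf("%d\n", is_unicode_scalar(57292));   // 0xDFCC -> 0 (low surrogate)
        std::printf("%d\n", is_unicode_scalar(0x1F600)); // valid code point -> 1
        return 0;
    }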

What is the difference between “UTF-16” and “std::wstring”?

Submitted by 帅比萌擦擦* on 2019-12-22 03:47:32
Question: Is there any difference between these two string storage formats? Answer 1: std::wstring is a container of wchar_t. The size of wchar_t is not specified: Windows compilers tend to use a 16-bit type, Unix compilers a 32-bit type. UTF-16 is a way of encoding sequences of Unicode code points in sequences of 16-bit integers. Using Visual Studio, if you use wide character literals (e.g. L"Hello World") that contain no characters outside of the BMP, you'll end up with UTF-16, but mostly the two
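
A small illustration of the distinction, sketched under the assumption of a C++11 compiler: char16_t and std::u16string always use 16-bit code units (and u"..." literals are UTF-16 by definition), while wchar_t and std::wstring are whatever width the platform chose.

    #include <iostream>
    #include <string>

    int main() {
        // Always 16-bit code units; u"..." literals are UTF-16 encoded.
        std::u16string s16 = u"Hello World";
        // Platform dependent: typically 2 bytes per wchar_t on Windows (UTF-16),
        // 4 bytes on Linux/macOS (usually UTF-32).
        std::wstring ws = L"Hello World";

        std::cout << "sizeof(char16_t) = " << sizeof(char16_t) << '\n'
                  << "sizeof(wchar_t)  = " << sizeof(wchar_t) << '\n'
                  << "u16string code units: " << s16.size() << '\n'
                  << "wstring code units:   " << ws.size() << '\n';
        return 0;
    }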

How can I convert UTF-16 to UTF-32 in java?

Submitted by 我与影子孤独终老i on 2019-12-21 22:01:12
Question: I have looked for solutions, but there doesn't seem to be much on this topic. I have found solutions that suggest:

    String unicodeString = new String("utf8 here");
    byte[] bytes = unicodeString.getBytes("UTF8");
    String converted = new String(bytes, "UTF16");

for converting to UTF-16 from UTF-8; however, Java doesn't handle "UTF32", which makes this solution unviable. Does anyone know any other way to achieve this? Answer 1: Java does handle UTF-32; try this test: byte[] a = "1".getBytes("UTF-32");

What is the Unicode U+001A Character? Aka 0x1A

Submitted by 自作多情 on 2019-12-20 09:48:47
Question: The U+001A character appears frequently in error messages relating to character encoding. What is the U+001A character? Answer 1: U+001A is defined in the Unicode Standard as a control character with the name SUBSTITUTE, and it belongs to a group characterized as follows, in chapter 16 of the standard: “There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022 framework [...] The Unicode Standard provides for the

Is the [0xff, 0xfe] prefix required on utf-16 encoded strings?

Submitted by 谁都会走 on 2019-12-20 06:23:23
Question: Rewritten question! I am working with a vendor's device that requires "unicode encoding" of strings, where each character is represented in two bytes. My strings will always be ASCII based, so I thought this would be the way to translate my string into the vendor's string:

    >>> b1 = 'abc'.encode('utf-16')

But examining the result, I see that there's a leading [0xff, 0xfe] on the bytearray:

    >>> [hex(b) for b in b1]
    ['0xff', '0xfe', '0x61', '0x0', '0x62', '0x0', '0x63', '0x0']

Since the vendor's

Python - Python 3.1 can't seem to handle UTF-16 encoded files?

Submitted by 烈酒焚心 on 2019-12-20 05:34:09
Question: I'm trying to run some code to simply go through a bunch of files and write those that happen to be .txt files into the same file, removing all the spaces. Here's some simple code that should do the trick:

    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            if '.txt' in file:
                f = open(subdir+'/'+file, 'r')
                line = f.readline()
                while line:
                    line2 = line.split()
                    if line2:
                        output_file.write(" ".join(line2)+'\n')
                    line = f.readline()
                f.close()

But instead, I get the following error: File

Why must I specify charset attributes for my <script> tags?

Submitted by 旧街凉风 on 2019-12-19 11:49:15
Question: I have a bit of an odd situation:

- The main HTML page is served in the UTF-16 character set (due to some requirements out of scope for this question)
- The HTML page uses <script> tags to load external scripts (i.e. they have src attributes)
- Those external scripts are in US-ASCII/UTF-8
- The web server is serving the scripts with the content type "application/javascript" and no character set hints
- The scripts have no byte-order mark (BOM)

When loading the page described above, both Firefox and Chrome