utf-16 | 易学教程

Size of wchar_t* for surrogate pair (Unicode character out of BMP) on Windows

Question: I have encountered an interesting issue on Windows 8. I tested that I can represent Unicode characters outside the BMP with wchar_t* strings. The following test code produced unexpected results for me:

    const wchar_t* s1 = L"a";
    const wchar_t* s2 = L"\U0002008A"; // The "Han" character
    int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.
    int i2 = sizeof(s1); // i2 == 4, because of the terminating '\0' (I guess).
    int i3 = sizeof(s2); // i3 == 4, why?

The U+2008A is the Han
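For reference, sizeof applied to a wchar_t* yields the size of the pointer itself (4 bytes in a 32-bit build), regardless of what the string contains. The sketch below uses Python rather than the question's C++, and only illustrates how many 16-bit code units U+2008A occupies in UTF-16. The helper name utf16_code_units is hypothetical.

```python
def utf16_code_units(text):
    """Return how many 16-bit code units the UTF-16 encoding of text needs."""
    # utf-16-le writes no BOM, so every 2 bytes is exactly one code unit.
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(ch):
    """Return the (high, low) surrogates for a single non-BMP character."""
    raw = ch.encode("utf-16-le")
    high = int.from_bytes(raw[0:2], "little")
    low = int.from_bytes(raw[2:4], "little")
    return high, low


print(utf16_code_units("a"))           # 1
print(utf16_code_units("\U0002008A"))  # 2
print([hex(u) for u in surrogate_pair("\U0002008A")])  # ['0xd840', '0xdc8a']
```

So the character itself takes 4 bytes of UTF-16 data (plus 2 for a terminator in a C string), which happens to coincide with the 32-bit pointer size reported in the question.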

How to convert a utf16 ushort array to a utf8 std::string?

Question: I'm currently writing a plugin that is just a wrapper around an existing library. The plugin's host passes me a UTF-16 string defined as follows:

    typedef unsigned short PA_Unichar;

The wrapped library accepts only a UTF-8 string, as a const char* or a std::string. I tried writing a conversion function like

    std::string toUtf8(const PA_Unichar* data) {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
        return std::string(convert.to_bytes(static_cast
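As a language-neutral illustration of the conversion being asked about, here is a minimal Python sketch. It assumes the input is a list of 16-bit code units in which a 0 value marks the end of the string, as a C-style PA_Unichar* buffer would. The function name utf16_units_to_utf8 is hypothetical and is not part of the plugin API.

```python
import struct


def utf16_units_to_utf8(units):
    """Convert a list of UTF-16 code units to UTF-8 bytes.

    Stops at the first 0 code unit if one is present (C-style terminator).
    """
    end = units.index(0) if 0 in units else len(units)
    # Pack the code units as little-endian 16-bit values, then let the
    # codec combine surrogate pairs into single code points.
    raw = struct.pack("<%dH" % end, *units[:end])
    return raw.decode("utf-16-le").encode("utf-8")


# 'a' followed by U+2008A as a surrogate pair, then the terminator.
print(utf16_units_to_utf8([0x0061, 0xD840, 0xDC8A, 0x0000]))
```

An unpaired surrogate in the input raises UnicodeDecodeError here; whether to reject or replace such input is a design choice the original question does not settle.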

C: Most efficient way to determine how many bytes will be needed for a UTF-16 string from a UTF-8 string

Question: I've seen some very clever code out there for converting between Unicode code points and UTF-8, so I was wondering if anybody has (or would enjoy devising) this: given a UTF-8 string, how many bytes are needed for the UTF-16 encoding of the same string? Assume the UTF-8 string has already been validated. It has no BOM, no overlong sequences, no invalid sequences, and is null-terminated. It is not CESU-8. Full UTF-16 with surrogates must be supported. Specifically I wonder if there are shortcuts to
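One possible approach, sketched here in Python rather than C, relies on the stated assumption that the input is already valid UTF-8. Under that assumption only lead bytes need to be inspected: 1-, 2- and 3-byte sequences encode BMP code points (one UTF-16 code unit, 2 bytes), while 4-byte sequences encode supplementary code points (a surrogate pair, 4 bytes). The function name utf16_byte_length is hypothetical, and the count excludes any BOM or terminator.

```python
def utf16_byte_length(utf8_bytes):
    """Return the number of bytes the UTF-16 encoding of valid UTF-8 needs."""
    total = 0
    for b in utf8_bytes:
        if b < 0x80 or 0xC0 <= b < 0xF0:
            # ASCII byte, or lead byte of a 2- or 3-byte sequence:
            # a BMP code point, one 16-bit code unit.
            total += 2
        elif b >= 0xF0:
            # Lead byte of a 4-byte sequence: a surrogate pair.
            total += 4
        # Continuation bytes (0x80..0xBF) contribute nothing.
    return total


sample = "a\u00e9\u4e16\U0002008A".encode("utf-8")
print(utf16_byte_length(sample))  # 2 + 2 + 2 + 4 = 10
```

The same per-byte classification carries over directly to a C loop over a null-terminated buffer.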

Is there encoding in Unicode where every “character” is just one code point?

Trying to rephrase: Can you map every combining character combination into one code point? I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct? Is this also true for the Basic Multilingual Plane?

If you mean one char == one number (i.e. where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but
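A small Python sketch can make the question concrete. NFC normalization composes a base letter and a combining mark into a single code point only when Unicode defines a precomposed form. The helper name nfc_length is hypothetical.

```python
import unicodedata


def nfc_length(text):
    """Return the number of code points in text after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


# 'e' + COMBINING ACUTE ACCENT composes to the single code point U+00E9.
print(nfc_length("e\u0301"))  # 1

# 'n' + COMBINING DIAERESIS has no precomposed form, so it stays two
# code points even though it is in the BMP and displays as one character.
print(nfc_length("n\u0308"))  # 2
```

So even a fixed-width encoding such as UCS-4 gives one number per code point, not one number per user-perceived character.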

Remove accents in string except “ñ”

Question: I have the following example code:

    var inputString = "ñaáme";
    inputString = inputString.Replace('ñ', '\u00F1');
    var normalizedString = inputString.Normalize(NormalizationForm.FormD);
    var result = Regex.Replace(normalizedString, @"[^ñÑa-zA-Z0-9\s]*", string.Empty);
    return result.Replace('\u00F1', 'ñ'); // naame :(

I need to normalize the text without removing the "ñ"s. I followed this example, but it is for Java and it has not worked for me. I want the result to be: "ñaame".

Answer 1: You may match
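The underlying idea can be sketched in Python (not the question's C#, and not necessarily the approach the truncated answer takes): decompose with NFD, drop combining marks except a tilde that directly follows "n" or "N", then recompose with NFC. The function name strip_accents_keep_enye is hypothetical.

```python
import unicodedata

COMBINING_TILDE = "\u0303"


def strip_accents_keep_enye(text):
    """Remove combining marks from text, but keep the tilde of n/N."""
    kept = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Keep the mark only when it is a tilde attached to n or N.
            if ch == COMBINING_TILDE and kept and kept[-1] in ("n", "N"):
                kept.append(ch)
            continue
        kept.append(ch)
    # Recompose so that n + tilde becomes the single code point again.
    return unicodedata.normalize("NFC", "".join(kept))


print(strip_accents_keep_enye("ñaáme"))  # ñaame
```

Unlike the regex in the question, this sketch leaves punctuation and other non-letter characters untouched. Filtering those would be a separate step.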

How to save Excel file as csv with UTF-16 formatting

I am having an issue with Excel not saving my files properly. I have a list of data organized into three columns: String, String, Int. I want to read this file into a Java program to perform some calculations. Exporting from Excel as a .csv file causes me to lose significant data as a result of the native UTF-8 encoding. I can save it as a UTF-16 .txt file; however, I get another annoying result. If I insert columns of commas between the fields, Excel saves the commas with quotes around them! I have seen some solutions to this problem but they do not preserve the UTF-16 encoding. Any help
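One workaround, sketched here under assumptions, is to let Excel write its UTF-16 text export and convert the delimiters afterwards. This sketch assumes the export is tab-delimited UTF-16 with a byte order mark. The function name tsv_utf16_to_csv_utf16 and the file paths are hypothetical.

```python
import csv


def tsv_utf16_to_csv_utf16(src_path, dst_path):
    """Rewrite a tab-delimited UTF-16 file as a comma-separated UTF-16 file."""
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-16", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        # QUOTE_MINIMAL quotes a field only if it contains a comma,
        # a quote character, or a line break.
        writer = csv.writer(dst, delimiter=",", quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            writer.writerow(row)
```

A call such as tsv_utf16_to_csv_utf16("export.txt", "export.csv") would produce a file that a Java reader can open with the UTF-16 charset.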

How can I match emoji with an R regex?

Question: I want to determine which elements of my vector contain emoji:

    x = c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사')
    x
    # [1] "\U0001f602" "no" "\U0001f379" "\U0001f600" "no" "\U0001f61b" "䨺" "감사"

Related posts only cover other languages, and because they mostly refer to specialized libraries, I couldn't figure out a way to translate them to R: What is the regex to extract all the emojis from a string? How do I remove emoji from string replace emoji unicode symbol using regexp in javascript Regular
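The general technique, a character class over code point ranges, can be sketched in Python (the question asks about R, so this only illustrates the idea). The ranges below are a rough assumption covering common pictographic blocks, not the full Unicode emoji definition. The names EMOJI_PATTERN and contains_emoji are hypothetical.

```python
import re

# Approximate ranges: Miscellaneous Symbols and Dingbats (U+2600-U+27BF)
# plus the main pictograph blocks in plane 1 (U+1F300-U+1FAFF).
EMOJI_PATTERN = re.compile("[\u2600-\u27BF\U0001F300-\U0001FAFF]")


def contains_emoji(text):
    """Return True if text contains a code point in the assumed ranges."""
    return EMOJI_PATTERN.search(text) is not None


x = ["😂", "no", "🍹", "😀", "no", "😛", "䨺", "감사"]
print([contains_emoji(item) for item in x])
# [True, False, True, True, False, True, False, False]
```

The CJK ideograph and the Hangul syllables fall outside both ranges, so they are not reported as emoji.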

UTF-16 perl input output

I am writing a script that takes a UTF-16 encoded text file as input and outputs a UTF-16 encoded text file.

    use open "encoding(UTF-16)";
    open INPUT, "< input.txt" or die "cannot open > input.txt: $!\n";
    open(OUTPUT,"> output.txt");
    while(<INPUT>) { print OUTPUT "$_\n" }

Let's just say that my program writes everything from input.txt into output.txt. This works perfectly fine in my Cygwin environment, which is using "This is perl 5, version 14, subversion 2 (v5.14.2) built for cygwin-thread-multi-64int". But in my Windows environment, which is using "This is perl 5, version 12, subversion 3 (v5
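For comparison, the same read-and-write round trip can be sketched in Python. This is not a fix for the Perl script. It only shows the shape of a line-by-line UTF-16 copy in which the encoding is named explicitly on both handles and line endings are passed through untranslated. The function name copy_utf16 is hypothetical.

```python
def copy_utf16(src_path, dst_path):
    """Copy a UTF-16 text file line by line, preserving line endings."""
    # encoding="utf-16" consumes the BOM on read and writes one on output.
    # newline="" disables newline translation in both directions.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-16", newline="") as dst:
        for line in src:
            dst.write(line)
```

Each line already carries its own line ending here, so nothing extra is appended when writing.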

UCS2 vs UTF. What languages can not be displayed in the UCS2 encoding?

UCS2 is easier to use in Visual C++ than a UTF encoding. Which languages can I not support with the UCS2 encoding?

Nothing you're likely to care about or, more to the point, have fonts for. UCS2 gives you the Basic Multilingual Plane; you can find overviews of the assigned planes on the Unicode site:

0 - Basic Multilingual Plane
1 - Supplementary Multilingual Plane (ancient symbols, Klingon, etc.)
2 - Supplementary Ideographic Plane (CJK unified ideographs extensions)
3 - Tertiary Ideographic Plane (ancient Chinese characters)
14 - Supplementary Special-Purpose Plane (tag characters and variations - ?)

Of
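The plane numbers above follow directly from the code point value: the plane is the code point divided by 0x10000. A minimal Python sketch (the function names plane_of and fits_in_ucs2 are hypothetical) shows how to check whether a given string stays within what UCS2 can represent.

```python
def plane_of(code_point):
    """Return the Unicode plane number (0..16) of a code point."""
    return code_point >> 16


def fits_in_ucs2(text):
    """Return True if every character in text lies in the BMP (plane 0)."""
    return all(plane_of(ord(ch)) == 0 for ch in text)


print(plane_of(ord("a")))     # 0
print(plane_of(0x2008A))      # 2
print(fits_in_ucs2("감사"))    # True
print(fits_in_ucs2("😂"))      # False
```

Running fits_in_ucs2 over representative input data is one way to judge whether the restriction matters for a particular application.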

UnicodeDecodeError on byte type

Using Python 3.4, I'm getting the following error when trying to decode a byte type using utf-32:

    Traceback (most recent call last):
      File "c:.\SharqBot.py", line 1130, in <module>
        fullR=s.recv(1024).decode('utf-32').split('\r\n')
    UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: codepoint not in range(0x110000)

and the following when trying to decode it into utf-16:

      File "c:.\SharqBot.py", line 1128, in <module>
        fullR=s.recv(1024).decode('utf-16').split('\r\n')
    UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 374: truncated data

When I decode using
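A "truncated data" error from the UTF-16 codec means the byte string ended in the middle of a code unit, for example because it has an odd number of bytes. That can happen when the data is not UTF-16 at all, or when a socket read splits a code unit across two chunks. For the second case only, a minimal sketch with an incremental decoder follows. It assumes the peer really sends UTF-16-LE, which the excerpt does not confirm. The function name decode_chunks is hypothetical.

```python
import codecs


def decode_chunks(chunks, encoding="utf-16-le"):
    """Decode an iterable of byte chunks that may split code units."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # The decoder buffers an incomplete trailing code unit until the
    # next chunk arrives.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if bytes are still left over at the very end.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "hi\r\n".encode("utf-16-le")
print(repr(decode_chunks([data[:3], data[3:]])))  # 'hi\r\n'
```

Calling data[:3].decode("utf-16-le") directly on the first chunk raises the same UnicodeDecodeError as in the traceback.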