utf-16

Size of wchar_t* for surrogate pair (Unicode character out of BMP) on Windows

核能气质少年 submitted on 2019-12-06 03:24:50
Question: I have encountered an interesting issue on Windows 8. I tested whether I can represent Unicode characters outside the BMP with wchar_t* strings. The following test code produced unexpected results for me:

    const wchar_t* s1 = L"a";
    const wchar_t* s2 = L"\U0002008A"; // The "Han" character
    int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.
    int i2 = sizeof(s1); // i2 == 4, because of the terminating '\0' (I guess).
    int i3 = sizeof(s2); // i3 == 4, why?

The U+2008A is the Han
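
The numbers in the question can be cross-checked outside C++. Below is a minimal Python sketch that counts UTF-16 code units and derives the surrogate pair by hand. The helper names `utf16_code_units` and `surrogate_pair` are hypothetical, and the sketch assumes 16-bit code units, as with wchar_t on Windows.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for text (no BOM, no terminator)."""
    # Each UTF-16 code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


print(utf16_code_units("a"))                        # one code unit
print(utf16_code_units("\U0002008A"))               # two code units (a surrogate pair)
print([hex(u) for u in surrogate_pair(0x2008A)])    # the pair for U+2008A
```

Note that in the C++ snippet, sizeof(s1) and sizeof(s2) are applied to pointers, so both report the pointer size of the build (4 bytes in a 32-bit build) rather than anything about the string contents.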

How to convert a utf16 ushort array to a utf8 std::string?

馋奶兔 submitted on 2019-12-06 01:59:43
Question: Currently I'm writing a plugin which is just a wrapper around an existing library. The plugin's host passes me a UTF-16 string whose code unit type is defined as follows:

    typedef unsigned short PA_Unichar;

The wrapped library accepts only a UTF-8 string, as a const char* or a std::string. I tried writing a conversion function like this:

    std::string toUtf8(const PA_Unichar* data) {
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
        return std::string(convert.to_bytes(static_cast
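
For comparison, the same conversion can be sketched in Python. This is an illustration of the idea, not the plugin's API: the function name `utf16_units_to_utf8` is hypothetical, and the sketch assumes the input is a NUL-terminated sequence of 16-bit code units, like a `const PA_Unichar*`.

```python
import struct


def utf16_units_to_utf8(units) -> bytes:
    """Convert a sequence of UTF-16 code units (ints) to UTF-8 bytes.

    Assumption: a zero code unit terminates the string, as in C.
    """
    trimmed = []
    for unit in units:
        if unit == 0:
            break
        trimmed.append(unit)
    # Pack the code units as little-endian unsigned shorts, then transcode.
    raw = struct.pack("<%dH" % len(trimmed), *trimmed)
    return raw.decode("utf-16-le").encode("utf-8")


# 'a' followed by U+2008A as a surrogate pair, then the terminator.
print(utf16_units_to_utf8([0x0061, 0xD840, 0xDC8A, 0x0000]))
```

Decoding as a whole, rather than unit by unit, is what lets the surrogate pair become a single 4-byte UTF-8 sequence.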

C: Most efficient way to determine how many bytes will be needed for a UTF-16 string from a UTF-8 string

两盒软妹~` submitted on 2019-12-06 00:43:40
Question: I've seen some very clever code out there for converting between Unicode code points and UTF-8, so I was wondering if anybody has (or would enjoy devising) this. Given a UTF-8 string, how many bytes are needed for the UTF-16 encoding of the same string? Assume the UTF-8 string has already been validated. It has no BOM, no overlong sequences, no invalid sequences, and is null-terminated. It is not CESU-8. Full UTF-16 with surrogates must be supported. Specifically, I wonder if there are shortcuts to
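
One possible shortcut is to look only at lead bytes. Under the question's assumption of already-validated UTF-8, every non-continuation byte starts one code point, and only 4-byte sequences (lead byte 0xF0 or above) need a surrogate pair. The sketch below is in Python rather than C to keep it short, and the function name `utf16_bytes_needed` is hypothetical.

```python
def utf16_bytes_needed(utf8: bytes) -> int:
    """Count the bytes of the UTF-16 encoding of validated UTF-8 input.

    The count excludes any BOM and any terminator.
    """
    total = 0
    for byte in utf8:
        if (byte & 0xC0) != 0x80:   # not a continuation byte: a code point starts here
            total += 2              # at least one 16-bit code unit
            if byte >= 0xF0:        # 4-byte UTF-8 sequence: code point above U+FFFF
                total += 2          # second code unit of the surrogate pair
    return total


sample = "a\u00e9\u20ac\U0002008A"  # 1-, 2-, 3- and 4-byte UTF-8 sequences
print(utf16_bytes_needed(sample.encode("utf-8")))
```

The loop never decodes a code point, so a C version would need only one comparison or two per input byte.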

Is there encoding in Unicode where every “character” is just one code point?

落花浮王杯 submitted on 2019-12-06 00:32:02
Trying to rephrase: Can you map every combining character combination into one code point? I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct? Is this also true for the Basic Multilingual Plane?

If you mean one char == one number (i.e., where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but
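
The question's suspicion can be tested with a small Python sketch using the standard `unicodedata` module. NFC normalization composes a base letter and a combining mark only where Unicode defines a precomposed character, so some combinations stay at two code points. The helper name `nfc_length` is hypothetical.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Return the number of code points in text after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


# 'e' + COMBINING ACUTE ACCENT composes to the precomposed U+00E9.
print(nfc_length("e\u0301"))
# 'x' + COMBINING ACUTE ACCENT has no precomposed form, so it stays two code points.
print(nfc_length("x\u0301"))
```

Both examples lie entirely within the Basic Multilingual Plane, so restricting text to the BMP does not change the picture.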

Remove accents in string except “ñ”

对着背影说爱祢 submitted on 2019-12-06 00:14:10
Question: I have the following example code:

    var inputString = "ñaáme";
    inputString = inputString.Replace('ñ', '\u00F1');
    var normalizedString = inputString.Normalize(NormalizationForm.FormD);
    var result = Regex.Replace(normalizedString, @"[^ñÑa-zA-Z0-9\s]*", string.Empty);
    return result.Replace('\u00F1', 'ñ'); // naame :(

I need to normalize the text without removing the "ñ"s. I followed this example, but it's for Java and it has not worked for me. I want the result to be: "ñaame".

Answer 1: You may match
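
The C# snippet above loses the "ñ" because FormD splits it into "n" plus a combining tilde (U+0303) before the regex runs. The following Python sketch shows one way around that, and is not the approach from the truncated answer: decompose, drop every combining mark except a tilde that directly follows "n" or "N", then recompose. The function name `strip_accents_keep_enye` is hypothetical, and unlike the regex it leaves punctuation untouched.

```python
import unicodedata

COMBINING_TILDE = "\u0303"


def strip_accents_keep_enye(text: str) -> str:
    """Remove combining marks from text, except the tilde of n/N."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Keep the mark only when it is a tilde sitting on an n or N.
            if ch == COMBINING_TILDE and kept and kept[-1] in ("n", "N"):
                kept.append(ch)
            continue
        kept.append(ch)
    # Recompose so that n + tilde becomes a single U+00F1 again.
    return unicodedata.normalize("NFC", "".join(kept))


print(strip_accents_keep_enye("\u00f1a\u00e1me"))  # input is "ñaáme"
```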

How to save Excel file as csv with UTF-16 formatting

自闭症网瘾萝莉.ら submitted on 2019-12-05 21:30:12
I am having an issue with Excel not saving my files properly. I have a list of data organized into three columns: String, String, Int. I want to read this file into a Java program to perform some calculations. Exporting from Excel as a .csv file causes me to lose significant data as a result of the native UTF-8 encoding. I can save it as a UTF-16 .txt file; however, I get another annoying result. If I insert columns of commas between the fields, it saves the commas with quotes around them! I have seen some solutions to this problem, but they do not preserve the UTF-16 encoding. Any help
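
One workaround is to post-process the UTF-16 .txt export instead of inserting comma columns in Excel. The Python sketch below assumes the export is tab-delimited UTF-16 with a BOM; that is an assumption about the file, and the function name `tsv_utf16_to_csv_utf16` is hypothetical. It re-emits the rows as comma-separated UTF-16.

```python
import csv
import io


def tsv_utf16_to_csv_utf16(data: bytes) -> bytes:
    """Convert tab-delimited UTF-16 bytes to comma-separated UTF-16 bytes."""
    text = data.decode("utf-16")  # the BOM, if present, selects the byte order
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\r\n")
    writer.writerows(reader)  # fields that contain commas are quoted automatically
    return out.getvalue().encode("utf-16")  # writes a BOM in native byte order


sample = "alpha\tbeta\t1\r\ngamma\tdelta\t2\r\n".encode("utf-16")
print(tsv_utf16_to_csv_utf16(sample).decode("utf-16"))
```

Quotes then appear only around fields that themselves contain a comma, which a CSV parser on the Java side would be expected to undo.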

How can I match emoji with an R regex?

你说的曾经没有我的故事 submitted on 2019-12-05 20:46:03
Question: I want to determine which elements of my vector contain emoji:

    x = c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사')
    x
    # [1] "\U0001f602" "no" "\U0001f379" "\U0001f600" "no" "\U0001f61b" "䨺" "감사"

Related posts only cover other languages, and because they mostly refer to specialized libraries, I couldn't figure out a way to translate them to R:

What is the regex to extract all the emojis from a string?
How do I remove emoji from string
replace emoji unicode symbol using regexp in javascript
Regular
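
The underlying idea, a character class built from code point ranges, can be sketched in Python (not R). The ranges below are a rough approximation chosen for this illustration, not a complete definition of emoji, and the names `EMOJI_RE` and `contains_emoji` are hypothetical.

```python
import re

# Rough approximation: the main supplementary-plane pictograph and emoji blocks,
# plus Miscellaneous Symbols and Dingbats in the BMP.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def contains_emoji(strings):
    """Return one boolean per string: True if it contains a character in the ranges."""
    return [bool(EMOJI_RE.search(s)) for s in strings]


x = ["\U0001F602", "no", "\U0001F379", "\U0001F600", "no", "\U0001F61B", "\u4A3A", "\uAC10\uC0AC"]
print(contains_emoji(x))
```

The CJK ideograph and the Hangul syllables in the sample sit in the BMP outside these ranges, so a range-based class does not confuse them with emoji.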

UTF-16 perl input output

不羁的心 submitted on 2019-12-05 16:53:24
I am writing a script that takes a UTF-16 encoded text file as input and outputs a UTF-16 encoded text file.

    use open "encoding(UTF-16)";
    open INPUT, "< input.txt" or die "cannot open > input.txt: $!\n";
    open(OUTPUT,"> output.txt");
    while(<INPUT>) { print OUTPUT "$_\n" }

Let's just say that my program writes everything from input.txt into output.txt. This works perfectly fine in my cygwin environment, which is using "This is perl 5, version 14, subversion 2 (v5.14.2) built for cygwin-thread-multi-64int". But in my Windows environment, which is using "This is perl 5, version 12, subversion 3 (v5

UCS2 vs UTF. What languages can not be displayed in the UCS2 encoding?

半城伤御伤魂 submitted on 2019-12-05 16:48:08
UCS2 is easier to use in Visual C++ than the UTF encodings. Which languages can I not support in the UCS2 encoding?

Nothing you're likely to care about or, more to the point, have fonts for. UCS2 gives you the Basic Multilingual Plane; you can find overviews of the assigned planes on the Unicode site:

0 - Basic Multilingual Plane
1 - Supplementary Multilingual Plane (ancient symbols, Klingon, etc.)
2 - Supplementary Ideographic Plane (CJK unified ideographs extensions)
3 - Tertiary Ideographic Plane (ancient Chinese characters)
14 - Supplementary Special-Purpose Plane (tag characters and variations - ?)

Of
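
Whether a given piece of text survives in UCS2 can be checked directly: it fits exactly when every code point lies in plane 0. A minimal Python sketch, with the hypothetical helper names `plane_of` and `fits_in_ucs2`:

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number (0 to 16) of a single character."""
    return ord(ch) >> 16


def fits_in_ucs2(text: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane (plane 0)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_ucs2("\u4A3A\uAC10"))   # BMP ideograph and Hangul syllable
print(fits_in_ucs2("\U0002008A"))     # CJK extension ideograph in plane 2
print(plane_of("\U0001F602"))         # an emoji in plane 1
```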

UnicodeDecodeError on byte type

谁都会走 submitted on 2019-12-05 15:41:47
Using Python 3.4, I'm getting the following error when trying to decode a bytes object using utf-32:

    Traceback (most recent call last):
      File "c:.\SharqBot.py", line 1130, in <module>
        fullR=s.recv(1024).decode('utf-32').split('\r\n')
    UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: codepoint not in range(0x110000)

and the following when trying to decode it as utf-16:

      File "c:.\SharqBot.py", line 1128, in <module>
        fullR=s.recv(1024).decode('utf-16').split('\r\n')
    UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 374: truncated data

When I decode using
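
The "truncated data" message is what the utf-16 codec reports when a chunk ends partway through a code unit, which can happen with any fixed-size read. The sketch below shows how an incremental decoder from the standard `codecs` module carries partial code units over to the next chunk. It does not establish which encoding the socket data in the question actually uses, and the function name `decode_chunks` is hypothetical.

```python
import codecs


def decode_chunks(chunks, encoding="utf-16-le"):
    """Decode an iterable of byte chunks whose boundaries may split code units."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True makes the decoder raise if bytes are still left over at the end.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "hi\r\n\U0001F602".encode("utf-16-le")
# Split at odd offsets, including inside the surrogate pair.
print(decode_chunks([data[:3], data[3:9], data[9:]]) == "hi\r\n\U0001F602")
```

Calling `data[:3].decode("utf-16-le")` directly raises a UnicodeDecodeError instead, because the third byte is half of a code unit.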