utf-16

Struggling to convert vector<char> to wstring

Deadly submitted on 2019-12-08 01:30:17
Question: I need to convert UTF-16 text to UTF-8. The actual conversion code is simple: std::wstring in(...); std::string out = boost::locale::conv::utf_to_utf<char, wchar_t>(in); However, the issue is that the UTF-16 is read from a file and it may or may not contain a BOM. My code needs to be portable (the minimum is Windows/OSX/Linux). I'm really struggling to figure out how to create a wstring from the byte sequence. EDIT: this is not a duplicate of the linked question, as in that question the OP needs to
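A minimal Python sketch of the BOM-detection logic this question is after (the function name and the no-BOM fallback are my own choices; RFC 2781 says to assume big-endian when no BOM is present):

```python
def utf16_bytes_to_str(data: bytes) -> str:
    """Decode UTF-16 bytes, honouring a BOM if present."""
    if data[:2] == b'\xff\xfe':
        return data[2:].decode('utf-16-le')   # little-endian BOM
    if data[:2] == b'\xfe\xff':
        return data[2:].decode('utf-16-be')   # big-endian BOM
    # No BOM: RFC 2781 says to assume big-endian in that case
    return data.decode('utf-16-be')
```

Note that Python's plain 'utf-16' codec also consumes a BOM automatically, but defaults to the platform-unfriendly little-endian when none is present, which is why the explicit branches are spelled out here.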

Is there encoding in Unicode where every “character” is just one code point?

半城伤御伤魂 submitted on 2019-12-07 18:50:17
Question: Trying to rephrase: can you map every combining character combination onto one code point? I'm new to Unicode, but it seems to me that there is no encoding, normalization, or representation in which one character would always be one code point in Unicode. Is this correct? Is this true for the Basic Multilingual Plane also? Answer 1: If you mean one char == one number (i.e., where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4
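The questioner's suspicion can be confirmed in Python (my own illustration, not from the thread): NFC normalization composes a sequence only when a precomposed code point exists, so not every combining sequence collapses to one code point:

```python
import unicodedata

# 'e' + combining acute has a precomposed form, U+00E9, so NFC merges it
assert unicodedata.normalize('NFC', 'e\u0301') == '\u00e9'
assert len(unicodedata.normalize('NFC', 'e\u0301')) == 1

# 'x' + combining acute has no precomposed form, so it stays two code points
assert len(unicodedata.normalize('NFC', 'x\u0301')) == 2
```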

UTF-16 perl input output

99封情书 submitted on 2019-12-07 12:29:43
Question: I am writing a script that takes a UTF-16 encoded text file as input and outputs a UTF-16 encoded text file. use open "encoding(UTF-16)"; open INPUT, "< input.txt" or die "cannot open input.txt: $!\n"; open(OUTPUT,"> output.txt"); while(<INPUT>) { print OUTPUT "$_\n" } Let's just say that my program writes everything from input.txt into output.txt. This WORKS perfectly fine in my cygwin environment, which is using "This is perl 5, version 14, subversion 2 (v5.14.2) built for cygwin-thread
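For comparison, the same round-trip in Python (my own sketch, not from the thread): the 'utf-16' codec strips the BOM on input and emits one on output, so the copy stays well-formed UTF-16 without any manual byte handling:

```python
import io

def copy_utf16(src_path: str, dst_path: str) -> None:
    # The "utf-16" codec consumes the BOM when reading and writes one when writing
    with io.open(src_path, encoding='utf-16') as src, \
         io.open(dst_path, 'w', encoding='utf-16') as dst:
        for line in src:
            dst.write(line)
```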

UTF16 hex to text

给你一囗甜甜゛ submitted on 2019-12-07 12:05:28
Question: I have a UTF-16 hex representation such as "0633064406270645", which is "سلام" in the Arabic language. I would like to convert it to its text equivalent. Is there a straightforward way to do that in PostgreSQL? I can convert from a UTF-8 hex representation as below; unfortunately it seems UTF16 is not supported. Any ideas on how to do it in PostgreSQL? Worst case, I will write a function. SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8'); "سلام" SELECT convert_from (decode (E'0633064406270645', 'hex'),
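Outside the database, the conversion is a one-liner; a Python sketch (function name mine) that treats each group of four hex digits as one big-endian UTF-16 code unit:

```python
def utf16be_hex_to_text(hexstr: str) -> str:
    # Each group of four hex digits is one big-endian UTF-16 code unit
    return bytes.fromhex(hexstr).decode('utf-16-be')
```

Applied to the questioner's example, utf16be_hex_to_text('0633064406270645') yields "سلام" (U+0633 U+0644 U+0627 U+0645).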

UnicodeDecodeError on byte type

限于喜欢 submitted on 2019-12-07 10:45:31
Question: Using Python 3.4, I'm getting the following error when trying to decode a bytes object using utf-32: Traceback (most recent call last): File "c:.\SharqBot.py", line 1130, in <module> fullR=s.recv(1024).decode('utf-32').split('\r\n') UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: codepoint not in range(0x110000) and the following when trying to decode it using utf-16: File "c:.\SharqBot.py", line 1128, in <module> fullR=s.recv(1024).decode('utf-16').split('\r\n')
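That error usually means the received bytes are not UTF-32 at all: without a BOM the utf-32 codec assumes little-endian (hence 'utf-32-le' in the traceback) and reads each 4-byte group as a single code point, which quickly exceeds the U+10FFFF limit. A small sketch of the failure mode (the UTF-8 payload is my own assumption about what such a server actually sends):

```python
data = 'hi\r\n'.encode('utf-8')   # assumed payload; most servers send UTF-8/ASCII
caught = False
try:
    data.decode('utf-32')         # b'hi\r\n' read as one LE unit is 0x0A0D6968
except UnicodeDecodeError:
    caught = True                 # 0x0A0D6968 > 0x10FFFF, so decoding fails

assert caught
assert data.decode('utf-8') == 'hi\r\n'   # the right codec works fine
```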

UCS2 vs UTF. What languages can not be displayed in the UCS2 encoding?

不想你离开。 submitted on 2019-12-07 07:30:59
Question: UCS2 is easier to use in Visual C++ than the UTF encodings. Which languages can I not support with the UCS2 encoding? Answer 1: Nothing you're likely to care about or, more to the point, have fonts for. UCS2 gives you the Basic Multilingual Plane; you can find overviews of the assigned planes on the Unicode site: 0 - Basic Multilingual Plane; 1 - Supplementary Multilingual Plane (ancient symbols, Klingon, etc.); 2 - Supplementary Ideographic Plane (CJK unified ideograph extensions); 3 - Tertiary Ideographic Plane
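A quick Python check (my own illustration) of what falls outside UCS-2: any code point above U+FFFF, such as emoji in Plane 1, needs a surrogate pair in UTF-16 and cannot be represented in fixed 16-bit UCS-2 at all:

```python
# U+1F600 (grinning face) lies in Plane 1, beyond the Basic Multilingual Plane
ch = '\U0001F600'
assert ord(ch) > 0xFFFF                       # does not fit in one 16-bit unit

# UTF-16 spends two 16-bit code units (a surrogate pair) on it
assert len(ch.encode('utf-16-le')) == 4

# A BMP character such as U+20AC (euro sign) needs only one unit
assert len('\u20ac'.encode('utf-16-le')) == 2
```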

wchar_t for UTF-16 on Linux?

落花浮王杯 submitted on 2019-12-07 04:54:34
Question: Does it make any sense to store UTF-16 encoded text using wchar_t* on Linux? The obvious problem is that wchar_t is four bytes on Linux, while UTF-16 usually takes two bytes (or sometimes two groups of two bytes) per character. I'm trying to use a third-party library that does exactly that, and it seems very confusing. It looks like things are messed up because on Windows wchar_t is two bytes, but I just want to double-check, since it's a pretty expensive commercial library and maybe I just don't
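The size mismatch is easy to see from the encodings themselves; a Python illustration (my own, not from the thread) of why 16-bit UTF-16 code units and the 4-byte wchar_t slots Linux provides don't line up one-to-one:

```python
s = 'héllo\U0001F600'          # six code points, the last one outside the BMP
u16 = s.encode('utf-16-le')
u32 = s.encode('utf-32-le')    # what a 4-byte-wchar_t string holds on Linux

assert len(u32) == 6 * 4       # UTF-32: always one 4-byte unit per code point
assert len(u16) == 7 * 2       # UTF-16: 5 BMP units plus a 2-unit surrogate pair
```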

Locale: character sets.

◇◆丶佛笑我妖孽 submitted on 2019-12-07 02:56:16
char can be used for all character sets of 8 bits or fewer, e.g. US-ASCII, ISO-Latin-1, ISO-Latin-9, and UTF-8. char16_t can be used for UCS-2, and also for UTF-16 code units. char32_t can be used for UCS-4/UTF-32. wchar_t is usually equivalent to either char16_t or char32_t. US-ASCII: a 7-bit character set, standardized in 1963, used for teletypes and other devices; its first 32 characters are non-printable control codes. ISO-Latin-1 (ISO-8859-1): an 8-bit character set, standardized in 1987, providing all the characters of the Western European languages; it is also the basis of all the character sets below. UCS-2: a fixed-width 16-bit (2-byte) character set providing the 65,536 most important characters of the Universal Character Set (Unicode). UTF-8: a multi-byte encoding that uses one to four 8-bit values to represent every character of the Universal Character Set (Unicode); it is widely used on the World Wide Web. UTF-16: also a multi-byte encoding, using one or two 16-bit code units to represent the Universal
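The byte widths listed above are easy to verify; a Python spot check (my own, using A, é, €, and an emoji as the one-, two-, three-, and four-byte UTF-8 examples):

```python
# UTF-8 spends 1-4 bytes per character, depending on the code point
for text, utf8_len in (('A', 1), ('é', 2), ('€', 3), ('\U0001F600', 4)):
    assert len(text.encode('utf-8')) == utf8_len

# UTF-16: one 2-byte unit inside the BMP, two units (4 bytes) outside it
assert len('€'.encode('utf-16-le')) == 2
assert len('\U0001F600'.encode('utf-16-le')) == 4
```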

Reading a UTF-16 CSV file by char

淺唱寂寞╮ submitted on 2019-12-06 14:48:15
Currently I am trying to read a UTF-16 encoded CSV file char by char and convert each char into ASCII so I can process it. I later plan to change my processed data back to UTF-16, but that is beside the point right now. I know right off the bat that I am doing this completely wrong, as I have never attempted anything like this before: int main(void) { FILE *fp; int ch; if(!(fp = fopen("x.csv", "r"))) return 1; while(ch != EOF) { ch = fgetc(fp); ch = (wchar_t) ch; ch = (char) ch; printf("%c", ch); } fclose(fp); return 0; } Wishful thinking: I was hoping that would work by magic for some reason, but
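The usual fix is to let a codec layer decode whole UTF-16 code units instead of pulling single bytes with fgetc() (which also reads ch before it is initialized above); a Python sketch of that approach (function name mine):

```python
import io

def read_utf16_csv(path: str) -> list:
    # The codec layer handles the BOM, endianness, and surrogate pairs;
    # each returned line is already decoded text, ready for processing
    with io.open(path, encoding='utf-16') as f:
        return f.read().splitlines()
```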

Java String internal representation

断了今生、忘了曾经 submitted on 2019-12-06 13:44:05
I understand that the internal representation of String in Java is UTF-16. I also know that in a UTF-16 string, each 'character' is encoded with one or two 16-bit code units. However, when I debug the following Java code: String hello = "Hello"; the variable hello is an array of 5 bytes 0x48, 0x101, 0x108, 0x108, 0x111, which is ASCII for "Hello". How can this be? I took a gcore dump of a mini Java process with this code: class Hi { public static void main(String args[]) { String hello = "Hello"; try { Thread.sleep(60_000); } catch (InterruptedException e) {
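The listed values look like decimal character codes carrying a stray 0x prefix (101 is 'e', 108 is 'l', 111 is 'o'), and since Java 9 the JVM's compact strings (JEP 254) store Latin-1-only strings as one byte per char, which would explain a heap dump showing "Hello" as five bytes. A Python check of the codes themselves (my own illustration, not from the thread):

```python
hello = "Hello"
units = [ord(c) for c in hello]

# 'H' is 0x48 (72); the remaining values match the question's list read as decimal
assert units == [0x48, 101, 108, 108, 111]

# Every code fits in one byte, so a compact Latin-1 layout needs only 5 bytes
assert all(u < 0x100 for u in units)
```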