utf-16

Is there a Rust library with a UTF-16 string type? (intended for writing a JavaScript interpreter)

感情迁移 submitted on 2019-12-08 19:53:11
Question: For most programs, it's better to use UTF-8 internally and, when necessary, convert to other encodings. But in my case, I want to write a JavaScript interpreter, and it's much simpler to store only UTF-16 strings (or arrays of u16), because I need to address 16-bit code units individually (this is a bad idea in general, but JavaScript requires it). This means the type needs to implement Index<usize>. I also need to store unpaired surrogates, that is, malformed UTF-16 strings (because of this, …
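The question cuts off here, but a minimal sketch of the data structure it describes is straightforward (the Wtf16String name and everything below are hypothetical, not a published crate): a thin wrapper over Vec<u16> that implements Index<usize> and happily stores unpaired surrogates.

use std::ops::Index;

// Hypothetical type: raw 16-bit code units, tolerant of unpaired surrogates.
struct Wtf16String {
    units: Vec<u16>,
}

impl Wtf16String {
    fn from_units(units: Vec<u16>) -> Self {
        Wtf16String { units }
    }
    fn len(&self) -> usize {
        self.units.len()
    }
}

impl Index<usize> for Wtf16String {
    type Output = u16;
    fn index(&self, i: usize) -> &u16 {
        &self.units[i]
    }
}

fn main() {
    // 0xD800 is an unpaired high surrogate: invalid UTF-16,
    // but a JavaScript string must be able to hold it.
    let s = Wtf16String::from_units(vec![0x0048, 0x0069, 0xD800]);
    assert_eq!(s[2], 0xD800);
    println!("{} code units", s.len());
}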

XML Spec and UTF-16

雨燕双飞 submitted on 2019-12-08 17:19:00
Question: Section 4.3.3 and Appendix F of the XML 1.0 spec discuss UTF-16, the byte order mark (BOM) in UTF-16 encoded data streams, and the XML encoding declaration. From the information in those sections, it would seem that a byte order mark is required in UTF-16 documents. But the summary chart in Appendix F gives a scenario where a UTF-16 input does not have a byte order mark, although that scenario does have an XML declaration. According to section 4.3.3, a UTF-16 encoded document does not require an …
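As an illustration of the detection order the spec discussion implies (a C++ sketch of my own, not from the question): a parser looks for a BOM first, and only falls back to the encoding declaration or external metadata when none is present.

#include <iostream>
#include <string>

enum class Utf16Bom { BigEndian, LittleEndian, None };

// Classify the first two bytes of a document as a UTF-16 BOM.
Utf16Bom detect_bom(const std::string& bytes) {
    if (bytes.size() >= 2) {
        const auto b0 = static_cast<unsigned char>(bytes[0]);
        const auto b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFE && b1 == 0xFF) return Utf16Bom::BigEndian;
        if (b0 == 0xFF && b1 == 0xFE) return Utf16Bom::LittleEndian;
    }
    // No BOM: fall back to the XML encoding declaration
    // or to out-of-band information.
    return Utf16Bom::None;
}

int main() {
    const std::string doc("\xFF\xFE<\0?\0", 6); // UTF-16LE "<?" with a BOM
    std::cout << (detect_bom(doc) == Utf16Bom::LittleEndian ? "UTF-16LE\n" : "no BOM\n");
}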

Can wprintf output be properly redirected to UTF-16 on Windows?

心不动则不痛 submitted on 2019-12-08 14:52:23
Question: In a C program I'm using wprintf to print Unicode (UTF-16) text in a Windows console. This works fine, but when the output of the program is redirected to a log file, the log file ends up with a corrupted UTF-16 encoding. When redirection is done in a Windows Command Prompt, every line break is encoded as a narrow ASCII line break (0D 0A). When redirection is done in PowerShell, null characters are inserted. Is it possible to redirect the output to a proper UTF-16 log file? Example program: #include …
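One commonly suggested approach for the Command Prompt case (a sketch using the Microsoft CRT's _setmode with _O_U16TEXT; Windows-specific, and PowerShell's own re-encoding of redirected output may still interfere): switch stdout to UTF-16 text mode before calling wprintf.

#include <fcntl.h>   /* _O_U16TEXT */
#include <io.h>      /* _setmode, _fileno */
#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* Put stdout into UTF-16 text mode so the CRT emits wide characters
       as UTF-16LE code units instead of converting them byte-by-byte. */
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"caf\x00e9\n");
    return 0;
}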

C++ UTF-16 to char conversion (Linux/Ubuntu)

主宰稳场 submitted on 2019-12-08 11:13:04
Question: I am trying to help a friend with a project that was supposed to take one hour and has now taken three days. Needless to say, I feel very frustrated and angry ;-) ooooouuuu... I breathe. So, the program, written in C++, just reads a bunch of files and processes them. The problem is that my program reads files which use a UTF-16 encoding (because the files contain words written in different languages), and a simple use of ifstream just doesn't seem to work (it reads and outputs garbage). It took me a while to …
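A sketch of one classic fix (std::codecvt_utf16 is deprecated since C++17 but still widely available; the filename and the little-endian assumption are mine): imbue a wide stream with a UTF-16 facet so that reading through the stream stops producing garbage.

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>

int main() {
    std::wifstream in("input.txt", std::ios::binary);
    // Decode UTF-16LE and swallow the BOM if one is present.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF,
            std::codecvt_mode(std::little_endian | std::consume_header)>));
    for (wchar_t ch; in.get(ch); )
        std::wcout << ch;
}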

C++: How to support surrogate characters in UTF-8

限于喜欢 submitted on 2019-12-08 10:17:20
Question: We have an application that is written with UTF-8 as its base encoding, and it supports the UTF-8 BMP range (up to 3 bytes per character). However, there is a requirement that it needs to support surrogate pairs. I have read somewhere that surrogate characters are not supported in UTF-8. Is that true? If so, what are the steps to give my application a default encoding of UTF-16 rather than UTF-8? I don't have a code snippet, as the entire application was written with UTF-8 in mind, not surrogate characters.
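The key fact behind this question is that surrogate pairs exist only in UTF-16: UTF-8 encodes a supplementary character directly in 4 bytes, so the application does not need to switch to UTF-16, only to handle the 4-byte form. A sketch (function names are mine) of combining a surrogate pair into a code point and emitting its UTF-8:

#include <cstdint>
#include <cstdio>
#include <string>

// code point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
uint32_t combine_surrogates(uint16_t high, uint16_t low) {
    return 0x10000u + ((uint32_t(high) - 0xD800u) << 10) + (uint32_t(low) - 0xDC00u);
}

// Encode a Unicode code point as 1-4 UTF-8 bytes.
std::string utf8_encode(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += char(cp);
    } else if (cp < 0x800) {
        out += char(0xC0 | (cp >> 6));
        out += char(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += char(0xE0 | (cp >> 12));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    } else {
        out += char(0xF0 | (cp >> 18));
        out += char(0x80 | ((cp >> 12) & 0x3F));
        out += char(0x80 | ((cp >> 6) & 0x3F));
        out += char(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    uint32_t cp = combine_surrogates(0xD83D, 0xDE00); // U+1F600
    std::printf("U+%X -> %zu UTF-8 bytes\n", cp, utf8_encode(cp).size());
}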

Reading a UTF-16 CSV file by char

流过昼夜 submitted on 2019-12-08 09:07:46
Question: Currently I am trying to read a UTF-16 encoded CSV file char by char and convert each char into ASCII so I can process it. I later plan to change my processed data back to UTF-16, but that is beside the point right now. I know right off the bat I am doing this completely wrong, as I have never attempted anything like this before:

int main(void) {
    FILE *fp;
    int ch;
    if (!(fp = fopen("x.csv", "r")))
        return 1;
    while (ch != EOF) {
        ch = fgetc(fp);
        ch = (wchar_t) ch;
        ch = (char) ch;
        printf("%c", ch); …
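Beyond the truncation, the loop above reads one byte at a time and tests ch before it is ever assigned. A corrected sketch (UTF-16LE with a BOM assumed; the "x.csv" name kept from the question): read two bytes per code unit and keep only the ASCII range.

#include <stdio.h>

int main(void) {
    FILE *fp = fopen("x.csv", "rb");  /* binary mode: no newline translation */
    if (!fp)
        return 1;

    int lo, hi;
    while ((lo = fgetc(fp)) != EOF && (hi = fgetc(fp)) != EOF) {
        unsigned unit = (unsigned)lo | ((unsigned)hi << 8); /* little-endian pair */
        if (unit == 0xFEFF) continue;         /* skip the byte order mark */
        if (unit < 0x80) putchar((int)unit);  /* "convert to ASCII": keep the ASCII subset */
        /* non-ASCII units and surrogate pairs would need real handling here */
    }
    fclose(fp);
    return 0;
}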

Why does Rails 3 think \xE2\x80\x89 means â \x80 \x89

喜夏-厌秋 submitted on 2019-12-08 08:14:57
Question: I have a field scraped from a UTF-8 page: "O’Reilly". It is saved in a yml file as :name: "O\xE2\x80\x99Reilly" (\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe). However, when I load the value into a hash and yield it to a page tagged as UTF-8, I get: OâReilly. I looked up the character â, which is encoded in UTF-16 as 0x00E2, and the characters \x80 and \x89 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 …
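A small demonstration of the mechanism (in Python purely to illustrate; the Rails side involves the same bytes): the three bytes \xE2\x80\x99 are a single character in UTF-8, but decoded one byte at a time as Latin-1 they become "â" followed by two invisible control characters, exactly the observed "OâReilly".

# The value from the yml file, as raw bytes.
raw = b"O\xe2\x80\x99Reilly"

print(raw.decode("utf-8"))    # O’Reilly (correct)
print(raw.decode("latin-1"))  # 'Oâ\x80\x99Reilly' -> renders as "OâReilly"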

Parsing a CSV file with English and Hindi characters in Python

痞子三分冷 submitted on 2019-12-08 03:47:17
Question: I am trying to parse a CSV file which has both English and Hindi characters, and I am using UTF-16. It works fine, but as soon as it hits the Hindi characters it fails. I am at a loss here. Here's the code:

import csv
import codecs

csvReader = csv.reader(codecs.open('/home/kuberkaul/Downloads/csv.csv', 'rb', 'utf-16'))
for row in csvReader:
    print row

The error that I get is:

Traceback (most recent call last):
  File "csvreader.py", line 8, in <module>
    for row in csvReader:
UnicodeEncodeError: …
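The snippet above is Python 2, whose csv module works on bytes and chokes on the unicode objects that codecs.open yields. A sketch of the usual Python 3 fix (the path is the one from the question): let open() do the UTF-16 decoding, BOM included, and hand csv plain text.

import csv

with open('/home/kuberkaul/Downloads/csv.csv', encoding='utf-16', newline='') as f:
    for row in csv.reader(f):
        print(row)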

Java String internal representation

空扰寡人 submitted on 2019-12-08 03:02:15
Question: I understand that the internal representation Java uses for String is UTF-16. I also know that in a UTF-16 string, each 'character' is encoded with one or two 16-bit code units. However, when I debug the following Java code

String hello = "Hello";

the variable hello is an array of 5 bytes, 0x48, 0x65, 0x6C, 0x6C, 0x6F, which is ASCII for "Hello". How can this be?

Answer 1: I took a gcore dump of a mini Java process with this code:

class Hi {
    public static void …
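Two facts resolve the puzzle, sketched below: the String API counts 16-bit code units, and since JDK 9 the "compact strings" optimization stores Latin-1-only text as one byte per character internally, which is why a memory dump of "Hello" shows plain ASCII bytes.

public class Utf16Demo {
    public static void main(String[] args) {
        String hello = "Hello";
        String emoji = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        // length() and charAt() count UTF-16 code units, not characters.
        System.out.println(hello.length());                          // 5 code units
        System.out.println(emoji.length());                          // 2 code units
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
        System.out.printf("U+%X%n", emoji.codePointAt(0));           // U+1F600
    }
}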

How to convert UTF-8 to std::string?

喜欢而已 submitted on 2019-12-08 01:55:14
Question: I am working on code which receives a cpprest SDK response containing a base64-encoded payload which is JSON. Here is my code snippet:

typedef std::wstring string_t; // defined in basic_types.h in the cpprest lib

void demo() {
    http_response response;
    // code to handle the response ...
    json::value output = response.extract_json();
    string_t payload = output.at(L"payload").as_string();
    vector<unsigned char> base64_encoded_payload = conversions::from_base64(payload);
    std::string utf8_payload(base64 …
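The snippet cuts off at the last step. A minimal sketch of the likely continuation (names follow the snippet; the payload is assumed to be UTF-8 text after base64 decoding): since std::string is just a byte container, the decoded bytes can be copied in directly, with no further conversion.

#include <string>
#include <vector>

// Hypothetical helper: wrap decoded UTF-8 bytes in a std::string as-is.
std::string bytes_to_utf8_string(const std::vector<unsigned char>& bytes) {
    return std::string(bytes.begin(), bytes.end());
}

For converting a string_t (std::wstring on Windows) itself rather than decoded bytes, cpprest also provides utility::conversions::to_utf8string.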