utf-16

UTF-16 to ASCII conversion in Java

。_饼干妹妹 submitted on 2019-11-27 02:26:59
Question: Having ignored it all this time, I am currently forcing myself to learn more about Unicode in Java. There is an exercise I need to do about converting a UTF-16 string to 8-bit ASCII. Can someone please enlighten me how to do this in Java? I understand that you can't represent all possible Unicode values in ASCII, so in this case I want a code that exceeds 0xFF to be merely added anyway (bad data should also just be added silently). Thanks! Answer 1: How about this: String input = ... // my UTF-16
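The "merely added anyway" behavior amounts to keeping only the low byte of each UTF-16 code unit. A minimal sketch of that lossy narrowing (the class and method names are mine, not from the answer):

```java
public class Utf16ToAscii {
    // Keep only the low 8 bits of each UTF-16 code unit; anything above
    // 0xFF is silently mangled, which is what the question permits.
    static byte[] toEightBit(String input) {
        byte[] out = new byte[input.length()];
        for (int i = 0; i < input.length(); i++) {
            out[i] = (byte) input.charAt(i); // truncates to the low byte
        }
        return out;
    }

    public static void main(String[] args) {
        for (byte x : toEightBit("ABÈ")) {
            System.out.printf("%02X ", x & 0xFF); // 41 42 C8
        }
        System.out.println();
    }
}
```

Note this works per UTF-16 code unit, so a supplementary character (a surrogate pair) becomes two garbage bytes, which is acceptable here since bad data may be added silently.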

How do I encode/decode UTF-16LE byte arrays with a BOM?

无人久伴 submitted on 2019-11-27 02:12:16
Question: I need to encode/decode UTF-16 byte arrays to and from java.lang.String . The byte arrays are given to me with a Byte Order Marker (BOM), and I need to encode byte arrays with a BOM. Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little-endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work big-endian, but I don't want to swim upstream in the Windows world. As an example, here is a method which
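Java's "UTF-16LE" charset neither writes nor consumes a BOM on its own, so a sketch along the lines the question describes has to handle the FF FE marker manually (the class and method names are mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Encode a String as UTF-16LE with an explicit FF FE BOM, and decode
// such a byte array back, stripping the BOM when it is present.
public class Utf16LeBom {
    static byte[] encode(String s) {
        byte[] body = s.getBytes(StandardCharsets.UTF_16LE); // no BOM added
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;                                // little-endian BOM
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    static String decode(byte[] b) {
        int off = (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) ? 2 : 0;
        return new String(b, off, b.length - off, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] enc = encode("Hi");
        System.out.println(Arrays.toString(Arrays.copyOfRange(enc, 0, 4))); // [-1, -2, 72, 0]
        System.out.println(decode(enc)); // Hi
    }
}
```

An alternative is decoding with the "UTF-16" charset, which consumes a leading BOM of either byte order; the manual version above just makes the little-endian intent explicit.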

Correctly reading a utf-16 text file into a string without external libraries?

旧城冷巷雨未停 submitted on 2019-11-27 01:22:13
I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here: I have a pretty standard utf-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to
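The question asks for C++; for comparison with the Java threads on this page, the same read needs only the standard library in Java, whose "UTF-16" charset uses the file's BOM to pick the byte order (file name and sample text here are mine):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Write a UTF-16 file with a BOM, then read it back into a String.
// The "UTF-16" decoder consumes the BOM and selects the endianness.
public class ReadUtf16 {
    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("demo", ".txt");
        Files.write(p, "\uFEFFhello 汉字".getBytes(StandardCharsets.UTF_16BE));
        String s = new String(Files.readAllBytes(p), StandardCharsets.UTF_16);
        System.out.println(s); // hello 汉字 (BOM not included in the result)
    }
}
```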

grepping binary files and UTF16

自作多情 submitted on 2019-11-27 01:05:43
Standard grep / pcregrep etc. can conveniently be used with binary files for ASCII or UTF-8 data - is there a simple way to make them try UTF-16 too (preferably simultaneously, but a separate pass will do)? The data I'm trying to find is all ASCII anyway (references in libraries etc.); it just doesn't get found, as sometimes there's a 00 between any two characters, and sometimes there isn't. I don't see any way to get it done semantically, but these 00s should do the trick, except I cannot easily use them on the command line. The easiest way is to just convert the text file to UTF-8 and pipe that to grep: iconv -f

javascript and string manipulation w/ utf-16 surrogate pairs

泪湿孤枕 submitted on 2019-11-27 00:43:40
Question: I'm working on a Twitter app and just stumbled into the world of UTF-8(16). It seems the majority of JavaScript string functions are as blind to surrogate pairs as I was. I've got to recode some stuff to make it wide-character aware. I've got this function to parse strings into arrays while preserving the surrogate pairs. Then I'll recode several functions to deal with the arrays rather than strings.

function sortSurrogates(str){
  var cp = []; // array to hold code points
  while(str.length){ //
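Java strings are UTF-16 as well, so the code-point-aware splitting that the question's sortSurrogates() performs can be sketched with Java's built-in code-point API (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

// Split a string into code points, keeping each surrogate pair together
// as a single element, analogous to the question's sortSurrogates().
public class Surrogates {
    static List<String> toCodePoints(String str) {
        List<String> cp = new ArrayList<>();
        for (int i = 0; i < str.length(); ) {
            int c = str.codePointAt(i);           // joins a surrogate pair
            cp.add(new String(Character.toChars(c)));
            i += Character.charCount(c);          // advance 1 or 2 units
        }
        return cp;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";              // a, U+1F600 (emoji), b
        System.out.println(toCodePoints(s).size()); // 3 code points
        System.out.println(s.length());             // 4 UTF-16 code units
    }
}
```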

Manually converting unicode codepoints into UTF-8 and UTF-16

允我心安 submitted on 2019-11-26 23:51:08
I have a university programming exam coming up, and one section is on Unicode. I have checked all over for answers to this, and my lecturer is useless, so that's no help; this is a last resort for you guys to possibly help. The question will be something like: the string 'mЖ丽' has the Unicode code points U+006D, U+0416 and U+4E3D; with answers written in hexadecimal, manually encode the string into UTF-8 and UTF-16. Any help at all will be greatly appreciated, as I am trying to get my head around this. Wow. On the one hand I'm thrilled to know that university courses are teaching to the
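A quick way to check the manual work is to encode the same string with the JDK and print the bytes in hex: U+0416 should become D0 96 in UTF-8, U+4E3D should become E4 B8 BD, and UTF-16BE is just the code points themselves, since all three fall in the Basic Multilingual Plane. A small checker (names are mine):

```java
import java.nio.charset.StandardCharsets;

// Encode "mЖ丽" (U+006D U+0416 U+4E3D) with the JDK's encoders and show
// the resulting bytes in hex, for comparison against hand encoding.
public class ManualEncoding {
    static String hex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02X ", x & 0xFF));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "m\u0416\u4E3D";
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));    // 6D D0 96 E4 B8 BD
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 6D 04 16 4E 3D
    }
}
```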

Inno Setup Pascal Script - Reading UTF-16 file

匆匆过客 submitted on 2019-11-26 23:24:48
Question: I have an .inf file exported from Resource Hacker. The file is in UTF-16 LE encoding.

EXTRALARGELEGENDSII_INI TEXTFILE "Data.bin"
LARGEFONTSLEGENDSII_INI TEXTFILE "Data_2.bin"
NORMALLEGENDSII_INI TEXTFILE "Data_3.bin"
THEMES_INI TEXTFILE "Data_4.bin"

When I load it using the LoadStringFromFile function:

procedure LoadResources;
var
  RESOURCE_INFO: AnsiString;
begin
  LoadStringFromFile(ExpandConstant('{tmp}\SKINRESOURCE - INFO.inf'), RESOURCE_INFO);
  Log(String(RESOURCE_INFO));
end;

I am getting

Byte and char conversion in Java

我们两清 submitted on 2019-11-26 22:21:23
If I convert a character to byte and then back to char, that character mysteriously disappears and becomes something else. How is this possible? This is the code:

char a = 'È'; // line 1
byte b = (byte)a; // line 2
char c = (char)b; // line 3
System.out.println((char)c + " " + (int)c);

Until line 2 everything is fine: in line 1 I could print "a" in the console and it would show "È". In line 2 I could print "b" in the console and it would show -56, that is 200, because byte is signed. And 200 is "È". So it's still fine. But what's wrong in line 3? "c" becomes something else and the program
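The culprit is sign extension: 'È' is U+00C8 (200), the cast to byte stores it as -56, and casting that negative byte straight back to char widens it to 0xFFC8 rather than 0x00C8. Masking with 0xFF before the cast restores the character, as this sketch shows:

```java
// Demonstrates why (char)(byte)'È' is not 'È', and the 0xFF-mask fix.
public class ByteCharRoundTrip {
    public static void main(String[] args) {
        char a = 'È';                    // U+00C8, value 200
        byte b = (byte) a;               // -56 (0xC8 as a signed byte)
        char broken = (char) b;          // sign-extended to 0xFFC8: wrong
        char fixed  = (char) (b & 0xFF); // 0x00C8: 'È' again
        System.out.println((int) broken); // 65480
        System.out.println(fixed + " " + (int) fixed); // È 200
    }
}
```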

The Story of Strings in the D Language

非 Y 不嫁゛ submitted on 2019-11-26 22:15:09
Prerequisite knowledge: the basics of character encodings.

I really cannot stand Andrei's storytelling, so I have decided to explain strings in the D language along my own lines, and in passing fulfill an earlier promise. Part of the material in this article comes from the strings chapter of The D Programming Language.

Text processing is so important that most high-level programming languages give it special treatment, and D is no exception. Before getting to the point, let's go over some background on text processing.

Long ago, most computers mainly spoke English. To ease interchange, ANSI defined a common character set of 128 characters, covering the upper- and lowercase English letters, the Arabic numerals, punctuation, and control characters, each mapped one-to-one to a number. This is the famous ASCII, shown in Figure 1.

Figure 1: ASCII code table

In a binary computer environment, 128 characters need only 7 of a byte's 8 bits, which gave rise to 7-bit encoding formats.

Later, as computers came into wider use, they had to support more languages, and one glance at Figure 1 shows that ASCII simply cannot represent most languages other than English. To break this limit, people started exploiting the 8th (reserved) bit of ASCII, and the nightmare began.

For compatibility with ASCII, the characters at code values 32~127 (0x20~0x7F) were kept unchanged, while 128 extended characters were added at code values 128~255 (0x80~0xFF), which were then assigned different code pages
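The code-page ambiguity described above is easy to demonstrate: the same extended byte decodes to different characters under different code pages. A small illustration in Java (used here only because the surrounding questions are mostly Java; both charset names are standard JDK names):

```java
import java.nio.charset.Charset;

// The same extended byte 0xC8 means different characters under
// different code pages: 'È' in Latin-1, but Cyrillic 'И' in windows-1251.
public class CodePages {
    public static void main(String[] args) {
        byte[] b = { (byte) 0xC8 };
        System.out.println(new String(b, Charset.forName("ISO-8859-1")));   // È
        System.out.println(new String(b, Charset.forName("windows-1251"))); // И
    }
}
```

This is exactly why a file of 8-bit text is unreadable without knowing which code page produced it, and why Unicode was invented.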

utfcpp and Win32 wide API

被刻印的时光 ゝ submitted on 2019-11-26 21:56:28
Question: Is it good/safe/possible to use the tiny utfcpp library for converting everything I get back from the wide Windows API (FindFirstFileW and such) to a valid UTF-8 representation using utf16to8? I would like to use UTF-8 internally, but am having trouble getting the correct output (via wcout after another conversion, or plain cout). Normal ASCII characters work of course, but ñä gets messed up. Or is there an easier alternative? Thanks! UPDATE: Thanks to Hans (below), I now have an easy UTF8<-