utf-16

Convert C++ std::string to UTF-16-LE encoded string

强颜欢笑 submitted on 2019-11-27 07:37:50
Question: I've been searching for hours today and just can't find anything that works out for me. The one I've just had a look at, with no luck, is "How to convert UTF-8 encoded std::string to UTF-16 std::string". My question is, with a brief explanation: I want to make a valid NTLM hash in std C++, and I'm using OpenSSL's library to create the hash using its MD4 routines. I know how to do that, so does anyone know how to convert the std::string into a UTF-16 LE encoded string which I can pass to the
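A minimal sketch of one way to do this with only the standard library (not the asker's code): convert the UTF-8 std::string to std::u16string with std::wstring_convert/std::codecvt_utf8_utf16 (C++11, deprecated since C++17 but still widely available), then serialize each code unit low byte first so the result is UTF-16LE regardless of host endianness. The function name is illustrative; the resulting byte vector is what you would feed to OpenSSL's MD4 routines.

```cpp
#include <codecvt>
#include <cstdint>
#include <locale>
#include <string>
#include <vector>

// Convert a UTF-8 std::string to a buffer of UTF-16LE bytes.
std::vector<uint8_t> to_utf16le_bytes(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string u16 = conv.from_bytes(utf8);

    std::vector<uint8_t> bytes;
    bytes.reserve(u16.size() * 2);
    for (char16_t cu : u16) {              // one code unit -> two bytes, low byte first
        bytes.push_back(uint8_t(cu & 0xFF));
        bytes.push_back(uint8_t(cu >> 8));
    }
    return bytes;                          // pass this buffer to the MD4 hashing code
}
```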

JavaScript strings - UTF-16 vs UCS-2?

余生颓废 submitted on 2019-11-27 07:26:56
I've read in some places that JavaScript strings are UTF-16, and in other places they're UCS-2. I did some searching around to try to figure out the difference and found this: Q: What is the difference between UCS-2 and UTF-16? A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided. UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code
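To illustrate the practical difference the FAQ is describing: a supplementary character such as U+1F600 cannot be represented in UCS-2 at all, while UTF-16 encodes it as a surrogate pair of two 16-bit code units. The sketch below applies the standard UTF-16 encoding rule; the values shown in the comments follow directly from that arithmetic.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t cp = 0x1F600;                       // emoji U+1F600, outside the BMP
    uint32_t v  = cp - 0x10000;
    uint16_t high = 0xD800 + (v >> 10);          // high (lead) surrogate
    uint16_t low  = 0xDC00 + (v & 0x3FF);        // low (trail) surrogate
    std::printf("%04X %04X\n", (unsigned)high, (unsigned)low);   // prints D83D DE00
}
```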

UTF-16 to UTF-8 conversion (for scripting in Windows)

血红的双手。 submitted on 2019-11-27 05:58:01
Question: What is the best way to convert UTF-16 files to UTF-8? I need to use this in a cmd script. Answer 1: There is a GNU tool, recode, which you can also use on Windows, e.g. recode utf16..utf8 text.txt Answer 2: An alternative to Ruby would be to write a small .NET program in C# (.NET 1.0 would be fine, although 2.0 would be simpler :) - it's a pretty trivial bit of code. Were you hoping to do it without any other applications at all? If you want a bit of code to do it, add a comment and I'll fill in the
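In the same spirit as the "trivial converter" suggested in Answer 2, here is a hedged sketch of a tiny standalone converter in C++ (rather than C#) that a cmd script could call: it reads a UTF-16LE file as raw bytes, skips a leading BOM if present, and writes UTF-8. The file arguments and error handling are illustrative only.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc != 3) return 1;                      // usage: conv <in-utf16le> <out-utf8>
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());

    std::u16string u16;
    size_t i = 0;
    if (bytes.size() >= 2 &&
        (unsigned char)bytes[0] == 0xFF && (unsigned char)bytes[1] == 0xFE)
        i = 2;                                    // skip the UTF-16LE BOM
    for (; i + 1 < bytes.size(); i += 2)          // assemble little-endian code units
        u16.push_back(char16_t((unsigned char)bytes[i] |
                               ((unsigned char)bytes[i + 1] << 8)));

    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::string utf8 = conv.to_bytes(u16);        // UTF-16 -> UTF-8
    std::ofstream out(argv[2], std::ios::binary);
    out.write(utf8.data(), utf8.size());
}
```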

Java charAt used with characters that have two code units

馋奶兔 submitted on 2019-11-27 04:37:16
Question: From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the second code unit of ℤ. But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true. String sentence = "ℤ is the set of integers"; if (sentence.charAt(1) == ' ') System.out.println(
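The observed behaviour is consistent with the fact that ℤ (U+2124, DOUBLE-STRUCK CAPITAL Z) lies in the Basic Multilingual Plane and therefore needs only one UTF-16 code unit, so charAt(1) really is the space; presumably the book intended a character outside the BMP, such as 𝕫 (U+1D56B), which does take two code units. A small demonstration of the difference (in C++, counting UTF-16 code units directly):

```cpp
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    // U+2124 (ℤ) is in the Basic Multilingual Plane: one UTF-16 code unit.
    std::u16string bmp = u"\u2124";
    // U+1D56B (𝕫) is a supplementary character: two code units (a surrogate pair).
    std::u16string supplementary = u"\U0001D56B";

    std::cout << bmp.size() << "\n";                 // 1
    std::cout << supplementary.size() << "\n";       // 2
    std::cout << std::hex
              << uint16_t(supplementary[0]) << " "   // d835
              << uint16_t(supplementary[1]) << "\n"; // dd6b
}
```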

Convert UTF-16 to UTF-8 and remove BOM?

淺唱寂寞╮ submitted on 2019-11-27 04:24:19
We have a data entry person who encoded files in UTF-16 on Windows, and we would like to convert them to UTF-8 and remove the BOM. The UTF-8 conversion works, but the BOM is still there. How would I remove it? This is what I currently have: batch_3={'src':'/Users/jt/src','dest':'/Users/jt/dest/'} batches=[batch_3] for b in batches: s_files=os.listdir(b['src']) for file_name in s_files: ff_name = os.path.join(b['src'], file_name) if (os.path.isfile(ff_name) and ff_name.endswith('.json')): print ff_name target_file_name=os.path.join(b['dest'], file_name) BLOCKSIZE = 1048576 with codecs.open(ff_name, "r", "utf-16-le")
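The likely cause: reading the file with the BOM-unaware "utf-16-le" codec leaves the byte-order mark in the text as a leading U+FEFF character, which then re-encodes as the three bytes EF BB BF in the UTF-8 output. The fix is simply to drop a leading U+FEFF before encoding (or read with a BOM-aware codec). A minimal sketch of the idea, shown here in C++; the same logic applies to the Python code above:

```cpp
#include <string>

// Drop a leading byte-order mark (U+FEFF) so it is not re-encoded into the output.
std::u16string strip_bom(std::u16string s) {
    if (!s.empty() && s.front() == u'\uFEFF')
        s.erase(s.begin());
    return s;
}
```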

Why does the Java char primitive take up 2 bytes of memory?

纵然是瞬间 submitted on 2019-11-27 03:57:17
Is there any reason why the Java char primitive data type is 2 bytes, unlike C where it is 1 byte? Thanks. When Java was originally designed, it was anticipated that any Unicode character would fit in 2 bytes (16 bits), so char and Character were designed accordingly. In fact, a Unicode character can now require up to 4 bytes. Thus, in UTF-16, Java's internal encoding, supplementary characters require 2 code units, while characters in the Basic Multilingual Plane (the most common ones) still use 1. A Java char holds one code unit. This Sun article explains it well. char in Java is UTF-16 encoded,

Convert UTF-16 to UTF-8 under Windows and Linux, in C

我的梦境 submitted on 2019-11-27 03:41:13
I was wondering if there is a recommended 'cross' Windows and Linux method for the purpose of converting strings from UTF-16LE to UTF-8? Or should one use different methods for each environment? I've managed to google a few references to 'iconv', but for some reason I can't find samples of basic conversions, such as converting a wchar_t UTF-16 string to UTF-8. Can anybody recommend a method that would be 'cross', and if you know of references or a guide with samples, I would very much appreciate it. Thanks, Doori Bar If you don't want to use ICU, Windows: WideCharToMultiByte Linux: iconv (Glibc) user4657497
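A hedged sketch of the two approaches named in that answer, one per platform behind an #ifdef: WideCharToMultiByte on Windows and glibc's iconv on Linux. Error handling is minimal and the buffer sizing is deliberately generous; function names are illustrative.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// Windows: wchar_t is UTF-16; ask for the required size, then convert to UTF-8.
std::string utf16_to_utf8(const std::wstring& in) {
    int len = WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(),
                                  nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(),
                        &out[0], len, nullptr, nullptr);
    return out;
}
#else
#include <iconv.h>

// Linux/glibc: convert raw UTF-16LE bytes to UTF-8 with iconv.
std::string utf16le_to_utf8(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    std::string out(in.size() * 2, '\0');   // UTF-8 needs at most 1.5x the UTF-16 size
    char* src = const_cast<char*>(in.data());
    char* dst = &out[0];
    size_t srcleft = in.size(), dstleft = out.size();
    iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);
    out.resize(out.size() - dstleft);       // trim to the bytes actually produced
    return out;
}
#endif
```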

Encode/Decode std::string to UTF-16

纵饮孤独 submitted on 2019-11-27 02:58:27
Question: I have to handle a file format (both read from and write to it) in which strings are encoded in UTF-16 (2 bytes per character). Since characters outside the ASCII table are rarely used in the application domain, all of the strings in my C++ model classes are stored in instances of std::string (UTF-8 encoded). I'm looking for a library (searched in STL and Boost with no luck) or a set of C/C++ functions to handle this std::string <-> UTF-16 conversion when loading from or saving to file format
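A minimal sketch using only the standard library: std::wstring_convert with std::codecvt_utf8_utf16 (C++11) converts in both directions between a UTF-8 std::string and a UTF-16 std::u16string. Note that this facility is deprecated since C++17, though still widely available; ICU and Boost.Locale are the usual library alternatives. Function names here are illustrative.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 std::string -> UTF-16 std::u16string
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// UTF-16 std::u16string -> UTF-8 std::string
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```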

UTF-16 Encoding in Java versus C#

三世轮回 submitted on 2019-11-27 02:52:19
Question: I am trying to read a String in UTF-16 encoding scheme and perform MD5 hashing on it. But strangely, Java and C# are returning different results when I try to do it. The following is the piece of code in Java: public static void main(String[] args) { String str = "preparar mantecado con coca cola"; try { MessageDigest digest = MessageDigest.getInstance("MD5"); digest.update(str.getBytes("UTF-16")); byte[] hash = digest.digest(); String output = ""; for(byte b: hash){ output += Integer
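The usual cause of this mismatch (hedged, since the C# side of the excerpt is cut off): Java's str.getBytes("UTF-16") produces big-endian UTF-16 prefixed with a BOM, while C#'s Encoding.Unicode produces little-endian UTF-16 without a BOM, so MD5 hashes different input bytes. The byte sequences for the single character "A" (U+0041) illustrate the difference:

```cpp
#include <cstdio>

int main() {
    // Java "UTF-16": BOM FE FF, then big-endian code units.
    unsigned char java_bytes[]   = {0xFE, 0xFF, 0x00, 0x41};
    // C# Encoding.Unicode: little-endian code units, no BOM.
    unsigned char csharp_bytes[] = {0x41, 0x00};

    for (unsigned char b : java_bytes)   std::printf("%02X ", b);   // FE FF 00 41
    std::printf("\n");
    for (unsigned char b : csharp_bytes) std::printf("%02X ", b);   // 41 00
    std::printf("\n");
}
```

Using an explicit byte order on both sides (e.g. "UTF-16LE" in Java and Encoding.Unicode in C#) makes the inputs, and therefore the hashes, match.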

Getting to the Bottom of Character Encodings, Part 8: An Overview of the Unicode Encoding Scheme

六眼飞鱼酱① submitted on 2019-11-27 02:36:27
Overview of the Unicode encoding scheme 1. As mentioned earlier, as computers spread around the world, each country and region went its own way and devised many encoding schemes that were compatible with ASCII but incompatible with one another. As a result, the same binary code could be interpreted as different characters, which made exchanging data between different character sets extremely inconvenient. For example, mainland China and Taiwan, sister regions only 150 nautical miles apart that use the same language, adopted different DBCS (double-byte character set) encoding schemes. In the past, the mainland had to install a Chinese-processing system such as UCDOS specifically to handle the display and input of Simplified Chinese characters, while Taiwan, having adopted the BIG5 encoding scheme (a unified encoding for Traditional Chinese, commonly known as Big5, which uses 2 bytes per Traditional Chinese character), had to install a Traditional Chinese processing system such as the ETen Chinese System (倚天) to display and input Traditional Chinese characters correctly. Therefore, to open a text file you must first know which encoding scheme it uses; decoding with the wrong scheme produces garbled text (mojibake). Why do emails so often arrive garbled? Because the sender and the recipient use different encoding schemes. 2. Imagine a single unified encoding scheme that covered the characters of every language in the world and gave each character a globally unique code: the garbled-text problem would disappear. And so a unified encoding scheme for all the language characters used by every country and people was born. Initially, multilingual software vendors formed the Unicode Consortium, which in 1991 published The Unicode Standard