utf-16

Convert C++ std::string to UTF-16-LE encoded string

强颜欢笑 submitted on 2019-11-27 07:37:50
Question: I've been searching for hours today and just can't find anything that works out for me. The one I've just had a look at, with no luck, is "How to convert UTF-8 encoded std::string to UTF-16 std::string". My question is, with a brief explanation: I want to make a valid NTLM hash in std C++, and I'm using OpenSSL's library to create the hash using its MD4 routines. I know how to do that, so does anyone know how to convert the std::string into a UTF-16 LE encoded string which I can pass to the
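A minimal sketch of one way to do this with only the standard library (not the asker's code): convert the UTF-8 std::string to std::u16string with std::wstring_convert/std::codecvt_utf8_utf16 (C++11, deprecated since C++17 but still widely available), then serialize each code unit low byte first so the result is UTF-16LE regardless of host endianness. The function name is illustrative; the resulting byte vector is what you would feed to OpenSSL's MD4 routines.

```cpp
#include <codecvt>
#include <cstdint>
#include <locale>
#include <string>
#include <vector>

// Convert a UTF-8 std::string to a buffer of UTF-16LE bytes.
std::vector<uint8_t> to_utf16le_bytes(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string u16 = conv.from_bytes(utf8);

    std::vector<uint8_t> bytes;
    bytes.reserve(u16.size() * 2);
    for (char16_t cu : u16) {              // one code unit -> two bytes, low byte first
        bytes.push_back(uint8_t(cu & 0xFF));
        bytes.push_back(uint8_t(cu >> 8));
    }
    return bytes;                          // pass this buffer to the MD4 hashing code
}
```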

JavaScript strings - UTF-16 vs UCS-2?

余生颓废 submitted on 2019-11-27 07:26:56
I've read in some places that JavaScript strings are UTF-16, and in other places they're UCS-2. I did some searching around to try to figure out the difference and found this: Q: What is the difference between UCS-2 and UTF-16? A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided. UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code
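To illustrate the practical difference the FAQ is describing: a supplementary character such as U+1F600 cannot be represented in UCS-2 at all, while UTF-16 encodes it as a surrogate pair of two 16-bit code units. The sketch below applies the standard UTF-16 encoding rule; the values shown in the comments follow directly from that arithmetic.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint32_t cp = 0x1F600;                       // emoji U+1F600, outside the BMP
    uint32_t v  = cp - 0x10000;
    uint16_t high = 0xD800 + (v >> 10);          // high (lead) surrogate
    uint16_t low  = 0xDC00 + (v & 0x3FF);        // low (trail) surrogate
    std::printf("%04X %04X\n", (unsigned)high, (unsigned)low);   // prints D83D DE00
}
```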

UTF-16 to UTF-8 conversion (for scripting in Windows)

血红的双手。 submitted on 2019-11-27 05:58:01
Question: What is the best way to convert UTF-16 files to UTF-8? I need to use this in a cmd script. Answer 1: There is a GNU tool, recode, which you can also use on Windows, e.g. recode utf16..utf8 text.txt Answer 2: An alternative to Ruby would be to write a small .NET program in C# (.NET 1.0 would be fine, although 2.0 would be simpler :) - it's a pretty trivial bit of code. Were you hoping to do it without any other applications at all? If you want a bit of code to do it, add a comment and I'll fill in the
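In the same spirit as the "trivial converter" suggested in Answer 2, here is a hedged sketch of a tiny standalone converter in C++ (rather than C#) that a cmd script could call: it reads a UTF-16LE file as raw bytes, skips a leading BOM if present, and writes UTF-8. The file arguments and error handling are illustrative only.

```cpp
#include <codecvt>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    if (argc != 3) return 1;                      // usage: conv <in-utf16le> <out-utf8>
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());

    std::u16string u16;
    size_t i = 0;
    if (bytes.size() >= 2 &&
        (unsigned char)bytes[0] == 0xFF && (unsigned char)bytes[1] == 0xFE)
        i = 2;                                    // skip the UTF-16LE BOM
    for (; i + 1 < bytes.size(); i += 2)          // assemble little-endian code units
        u16.push_back(char16_t((unsigned char)bytes[i] |
                               ((unsigned char)bytes[i + 1] << 8)));

    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::string utf8 = conv.to_bytes(u16);        // UTF-16 -> UTF-8
    std::ofstream out(argv[2], std::ios::binary);
    out.write(utf8.data(), utf8.size());
}
```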

Java charAt used with characters that have two code units

馋奶兔 submitted on 2019-11-27 04:37:16
Question: From Core Java, vol. 1, 9th ed., p. 69: The character ℤ requires two code units in the UTF-16 encoding. Calling String sentence = "ℤ is the set of integers"; // for clarity; not in book char ch = sentence.charAt(1) doesn't return a space but the second code unit of ℤ. But it seems that sentence.charAt(1) does return a space. For example, the if statement in the following code evaluates to true. String sentence = "ℤ is the set of integers"; if (sentence.charAt(1) == ' ') System.out.println(
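The observed behaviour is consistent with the fact that ℤ (U+2124, DOUBLE-STRUCK CAPITAL Z) lies in the Basic Multilingual Plane and therefore needs only one UTF-16 code unit, so charAt(1) really is the space; presumably the book intended a character outside the BMP, such as 𝕫 (U+1D56B), which does take two code units. A small demonstration of the difference (in C++, counting UTF-16 code units directly):

```cpp
#include <cstdint>
#include <iostream>
#include <string>

int main() {
    // U+2124 (ℤ) is in the Basic Multilingual Plane: one UTF-16 code unit.
    std::u16string bmp = u"\u2124";
    // U+1D56B (𝕫) is a supplementary character: two code units (a surrogate pair).
    std::u16string supplementary = u"\U0001D56B";

    std::cout << bmp.size() << "\n";                 // 1
    std::cout << supplementary.size() << "\n";       // 2
    std::cout << std::hex
              << uint16_t(supplementary[0]) << " "   // d835
              << uint16_t(supplementary[1]) << "\n"; // dd6b
}
```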

Convert UTF-16 to UTF-8 and remove BOM?

淺唱寂寞╮ submitted on 2019-11-27 04:24:19
We have a data entry person who encoded files in UTF-16 on Windows, and we would like to convert them to UTF-8 and remove the BOM. The UTF-8 conversion works, but the BOM is still there. How would I remove it? This is what I currently have: batch_3={'src':'/Users/jt/src','dest':'/Users/jt/dest/'} batches=[batch_3] for b in batches: s_files=os.listdir(b['src']) for file_name in s_files: ff_name = os.path.join(b['src'], file_name) if (os.path.isfile(ff_name) and ff_name.endswith('.json')): print ff_name target_file_name=os.path.join(b['dest'], file_name) BLOCKSIZE = 1048576 with codecs.open(ff_name, "r", "utf-16-le")
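The likely cause: reading the file with the BOM-unaware "utf-16-le" codec leaves the byte-order mark in the text as a leading U+FEFF character, which then re-encodes as the three bytes EF BB BF in the UTF-8 output. The fix is simply to drop a leading U+FEFF before encoding (or read with a BOM-aware codec). A minimal sketch of the idea, shown here in C++; the same logic applies to the Python code above:

```cpp
#include <string>

// Drop a leading byte-order mark (U+FEFF) so it is not re-encoded into the output.
std::u16string strip_bom(std::u16string s) {
    if (!s.empty() && s.front() == u'\uFEFF')
        s.erase(s.begin());
    return s;
}
```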

Why does the Java char primitive take up 2 bytes of memory?

纵然是瞬间 submitted on 2019-11-27 03:57:17
Is there any reason why the Java char primitive data type is 2 bytes, unlike C where it is 1 byte? Thanks. When Java was originally designed, it was anticipated that any Unicode character would fit in 2 bytes (16 bits), so char and Character were designed accordingly. In fact, a Unicode character can now require up to 4 bytes. Thus, in UTF-16, Java's internal encoding, supplementary characters require 2 code units, while characters in the Basic Multilingual Plane (the most common ones) still use 1. A Java char holds one code unit. This Sun article explains it well. char in Java is UTF-16 encoded,

Convert UTF-16 to UTF-8 under Windows and Linux, in C

我的梦境 submitted on 2019-11-27 03:41:13
I was wondering if there is a recommended 'cross' Windows and Linux method for the purpose of converting strings from UTF-16LE to UTF-8? Or should one use different methods for each environment? I've managed to google a few references to 'iconv', but for some reason I can't find samples of basic conversions, such as converting a wchar_t UTF-16 string to UTF-8. Can anybody recommend a method that would be 'cross', and if you know of references or a guide with samples, I would very much appreciate it. Thanks, Doori Bar If you don't want to use ICU, Windows: WideCharToMultiByte Linux: iconv (Glibc) user4657497
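A hedged sketch of the two approaches named in that answer, one per platform behind an #ifdef: WideCharToMultiByte on Windows and glibc's iconv on Linux. Error handling is minimal and the buffer sizing is deliberately generous; function names are illustrative.

```cpp
#include <string>

#ifdef _WIN32
#include <windows.h>

// Windows: wchar_t is UTF-16; ask for the required size, then convert to UTF-8.
std::string utf16_to_utf8(const std::wstring& in) {
    int len = WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(),
                                  nullptr, 0, nullptr, nullptr);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, in.c_str(), (int)in.size(),
                        &out[0], len, nullptr, nullptr);
    return out;
}
#else
#include <iconv.h>

// Linux/glibc: convert raw UTF-16LE bytes to UTF-8 with iconv.
std::string utf16le_to_utf8(const std::string& in) {
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    std::string out(in.size() * 2, '\0');   // UTF-8 needs at most 1.5x the UTF-16 size
    char* src = const_cast<char*>(in.data());
    char* dst = &out[0];
    size_t srcleft = in.size(), dstleft = out.size();
    iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);
    out.resize(out.size() - dstleft);       // trim to the bytes actually produced
    return out;
}
#endif
```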

Encode/Decode std::string to UTF-16

纵饮孤独 submitted on 2019-11-27 02:58:27
Question: I have to handle a file format (both read from and write to it) in which strings are encoded in UTF-16 (2 bytes per character). Since characters outside the ASCII table are rarely used in the application domain, all of the strings in my C++ model classes are stored in instances of std::string (UTF-8 encoded). I'm looking for a library (searched in STL and Boost with no luck) or a set of C/C++ functions to handle this std::string <-> UTF-16 conversion when loading from or saving to file format
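A minimal sketch using only the standard library: std::wstring_convert with std::codecvt_utf8_utf16 (C++11) converts in both directions between a UTF-8 std::string and a UTF-16 std::u16string. Note that this facility is deprecated since C++17, though still widely available; ICU and Boost.Locale are the usual library alternatives. Function names here are illustrative.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 std::string -> UTF-16 std::u16string
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}

// UTF-16 std::u16string -> UTF-8 std::string
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```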

UTF-16 Encoding in Java versus C#

三世轮回 submitted on 2019-11-27 02:52:19
Question: I am trying to read a String in UTF-16 encoding scheme and perform MD5 hashing on it. But strangely, Java and C# are returning different results when I try to do it. The following is the piece of code in Java: public static void main(String[] args) { String str = "preparar mantecado con coca cola"; try { MessageDigest digest = MessageDigest.getInstance("MD5"); digest.update(str.getBytes("UTF-16")); byte[] hash = digest.digest(); String output = ""; for(byte b: hash){ output += Integer
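The usual cause of this mismatch (hedged, since the C# side of the excerpt is cut off): Java's str.getBytes("UTF-16") produces big-endian UTF-16 prefixed with a BOM, while C#'s Encoding.Unicode produces little-endian UTF-16 without a BOM, so MD5 hashes different input bytes. The byte sequences for the single character "A" (U+0041) illustrate the difference:

```cpp
#include <cstdio>

int main() {
    // Java "UTF-16": BOM FE FF, then big-endian code units.
    unsigned char java_bytes[]   = {0xFE, 0xFF, 0x00, 0x41};
    // C# Encoding.Unicode: little-endian code units, no BOM.
    unsigned char csharp_bytes[] = {0x41, 0x00};

    for (unsigned char b : java_bytes)   std::printf("%02X ", b);   // FE FF 00 41
    std::printf("\n");
    for (unsigned char b : csharp_bytes) std::printf("%02X ", b);   // 41 00
    std::printf("\n");
}
```

Using an explicit byte order on both sides (e.g. "UTF-16LE" in Java and Encoding.Unicode in C#) makes the inputs, and therefore the hashes, match.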

Getting to the Bottom of Character Encodings, Part 8: An Overview of the Unicode Encoding Scheme

六眼飞鱼酱① submitted on 2019-11-27 02:36:27
Overview of the Unicode encoding scheme 1. As mentioned earlier, as computers spread around the world, each country and region went its own way and devised many encoding schemes that were compatible with ASCII but incompatible with one another. As a result, the same binary code could be interpreted as different characters, which made exchanging data between different character sets extremely inconvenient. For example, mainland China and Taiwan, sister regions only 150 nautical miles apart that use the same language, adopted different DBCS (double-byte character set) encoding schemes. In the past, the mainland had to install a Chinese-processing system such as UCDOS specifically to handle the display and input of Simplified Chinese characters, while Taiwan, having adopted the BIG5 encoding scheme (a unified encoding for Traditional Chinese, commonly known as Big5, which uses 2 bytes per Traditional Chinese character), had to install a Traditional Chinese processing system such as the ETen Chinese System (倚天) to display and input Traditional Chinese characters correctly. Therefore, to open a text file you must first know which encoding scheme it uses; decoding with the wrong scheme produces garbled text (mojibake). Why do emails so often arrive garbled? Because the sender and the recipient use different encoding schemes. 2. Imagine a single unified encoding scheme that covered the characters of every language in the world and gave each character a globally unique code: the garbled-text problem would disappear. And so a unified encoding scheme for all the language characters used by every country and people was born. Initially, multilingual software vendors formed the Unicode Consortium, which in 1991 published The Unicode Standard