utf-16 | 易学教程

Java Unicode String length

I have been trying to get the character count of a Unicode string and have tried various options. It looks like a small problem, but I am badly stuck. Here I am trying to get the length of the string str1. I get 6, but it is actually 3; moving the cursor over the string "குமார்" also shows it as 3 characters. Basically, I want to measure the length and print each character, like "கு", "மா", "ர்".

    public class one {
        public static void main(String[] args) {
            String str1 = new String("குமார்");
            System.out.print(str1.length());
        }
    }

PS: It is the Tamil language.

halex: Found a solution to your problem. Based on
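The gap between 6 and 3 can be sketched in Python (used here only for illustration; the question itself is about Java, where String.length() counts UTF-16 code units and java.text.BreakIterator is the usual tool for character boundaries). The helper name split_clusters is hypothetical, and grouping each base character with the combining marks that follow it is a simplification of full Unicode grapheme segmentation that happens to be enough for this string.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # Categories Mn, Mc and Me are combining marks, which covers the
        # Tamil vowel signs and the pulli used in this example.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


name = "குமார்"
print(len(name))             # number of code points
print(split_clusters(name))  # user-perceived characters
```

Scripts with more complex cluster rules (conjuncts, emoji sequences) would need a real segmentation library rather than this category check.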

Utf8_general_ci or utf8mb4 or…?

Question: utf16 or utf32? I'm trying to store content in a lot of languages. Some of the languages use double-wide fonts (for example, Japanese fonts are frequently twice as wide as English fonts). I'm not sure which kind of database I should be using. Any information about the differences between these four charsets...

Answer 1: MySQL's utf32 and utf8mb4 (as well as standard UTF-8) can directly store any character specified by Unicode; the former is fixed size at 4 bytes per character whereas the latter is
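The fixed-size versus variable-size distinction in the answer can be seen by measuring encoded lengths directly. This is a small Python sketch of the encodings themselves, not of MySQL storage; the helper name encoded_sizes and the sample characters are chosen for illustration. Note that the display width of a font is unrelated to how many bytes a character occupies.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The -le variants are used so that no byte order mark is counted.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


samples = {"ASCII letter": "A", "Japanese kana": "あ", "emoji": "😀"}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```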

Converting UTF-16 to UTF-8

I'm loading a string from a file. When I print out the string with:

    print my_string
    print binascii.hexlify(my_string)

I get:

    2DF5
    0032004400460035

meaning this string is UTF-16. I would like to convert this string to UTF-8 so that the above code produces this output:

    2DF5
    32444635

I've tried:

    my_string.decode('utf-8')

which outputs:

    32004400460035

EDIT: Here's a quick sample:

    hello = 'hello'.encode('utf-16')
    print hello
    print binascii.hexlify(hello)
    hello = hello[2:].decode('utf-8')
    print hello
    print binascii.hexlify(hello)

which produces this output:

    ��hello
    fffe680065006c006c006f00
    hello
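A minimal Python 3 sketch of the conversion the question is after: decode the bytes as UTF-16 to get text, then encode that text as UTF-8. The helper name utf16_to_utf8 is hypothetical. The two byte strings are taken from the hex dumps above; the first has no byte order mark and is read here as big-endian, which is an assumption based on the zero byte coming first in each pair.

```python
import binascii


def utf16_to_utf8(raw, encoding="utf-16"):
    """Decode UTF-16 bytes to text, then re-encode the text as UTF-8."""
    # The generic "utf-16" codec uses a leading BOM to pick the byte order.
    # For data without a BOM, pass "utf-16-be" or "utf-16-le" explicitly.
    return raw.decode(encoding).encode("utf-8")


no_bom_big_endian = binascii.unhexlify("0032004400460035")
bom_little_endian = binascii.unhexlify("fffe680065006c006c006f00")

print(binascii.hexlify(utf16_to_utf8(no_bom_big_endian, "utf-16-be")))
print(utf16_to_utf8(bom_little_endian))
```

Decoding the UTF-16 bytes as 'utf-8', as in the attempts above, leaves the zero bytes in place because each of them is a valid UTF-8 encoding of U+0000.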

How to write 3 bytes unicode literal in Java?

I'd like to write the Unicode literal U+10428 in Java. http://www.marathon-studios.com/unicode/U10428/Deseret_Small_Letter_Long_I I tried '\u10428' and it doesn't compile.

Because Java went all-in on Unicode when people thought 64K would be enough for everyone (where have we heard that before?), it started out with UCS-2 and was later upgraded to UTF-16. But no escape sequence was ever added for Unicode characters outside the BMP. Thus, your only recourse is to manually recode to a UTF-16 surrogate pair and use two UTF-16 escapes. Your example code point U+10428 is "\uD801\uDC28". I used
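The manual recoding the answer describes follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A short Python sketch of that arithmetic (the function name surrogate_pair is made up for this example):

```python
def surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x10428)
# Format the pair as two Java-style escapes.
print("\\u{:04X}\\u{:04X}".format(high, low))
```

For U+10428 this yields the same pair quoted in the answer, D801 followed by DC28.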

How to convert string to unicode(UTF-8) string in Swift?

How to convert string to unicode(UTF-8) string in Swift? In Objective-C I could write something like this:

    NSString *str = [[NSString alloc] initWithUTF8String:[strToDecode cStringUsingEncoding:NSUTF8StringEncoding]];

How can I do something similar in Swift?

Use this code:

    let str = String(UTF8String: strToDecode.cStringUsingEncoding(NSUTF8StringEncoding))

Hope it's helpful.

Swift 4: I have created a String extension:

    func utf8DecodedString() -> String {
        let data = self.data(using: .utf8)
        if let message = String(data: data!, encoding: .nonLossyASCII) {
            return message
        }
        return ""
    }

    func utf8EncodedString()-> String

Encode/Decode std::string to UTF-16

I have to handle a file format (both reading from and writing to it) in which strings are encoded in UTF-16 (2 bytes per character). Since characters outside the ASCII range are rarely used in the application domain, all of the strings in my C++ model classes are stored in instances of std::string (UTF-8 encoded). I'm looking for a library (I searched the STL and Boost with no luck) or a set of C/C++ functions to handle this std::string <-> UTF-16 conversion when loading from or saving to the file format (actually modeled as a bytestream), including the generation/recognition of surrogate pairs and all that

Check if byte sequence contains utf-16

Question: I am reading a byte sequence from a stream. Assume, for the sake of argument, that the sequence is of a fixed length and I read the whole thing into a byte array (in my case it's vector<char>, but that's not important for this question). This byte sequence contains a string, which may be in either utf-16 or utf-8 encoding. Unfortunately, there's no indicator of which one it is. I can verify whether the byte sequence represents a valid utf-16 encoding and also whether it represents a valid utf-8
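The validity check mentioned in the question can be sketched as follows, in Python rather than the C++ of the original, with a hypothetical helper named valid_encodings. It only reports which candidate encodings decode without error; as the first call shows, short inputs are often valid under more than one, so validity alone does not always settle the question.

```python
def valid_encodings(raw, candidates=("utf-8", "utf-16-le", "utf-16-be")):
    """Return the candidate encodings under which raw decodes without error."""
    valid = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        valid.append(name)
    return valid


# Bytes produced as UTF-8 can also be structurally valid UTF-16.
print(valid_encodings("héllo".encode("utf-8")))
# Bytes produced as UTF-16 are rejected as UTF-8 in this case.
print(valid_encodings("héllo".encode("utf-16-le")))
```

When several candidates remain, any tie-breaking rule (for example, counting zero bytes in typical text) would be a heuristic on top of this check, not something the excerpt specifies.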

UTF-16 Encoding in Java versus C#

I am trying to read a String in the UTF-16 encoding scheme and perform MD5 hashing on it. But strangely, Java and C# return different results when I try to do it. The following is the piece of code in Java:

    public static void main(String[] args) {
        String str = "preparar mantecado con coca cola";
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            digest.update(str.getBytes("UTF-16"));
            byte[] hash = digest.digest();
            String output = "";
            for (byte b : hash) {
                output += Integer.toString((b & 0xff) + 0x100, 16).substring(1);
            }
            System.out.println(output);
        } catch (Exception e) {
        }
    }
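A hash is computed over bytes, so two different UTF-16 serializations of the same text give different digests. The Python sketch below compares two of them. Java's "UTF-16" charset encodes as big-endian with a leading byte order mark; the C# side is not shown in the excerpt, so the second variant (little-endian without a BOM, as .NET's Encoding.Unicode produces) is an assumption about what was used there.

```python
import codecs
import hashlib


def md5_hex(data):
    """Return the MD5 digest of a byte string as lowercase hex."""
    return hashlib.md5(data).hexdigest()


text = "preparar mantecado con coca cola"

# Big-endian with a leading BOM (FE FF), as Java's "UTF-16" charset encodes.
with_bom_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
# Assumption: little-endian with no BOM on the C# side.
le_no_bom = text.encode("utf-16-le")

print(md5_hex(with_bom_be))
print(md5_hex(le_no_bom))
```

Choosing one explicit serialization on both sides, such as UTF-16LE without a BOM, would make the two digests agree.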

How do I encode/decode UTF-16LE byte arrays with a BOM?

I need to encode/decode UTF-16 byte arrays to and from java.lang.String. The byte arrays are given to me with a byte order mark (BOM), and I need to encode byte arrays with a BOM. Also, because I'm dealing with a Microsoft client/server, I'd like to emit the encoding in little endian (along with the LE BOM) to avoid any misunderstandings. I do realize that with the BOM it should work in big endian, but I don't want to swim upstream in the Windows world. As an example, here is a method which encodes a java.lang.String as UTF-16 in little endian with a BOM: public static byte[] encodeString
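The round trip described above can be sketched in Python; this is not the Java method from the question, which is cut off in the excerpt. The function names are hypothetical, and treating BOM-less input as little-endian is an assumption made to match the Windows peer.

```python
import codecs


def encode_utf16le_bom(text):
    """Encode text as UTF-16LE, prefixed with the little-endian BOM (FF FE)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def decode_utf16_bom(raw):
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    # Assumption: data without a BOM is little-endian.
    return raw.decode("utf-16-le")


payload = encode_utf16le_bom("hi")
print(payload.hex())
print(decode_utf16_bom(payload))
```

Writing the BOM explicitly and then using the endian-specific codec keeps the output independent of the platform's native byte order.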

Pandas read_csv and UTF-16

Question: I have a CSV text file encoded in UTF-16 (so as to preserve Unicode characters when others use Excel), but when doing a read_csv with Pandas 0.9.0, I get this cryptic error:

    df = pd.read_csv('data.txt',encoding='utf-16',sep='\t',header=0)
    df.head()

    ---------------------------------------------------------------------------
    Exception                                 Traceback (most recent call last)
    <ipython-input-18-85da1383cd9e> in <module>()
    ----> 1 df = pd.read_csv('candidates-spanish.txt',encoding='utf-16',sep='\t',header