surrogate-pairs | 易学教程

Simplest way to extract first Unicode codepoint of an NSString (outside the BMP)?

Question: For historical reasons, Cocoa's Unicode implementation is 16-bit: it handles Unicode characters above 0xFFFF via "surrogate pairs". This means that the following code is not going to work:

    NSString *myString = @"𠬠";
    uint32_t codepoint = [myString characterAtIndex:0];
    printf("%04x\n", codepoint); // incorrectly prints "d842"

Now, this code works 100% of the time, but it's ridiculously verbose:

    NSString *myString = @"𠬠";
    uint32_t codepoint;
    [@"𠬠" getBytes:&codepoint maxLength:4 usedLength:nil
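
As a rough illustration of why reading a single 16-bit unit falls short, here is a minimal Python sketch (Python rather than Objective-C, and not taken from the question). It assumes the example character is U+20B20, whose high surrogate is 0xD842 as printed in the excerpt; the helper names `first_code_point` and `first_utf16_unit` are hypothetical.

```python
def first_code_point(text):
    """Return the first Unicode code point of text as an integer."""
    # UTF-32 stores every code point in exactly four bytes,
    # so the first four bytes always hold the first code point.
    return int.from_bytes(text.encode("utf-32-le")[:4], "little")


def first_utf16_unit(text):
    """Return the first 16-bit UTF-16 code unit of text as an integer."""
    return int.from_bytes(text.encode("utf-16-le")[:2], "little")


sample = "\U00020B20"  # assumption: the character used in the question
print("%04x" % first_utf16_unit(sample))  # d842 (high surrogate only)
print("%04x" % first_code_point(sample))  # 20b20 (the full code point)
```

Going through a fixed-width encoding trades a little extra work for not having to inspect surrogates by hand.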

How to reverse a string that contains surrogate pairs

Question: I have written this method to reverse a string:

    public string Reverse(string s)
    {
        if (string.IsNullOrEmpty(s)) return s;
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(s);
        var elements = new List<char>();
        while (enumerator.MoveNext())
        {
            var cs = enumerator.GetTextElement().ToCharArray();
            if (cs.Length > 1)
            {
                elements.AddRange(cs.Reverse());
            }
            else
            {
                elements.AddRange(cs);
            }
        }
        elements.Reverse();
        return string.Concat(elements);
    }

Now, I don't want to start a discussion
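
The core idea, keeping each pair's two code units in their original order while reversing everything else, can be sketched in Python on a plain list of UTF-16 code units. This is an illustrative simplification, not the asker's method: it handles surrogate pairs only, whereas the C# code above works on whole text elements (which also cover combining sequences). The function name `reverse_utf16_units` is hypothetical.

```python
def reverse_utf16_units(units):
    """Reverse a list of UTF-16 code units, keeping surrogate pairs intact."""
    out = []
    i = len(units) - 1
    while i >= 0:
        unit = units[i]
        is_low = 0xDC00 <= unit <= 0xDFFF
        has_high_before = i > 0 and 0xD800 <= units[i - 1] <= 0xDBFF
        if is_low and has_high_before:
            # Emit the pair as (high, low) so it still decodes correctly.
            out.extend((units[i - 1], unit))
            i -= 2
        else:
            out.append(unit)
            i -= 1
    return out


# "a", U+1F436 (as D83D DC36), "b"
units = [0x0061, 0xD83D, 0xDC36, 0x0062]
print([hex(u) for u in reverse_utf16_units(units)])
# ['0x62', '0xd83d', '0xdc36', '0x61']
```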

Python: getting correct string length when it contains surrogate pairs

Question: Consider the following exchange on IPython:

    In [1]: s = u'華袞與緼𦅷同歸'
    In [2]: len(s)
    Out[2]: 8

The correct output should have been 7, but because the fifth of these seven Chinese characters has a high Unicode code point, it is represented in UTF-16 by a "surrogate pair" rather than a single code unit, and as a result Python thinks it is two characters rather than one. Even if I use unicodedata, which returns the surrogate pair correctly as a single codepoint (\U00026177), when passed
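
One way to see where the 8 comes from is to count UTF-16 code units and code points separately. The sketch below is an illustration under stated assumptions, not the answer from the original thread: it runs on Python 3, assumes well-formed UTF-16, and uses the hypothetical helper names `utf16_units` and `count_code_points`. The fifth character is written as the escape \U00026177 quoted in the question.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def count_code_points(units):
    """Count code points in a well-formed sequence of UTF-16 code units."""
    # Each code point contributes exactly one unit that is not a low
    # surrogate: either a BMP unit or the high half of a pair.
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)


s = "華袞與緼\U00026177同歸"
units = utf16_units(s)
print(len(units))                # 8 code units
print(count_code_points(units))  # 7 code points
```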

How to convert surrogate pair to Unicode scalar in Swift

Question: The following example is taken from the Strings and Characters documentation: The values 55357 (U+D83D in hex) and 56374 (U+DC36 in hex) are the surrogate pair that forms the Unicode scalar U+1F436, which is the DOG FACE character. Is there any way to go the other direction? That is, can I convert a surrogate pair into a scalar? I tried let myChar: Character = "\u{D83D}\u{DC36}" but I got an "Invalid Unicode scalar" error. This Objective-C answer and this project seem to be custom
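
The arithmetic behind that conversion is the standard UTF-16 decoding formula. The sketch below shows it in Python rather than Swift, purely to illustrate the calculation; the function name `combine_surrogates` is hypothetical and nothing here is taken from the linked answers.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one Unicode scalar value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


scalar = combine_surrogates(55357, 56374)
print("U+%X" % scalar)  # U+1F436
```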

Detecting and Retrieving codepoints and surrogates from a Delphi String

Question: I am trying to better understand surrogate pairs and the Unicode implementation in Delphi. If I call Length() on the Unicode string S := 'Ĥà̲V̂e' in Delphi, I get back 8. This is because the lengths of the individual characters [Ĥ], [à̲], [V̂], and [e] are 2, 3, 2, and 1 respectively. This is because Ĥ has a surrogate, à̲ has two additional surrogates, V̂ has a surrogate and e has no surrogates. If I wanted to return the second element in the string including all surrogates, [à̲], how
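
A length of 8 is what one gets if the string is stored in decomposed form, where each accent is a separate combining mark following its base letter (combining marks are a different mechanism from UTF-16 surrogates). The Python sketch below, not Delphi and not from the thread, groups each base character with the combining marks that follow it. It assumes the decomposed spelling shown in the code and is only an approximation of full grapheme-cluster segmentation; `split_clusters` is a hypothetical name.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Assumed spelling: H + U+0302, a + U+0300 + U+0332, V + U+0302, e
s = "H\u0302a\u0300\u0332V\u0302e"
clusters = split_clusters(s)
print(len(s))                         # 8
print([len(c) for c in clusters])     # [2, 3, 2, 1]
print(clusters[1] == "a\u0300\u0332") # True
```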

Java Can't Open a File with Surrogate Unicode Values in the Filename?

Question: I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM can't seem to locate the file. For example, my test file is: "草鷗外.gif" which gets broken into the Java characters \u8349\uD85B\uDFF6\u9DD7\u5916.gif If I create a file from this filename, I can't open it because I get a FileNotFound exception. Even

Python: Find equivalent surrogate pair from non-BMP unicode char

Question: The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f', into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair from a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer to that. Answer 1:
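
The excerpt cuts off before the answer's text. Independently of whatever that answer proposed, one possible approach is to apply the UTF-16 encoding arithmetic directly; the sketch below assumes Python 3 and uses the hypothetical helper name `to_surrogate_pair`.

```python
def to_surrogate_pair(ch):
    """Return the two-character surrogate-pair string for a non-BMP character."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("character is inside the BMP and has no surrogate pair")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return chr(high) + chr(low)


pair = to_surrogate_pair("\U0001f64f")
print([hex(ord(c)) for c in pair])  # ['0xd83d', '0xde4f']

# Round trip using the conversion quoted in the question.
restored = pair.encode("utf-16", "surrogatepass").decode("utf-16")
print(restored == "\U0001f64f")     # True
```

The code points are printed as hex because writing a string of lone surrogates straight to standard output typically raises an encoding error.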

c++: How to support surrogate characters in utf8

Question: We have an application that is written with UTF-8 as its base encoding, and it supports the BMP (up to 3 bytes per character in UTF-8). However, there is a requirement that it support surrogate pairs. I have read somewhere that surrogate characters are not supported in UTF-8. Is that true? If so, what are the steps to make my application use UTF-16 as its default encoding rather than UTF-8? I don't have a code snippet, as the entire application was written with UTF-8 in mind and not surrogate characters.
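
For context, UTF-8 encodes characters outside the BMP directly as 4-byte sequences, while surrogate code points themselves are not valid in UTF-8. The short Python sketch below (Python rather than C++, for illustration only) shows both behaviours using Python's strict UTF-8 codec; the variable names are arbitrary.

```python
# A non-BMP character (U+1F436) encodes to four bytes in UTF-8.
dog = "\U0001F436"
encoded = dog.encode("utf-8")
print(len(encoded), encoded.hex())  # 4 f09f90b6

# A lone surrogate is rejected by the strict UTF-8 codec.
try:
    "\ud83d".encode("utf-8")
    lone_surrogate_accepted = True
except UnicodeEncodeError:
    lone_surrogate_accepted = False
print(lone_surrogate_accepted)      # False
```

So supporting such characters in a UTF-8 application is mostly a matter of accepting 4-byte sequences rather than switching to UTF-16.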

Is there encoding in Unicode where every “character” is just one code point?

Question: Trying to rephrase: Can you map every combining character combination into one code point? I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct? Is this true for the Basic Multilingual Plane also? Answer 1: If you mean one char == one number (i.e., where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4
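
Both halves of the question can be probed with a small Python sketch, offered as an illustration and not as part of the quoted answer. It relies on two facts: NFC normalization composes "e" plus a combining acute accent into one code point, and there is no precomposed code point for "V" plus a combining circumflex, so that sequence stays at two code points. The variable names are arbitrary.

```python
import unicodedata

# NFC composes this sequence into the single code point U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")
print(len(composed))    # 1

# No precomposed form exists here, so NFC leaves two code points.
uncomposed = unicodedata.normalize("NFC", "V\u0302")
print(len(uncomposed))  # 2

# In UTF-32 every code point takes four bytes, but a user-perceived
# character may still span several code points.
utf32 = "V\u0302\U0001F436".encode("utf-32-le")
print(len(utf32))       # 12 bytes for 3 code points
```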