surrogate-pairs | 易学教程

C# - Regular expression to find a surrogate pair of a unicode codepoint from any string?

Question: I am trying to parse a message that may contain emojis. An example message that could be received looks like: {"type":"chat","msg":"UserName:\u00a0\ud83d\ude0b \n"} What should match is \u00a0 as a single character, and \ud83d\ude0b as a pair. I have a regex that can pull out individual codes, but not pairs, so it cannot match the full emoji: \\u[a-z0-9]{4} Is there a clean way to account for any number of emojis in a sentence so I can replace each surrogate pair using the function I have? Thanks!
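
A minimal sketch of one way to match such escapes, written in Python rather than C#. The pattern tries a high-surrogate escape followed by a low-surrogate escape first and falls back to a single \uXXXX escape. The names ESCAPE_RE, decode_escape and find_escapes are hypothetical, and decode_escape only stands in for whatever replacement function the asker already has.

```python
import re

# A high+low surrogate escape pair, or else any single \uXXXX escape.
ESCAPE_RE = re.compile(
    r'\\u(d[89ab][0-9a-f]{2})\\u(d[c-f][0-9a-f]{2})'  # surrogate pair
    r'|\\u([0-9a-f]{4})',                               # single escape
    re.IGNORECASE,
)

def decode_escape(match):
    """Return the character for one matched escape (pair or single)."""
    high, low, single = match.groups()
    if single is not None:
        # Note: an unpaired surrogate escape also lands here and yields
        # a lone surrogate code point.
        return chr(int(single, 16))
    # Combine the two UTF-16 code units into one code point.
    code_point = 0x10000 + ((int(high, 16) - 0xD800) << 10) + (int(low, 16) - 0xDC00)
    return chr(code_point)

def find_escapes(text):
    """List the matched escape sequences in order of appearance."""
    return [m.group(0) for m in ESCAPE_RE.finditer(text)]

# Raw string, so the backslashes stay literal as in the received message.
raw = r'{"type":"chat","msg":"UserName:\u00a0\ud83d\ude0b \n"}'
print(find_escapes(raw))
```

Passing decode_escape to ESCAPE_RE.sub replaces each match in one pass. The same alternation (pair first, single second) should carry over to a .NET regex, though that port is untested here.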

Splitting an emoji sequence in PowerShell

Question: I have a text box that will be filled with emoji only, with no spaces or other characters of any kind. I need to split these emoji in order to identify them. This is what I have tried: function emoji_to_unicode(){ foreach ($emoji in $textbox.Text) { $unicode = [System.Text.Encoding]::Unicode.GetBytes($emoji) Write-Host $unicode } } Instead of printing the bytes one emoji at a time, the loop runs just once and prints the codes of all the emoji joined together. It's as if all the emoji were a single item. I
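
The underlying idea, sketched in Python rather than PowerShell: a string has to be walked one code point at a time, and each code point outside the BMP corresponds to two UTF-16 code units. The helper names code_points and utf16_units are hypothetical. This sketch does not group multi-code-point emoji (for example sequences joined with a zero-width joiner).

```python
def code_points(text):
    """Return one 'U+XXXX' label per code point in text."""
    return [f"U+{ord(ch):04X}" for ch in text]

def utf16_units(text):
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

sample = "\U0001F600\U0001F60B"  # two emoji, written as escapes
print(code_points(sample))
print([hex(u) for u in utf16_units(sample)])
```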

Python can't encode with surrogateescape

Question: I have a problem with Unicode surrogates encoding in Python (3.4): >>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed If I'm not mistaken, according to the Python documentation: 'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF
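
For comparison, a small sketch that runs the same decode/encode round trip for a given codec and reports either the resulting bytes or the exception. The helper name roundtrip is hypothetical. With UTF-8 the undecodable byte 0xCC survives the round trip. For UTF-16 the asker saw an error on Python 3.4, and the sketch simply shows what the running interpreter does without assuming either outcome.

```python
def roundtrip(data, encoding):
    """Decode then re-encode with surrogateescape; return the bytes or the error."""
    try:
        text = data.decode(encoding, "surrogateescape")
        return text.encode(encoding, "surrogateescape")
    except UnicodeError as exc:
        return exc

# UTF-8: the stray byte becomes U+DCCC on decoding and is restored on encoding.
print(roundtrip(b"\xcc", "utf-8"))

# UTF-16-BE: behaviour may differ; print whatever this interpreter reports.
print(repr(roundtrip(b"\xcc", "utf-16-be")))
```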

What is a Unicode safe replica of String.IndexOf(string input) that can handle Surrogate Pairs?

Question: I am trying to figure out an equivalent to C# string.IndexOf(string) that can handle surrogate pairs in Unicode characters. I am able to get the index when only comparing single characters, like in the code below: public static int UnicodeIndexOf(this string input, string find) { return input.ToTextElements().ToList().IndexOf(find); } public static IEnumerable<string> ToTextElements(this string input) { var e = StringInfo.GetTextElementEnumerator(input); while (e.MoveNext()) { yield return e
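
To illustrate why the two kinds of index differ, here is a rough Python sketch that computes a match position both in code points and in UTF-16 code units (the unit in which .NET strings are indexed). Both function names are hypothetical, and the sketch works on code points only, not on text elements (grapheme clusters) as the C# code above does.

```python
def code_point_index(text, needle):
    """Index of needle counted in code points, or -1 if absent."""
    return text.find(needle)

def utf16_index(text, needle):
    """Index of needle counted in UTF-16 code units, or -1 if absent."""
    i = text.find(needle)
    if i < 0:
        return -1
    # Each code unit is two bytes in UTF-16, so halve the encoded prefix length.
    return len(text[:i].encode("utf-16-le")) // 2

text = "a\U0001F600b"  # 'a', one astral emoji, 'b'
print(code_point_index(text, "b"), utf16_index(text, "b"))
```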

Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)

Question: Splitting a JavaScript string into "characters" can be done trivially, but there are problems if you care about Unicode (and you should care about Unicode). JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16), but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane). To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively. I'm looking for how to split a JS string by
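
The pairing logic can be sketched language-independently. The Python version below takes a list of 16-bit code units (what a JavaScript string holds) and groups them into code points, leaving unpaired surrogates untouched. The function name split_code_points is hypothetical.

```python
def split_code_points(units):
    """Group UTF-16 code units into code points, pairing surrogates."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine high and low surrogate into one astral code point.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit or unpaired surrogate: keep as-is.
            result.append(unit)
            i += 1
    return result

print([hex(cp) for cp in split_code_points([0x61, 0xD83D, 0xDE00, 0xD83D])])
```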

Python re: Don't split Unicode characters into surrogate pairs while matching

Question: Does anyone know whether it is possible to stop a regex from splitting code points into surrogate pairs while matching? See the following example. How it is now: $ te = u'\U0001f600\U0001f600' $ flags1 = regex.findall(".", te, re.UNICODE) $ flags1 >>> [u'\ud83d', u'\ude00', u'\ud83d', u'\ude00'] What I want: $ te = u'\U0001f600\U0001f600' $ flags1 = regex.findall(".", te, re.UNICODE) $ flags1 >>> [u'\U0001f600', u'\U0001f600'] The reason I need this is that I want to iterate over a unicode string and get
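
One common workaround, sketched here for Python 3 with the standard re module: try a high surrogate followed by a low surrogate before falling back to a single character. On Python 3.3+ astral characters are already single code points, so the pair branch only matters for strings that contain literal surrogates. The names PAIR_OR_CHAR and split_chars are hypothetical.

```python
import re

# Either a high surrogate followed by a low surrogate, or any single character.
PAIR_OR_CHAR = re.compile(r'[\ud800-\udbff][\udc00-\udfff]|.', re.DOTALL)

def split_chars(text):
    """Split text into matches, keeping surrogate pairs together."""
    return PAIR_OR_CHAR.findall(text)

# ascii() avoids printing raw surrogates, which some terminals cannot encode.
print(ascii(split_chars('\U0001f600\U0001f600')))
print(ascii(split_chars('a\ud83d\ude00')))
```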

What are surrogate characters in UTF-8?

Question: I have a strange validation program that validates whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain with sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regexp function preg_match fails during these comparisons and reports that DC00-DFFF characters are not allowed in this function. From Wikipedia I
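
As a quick illustration in Python (not PHP), code points in the surrogate range U+D800 to U+DFFF are reserved for UTF-16 and a strict UTF-8 encoder refuses them, which is consistent with the error described above. The helper names is_surrogate and encodes_to_utf8 are hypothetical.

```python
def is_surrogate(code_point):
    """True if code_point lies in the UTF-16 surrogate range."""
    return 0xD800 <= code_point <= 0xDFFF

def encodes_to_utf8(code_point):
    """True if Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

for cp in (0xD7FF, 0xD800, 0xDC00, 0xDFFF, 0xE000):
    print(f"U+{cp:04X}", is_surrogate(cp), encodes_to_utf8(cp))
```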

Unicode surrogate pairs

Question: Say I have a surrogate pair. For example: \u306f\u30fc Is there a function I can use to print the character to the screen? Answer 1: If you want to do it manually: echo chr(0x30) . chr(0x6f) . chr(0x30) . chr(0xfc); If you have the string, you could always do: $callback = function($match) { return chr(hexdec($match[1])) . chr(hexdec($match[2])); }; preg_replace_callback('#\\\\u([0-9a-f]{2})([0-9a-f]{2})#', $callback, $string); Or, if PHP < 5.3: $callback = create_function('$match', 'return chr

Checking for illegal surrogates in Python 3 strings

Question: Specifically in Python 3.3 and above, is it sufficient to check for orphan surrogates by using the simple match: re.search(r'[\uD800-\uDFFF]', s) This is based on the assumption that all legal surrogates would have been represented as astral code points and thus would not match, leaving only the illegal surrogates. Or are there caveats and edge cases one needs to be aware of? Answer 1: Yes, that's correct. Code units 0xD800–0xDFFF don't represent valid characters in wide Unicode strings, and in Python 3.3+
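
A short sketch of the check from the question, wrapped in a helper for Python 3.3+. The names LONE_SURROGATE and has_lone_surrogate are hypothetical. Note that a literal such as '\ud83d\ude00' is stored as two separate surrogate code points in Python 3, so it is reported too.

```python
import re

# Any code point in the surrogate range counts as an orphan in a Python 3 str.
LONE_SURROGATE = re.compile(r'[\ud800-\udfff]')

def has_lone_surrogate(text):
    """True if text contains at least one surrogate code point."""
    return LONE_SURROGATE.search(text) is not None

print(has_lone_surrogate('\U0001f600'))    # astral character
print(has_lone_surrogate('abc\udccc'))     # e.g. left behind by surrogateescape
print(has_lone_surrogate('\ud83d\ude00'))  # two surrogates written as a literal
```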