surrogate-pairs

C# - Regular expression to find a surrogate pair of a unicode codepoint from any string?

删除回忆录丶 submitted on 2021-02-08 04:27:55
Question: I am trying to parse a message that may contain emojis. An example message looks like: {"type":"chat","msg":"UserName:\u00a0\ud83d\ude0b \n"} The regex should match \u00a0 as a single character and \ud83d\ude0b as a pair. I have a regex that can pull out individual codes, but not the pairs that make up a full emoji: \\u[a-z0-9]{4} Is there a clean way to account for any number of emojis in a sentence so that I can replace each surrogate pair using the function I have? Thanks!
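
One possible approach, sketched here in Python rather than C#, is to try a high-surrogate escape followed by a low-surrogate escape first and fall back to a single \uXXXX escape. The names ESCAPE_RE and decode_escape are illustrative assumptions, not part of the original question; the same alternation should carry over to other regex engines, but that is not verified here.

```python
import re

# Try "high surrogate + low surrogate" first, then any single \uXXXX escape.
ESCAPE_RE = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"  # pair
    r"|\\u([0-9a-fA-F]{4})"                                            # single
)


def decode_escape(match):
    """Turn one matched escape (or escape pair) into the character it names."""
    high, low, single = match.groups()
    if high is not None:
        hi = int(high, 16)
        lo = int(low, 16)
        # Standard UTF-16 formula for combining a surrogate pair.
        return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
    return chr(int(single, 16))


# Raw string: the backslashes are literal, as they would be in the received text.
raw = r'{"type":"chat","msg":"UserName:\u00a0\ud83d\ude0b \n"}'
decoded = ESCAPE_RE.sub(decode_escape, raw)
```

Here the callback stands in for the replacement function mentioned in the question. An unpaired surrogate escape falls through to the single-escape branch, so it would have to be handled separately if that matters.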

Splitting an emoji sequence in PowerShell

陌路散爱 submitted on 2020-06-26 14:12:47
Question: I have a text box that will be filled with emoji only, with no spaces or other characters of any kind. I need to split these emoji in order to identify them. This is what I have tried: function emoji_to_unicode(){ foreach ($emoji in $textbox.Text) { $unicode = [System.Text.Encoding]::Unicode.GetBytes($emoji) Write-Host $unicode } } Instead of printing the bytes one emoji at a time, the loop runs just once and prints the codes of all the emoji joined together. It is as if all the emoji were a single item. I
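
As a rough illustration of the intended result, here is a Python sketch (not PowerShell) that walks a string one code point at a time and lists the UTF-16 code units of each. The function name emoji_to_code_units is a hypothetical choice. The sketch splits by code point only, so emoji built from several code points (for example sequences joined by a zero-width joiner) would be reported as separate items.

```python
def emoji_to_code_units(text):
    """Return one (character, [UTF-16 code units]) tuple per code point."""
    result = []
    for ch in text:  # Python 3 iterates a str by code point
        raw = ch.encode("utf-16-le")
        # Read the encoded bytes back two at a time as 16-bit code units.
        units = [
            int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)
        ]
        result.append((ch, units))
    return result


for ch, units in emoji_to_code_units("\U0001F60B\U0001F600"):
    print(ch, [f"{u:04X}" for u in units])
```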

Python can't encode with surrogateescape

谁说我不能喝 submitted on 2020-01-16 01:36:47
Question: I have a problem encoding Unicode surrogates in Python (3.4): >>> b'\xCC'.decode('utf-16_be', 'surrogateescape').encode('utf-16_be', 'surrogateescape') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'utf-16-be' codec can't encode character '\udccc' in position 0: surrogates not allowed If I'm not mistaken, according to the Python documentation: 'surrogateescape': On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF
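
For comparison, a minimal sketch of the round trip that surrogateescape is documented to provide, checked here with UTF-8. The helper round_trips is a hypothetical name. It only reports whether a given codec round-trips; it makes no claim about what a particular Python version does for UTF-16.

```python
def round_trips(data, encoding):
    """Return True if data survives decode + encode with surrogateescape."""
    try:
        text = data.decode(encoding, "surrogateescape")
        return text.encode(encoding, "surrogateescape") == data
    except UnicodeError:
        # Covers both UnicodeDecodeError and UnicodeEncodeError.
        return False


# The undecodable byte 0xCC becomes the lone surrogate U+DCCC and back again.
escaped = b"\xcc".decode("utf-8", "surrogateescape")
utf8_ok = round_trips(b"\xcc", "utf-8")
```

Calling round_trips(b"\xcc", "utf-16-be") on the interpreter in question gives a quick check of whether the UTF-16 codec behaves the same way there.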

What is a Unicode safe replica of String.IndexOf(string input) that can handle Surrogate Pairs?

孤街醉人 submitted on 2020-01-11 11:51:11
Question: I am trying to figure out an equivalent of C# string.IndexOf(string) that can handle surrogate pairs in Unicode characters. I am able to get the index when comparing only single characters, as in the code below: public static int UnicodeIndexOf(this string input, string find) { return input.ToTextElements().ToList().IndexOf(find); } public static IEnumerable<string> ToTextElements(this string input) { var e = StringInfo.GetTextElementEnumerator(input); while (e.MoveNext()) { yield return e
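
The single-element limitation comes from looking up the whole search string as one list item. A hedged Python sketch of the missing step: split both strings into elements, then search for the sub-sequence. The function index_of_elements is a hypothetical name, and Python code points are used as a stand-in for .NET text elements, which is an assumption (text elements can span several code points).

```python
def index_of_elements(haystack, needle):
    """Index, counted in elements, of the first occurrence of needle, or -1.

    Both arguments are sequences of already-split elements.
    """
    hay = list(haystack)
    ned = list(needle)
    if not ned:
        return 0
    # Slide a window of len(ned) elements across hay and compare.
    for start in range(len(hay) - len(ned) + 1):
        if hay[start:start + len(ned)] == ned:
            return start
    return -1


# The emoji counts as one element, so "bc" starts at element index 2.
position = index_of_elements(list("a\U0001F600bc"), list("bc"))
```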

Split JavaScript string into array of codepoints? (taking into account “surrogate pairs” but not “grapheme clusters”)

折月煮酒 submitted on 2019-12-28 16:45:33
Question: Splitting a JavaScript string into "characters" can be done trivially, but there are problems if you care about Unicode (and you should care about Unicode). JavaScript natively treats characters as 16-bit entities (UCS-2 or UTF-16), but this does not allow for Unicode characters outside the BMP (Basic Multilingual Plane). To deal with Unicode characters beyond the BMP, JavaScript must take into account "surrogate pairs", which it does not do natively. I'm looking for how to split a JS string by
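
The pairing logic itself is language-independent. A minimal Python sketch follows, operating on a list of 16-bit code units such as a JavaScript string would hold; the name split_code_units is hypothetical. Unpaired surrogates are passed through unchanged, which is one possible policy among several.

```python
def split_code_units(units):
    """Group a sequence of UTF-16 code units into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the pair into one supplementary-plane code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or a lone surrogate kept as-is.
            code_points.append(unit)
            i += 1
    return code_points


points = split_code_units([0x61, 0xD83D, 0xDE00, 0xD83D])
```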

Python re: don't split Unicode characters into surrogate pairs while matching

浪子不回头ぞ submitted on 2019-12-24 19:37:32
Question: Does anyone know whether it is possible to stop regex from splitting code points into surrogate pairs while matching? See the following example. How it is now: $ te = u'\U0001f600\U0001f600' $ flags1 = regex.findall(".", te, re.UNICODE) $ flags1 >>> [u'\ud83d', u'\ude00', u'\ud83d', u'\ude00'] What I want: $ te = u'\U0001f600\U0001f600' $ flags1 = regex.findall(".", te, re.UNICODE) $ flags1 >>> [u'\U0001f600', u'\U0001f600'] The reason I need this is that I want to iterate over a Unicode string and get
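
The output shown in the question is what a narrow Python 2 build produces, where strings are stored as UTF-16 code units. As a point of comparison, a small sketch of the same match on Python 3.3 or later, where a str is a sequence of code points; it uses the standard re module rather than the third-party regex module from the question.

```python
import re

te = "\U0001f600\U0001f600"

# On Python 3.3+, "." matches one whole code point, including astral ones.
matches = re.findall(".", te)

# The string length is also counted in code points, not UTF-16 code units.
length = len(te)
```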

What are surrogate characters in UTF-8?

烈酒焚心 submitted on 2019-12-24 11:16:04
Question: I have a strange validation program that checks whether a UTF-8 string is a valid host name (the Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It compares each subdomain with sets of characters defined by their hex byte representation. Two such sets are D800-DB7F and DC00-DFFF. The PHP regex comparison function preg_match fails during these comparisons and reports that DC00-DFFF characters are not allowed in this function. From Wikipedia I
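
For background: code points U+D800 through U+DFFF are reserved for UTF-16 surrogates, and well-formed UTF-8 never encodes them. A short Python sketch (not PHP) that shows this; the helper names is_surrogate and utf8_or_none are illustrative assumptions.

```python
def is_surrogate(code_point):
    """True for code points in the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def utf8_or_none(ch):
    """Encode one character as strict UTF-8, or return None if that fails."""
    try:
        return ch.encode("utf-8")
    except UnicodeEncodeError:
        # Python's strict UTF-8 codec refuses surrogate code points.
        return None


lone = utf8_or_none("\udc00")        # a surrogate on its own: not encodable
astral = utf8_or_none("\U0001F600")  # a real astral character: four bytes
```

An astral character is stored in UTF-8 directly as a four-byte sequence, so a UTF-8 validator has no need for character sets in the surrogate range.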

Unicode surrogate pairs

…衆ロ難τιáo~ submitted on 2019-12-22 00:53:56
Question: Say I have a surrogate pair. For example: \u306f\u30fc Is there a function I can use to print the character to the screen? Answer 1: If you want to do it manually: echo chr(0x30) . chr(0x6f) . chr(0x30) . chr(0xfc); If you have the string, you could always do: $callback = function($match) { return chr(hexdec($match[1])) . chr(hexdec($match[2])); }; preg_replace_callback('#\\\\u([0-9a-f]{2})([0-9a-f]{2})#', $callback, $string); Or, for PHP < 5.3: $callback = create_function('$match', 'return chr
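
The answer emits each escape as two raw bytes, that is, as UTF-16BE. A hedged Python sketch of the same idea (not PHP) that also decodes those bytes into text; decode_u_escapes is a hypothetical name, and it collects only the \uXXXX escapes, ignoring any other text in the input. Note that \u306f and \u30fc fall outside the surrogate range D800 to DFFF, so the example in the question is two ordinary BMP characters.

```python
import re


def decode_u_escapes(text):
    """Collect the \\uXXXX escapes in text and decode them as UTF-16BE."""
    units = [int(h, 16) for h in re.findall(r"\\u([0-9a-fA-F]{4})", text)]
    # Each escape is one 16-bit code unit, written big-endian.
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    # Decoding as UTF-16 joins surrogate pairs; a lone surrogate raises
    # UnicodeDecodeError.
    return raw.decode("utf-16-be")


print(decode_u_escapes(r"\u306f\u30fc"))
```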

Checking for illegal surrogates in Python 3 strings

大城市里の小女人 submitted on 2019-12-20 02:37:14
Question: Specifically in Python 3.3 and above, is it sufficient to check for orphan surrogates with the simple match re.search(r'[\uD800-\uDFFF]', s)? This is based on the assumption that all legal surrogates would have been represented as astral code points and thus would not match, leaving only the illegal surrogates. Or are there caveats and edge cases one needs to be aware of? Answer 1: Yes, that's correct. Code units 0xD800–0xDFFF don't represent valid characters in wide Unicode strings, and in Python 3.3+
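
A minimal sketch of the check described in the question, wrapped in a helper whose name has_lone_surrogate is an assumption. It relies on Python 3.3+ storing strings as code points, so a correctly paired surrogate never appears in a str as two separate items.

```python
import re

# Any code point in the surrogate range is, in a Python 3.3+ str, an orphan.
LONE_SURROGATE_RE = re.compile(r"[\uD800-\uDFFF]")


def has_lone_surrogate(s):
    """True if s contains at least one surrogate code point."""
    return LONE_SURROGATE_RE.search(s) is not None


clean = has_lone_surrogate("a\U0001F600")  # astral character, no surrogates
dirty = has_lone_surrogate("a\ud83d")      # high surrogate without a partner
```

One edge case worth knowing: a literal such as "\ud83d\ude00" in Python 3 source is two separate surrogate code points, not one astral character, so the check flags it as well.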