combining-marks

Python isalpha doesn't handle Unicode combining marks properly?

烈酒焚心 submitted on 2019-12-22 17:52:53
Question: I encountered a weird Ukrainian word, Кири́лл . I converted it to unicode and tested it with isalpha, which returned False. I looked around and found that this word contains a character named 'combining acute accent'. So the letter и́ is actually a combination of two characters: и and ́ . If I understood it correctly, combining marks (like this acute accent) are intended only to modify other characters. So isalpha should recognize this string as a word. Am I wrong? Is there any way to get correct
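A minimal sketch of one possible workaround, not necessarily what the original answers propose: treat a string as a word if it is made of letters, each optionally followed by combining marks (General Category starting with `M`). The helper name `is_word` is hypothetical.

```python
import unicodedata


def is_word(text: str) -> bool:
    """Return True if text is letters, each optionally followed by combining marks."""
    seen_letter = False
    for ch in text:
        if ch.isalpha():
            seen_letter = True
        elif seen_letter and unicodedata.category(ch).startswith("M"):
            # A combining mark (Mn, Mc or Me) attached to a preceding letter.
            continue
        else:
            return False
    # An empty string, or one with no letters at all, is not a word.
    return seen_letter


# "Кири́лл" spelled with an explicit U+0301 COMBINING ACUTE ACCENT.
word = "\u041a\u0438\u0440\u0438\u0301\u043b\u043b"
print(word.isalpha())  # False
print(is_word(word))   # True
```

Normalizing to NFC first helps only when a precomposed letter exists (as for é); it does not help for this word, so the category check is what does the work here.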

What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode?

冷暖自知 submitted on 2019-11-29 07:11:32
Question: What is the difference between ‘combining characters’ and ‘grapheme extenders’ in Unicode? They seem to do the same thing, as far as I can tell – although the set of grapheme extenders is larger than the set of combining characters. I’m clearly missing something here. Why the distinction? The Unicode Standard, Chapter 3, D52 Combining character: A character with the General Category of Combining Mark (M). Combining characters consist of all characters with the General Category values of

How can Z͎̠͗ͣḁ̵͙̑l͖͙̫̲̉̃ͦ̾͊ͬ̀g͔̤̞͓̐̓̒̽o͓̳͇̔ͥ text be prevented?

▼魔方 西西 submitted on 2019-11-29 00:57:18
Question: I've read about how Zalgo text works, and I'm looking to learn how chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to: a) either be stripped, assuming chat participants are to use only languages that don't require combining marks (i.e. you could write "fiancé" with a combining mark, but you'd be a bit Zalgo'ed yourself if you insisted on doing so); or, b) be reduced to a maximum of 8 consecutive
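Both options from the question can be sketched with one function, assuming that "combining character" means any code point whose General Category starts with `M`. The function name `limit_combining_marks` and its `max_marks` parameter are hypothetical, chosen for illustration.

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 8) -> str:
    """Keep at most max_marks consecutive combining marks; drop the rest."""
    kept = []
    run = 0  # length of the current run of combining marks
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # excess mark: skip it
        else:
            run = 0
        kept.append(ch)
    return "".join(kept)


zalgo = "a" + "\u0301" * 20
print(len(limit_combining_marks(zalgo)))          # 9 (one letter plus 8 marks)
print(limit_combining_marks(zalgo, max_marks=0))  # a

# Option (a) without damaging "fiancé": compose to NFC first, then strip.
composed = unicodedata.normalize("NFC", "fiance\u0301")
print(limit_combining_marks(composed, max_marks=0) == "fianc\u00e9")  # True
```

Setting `max_marks=0` gives option (a), stripping every mark; composing to NFC beforehand preserves letters that have a precomposed form. Scripts that rely on combining marks, such as Thai or Khmer, would be damaged by full stripping, so a small positive limit is the gentler choice.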

detect any combining character in Java

瘦欲@ submitted on 2019-11-28 09:42:40
Question: I am looking for a way to detect whether a character in a Java string "is a combining character" or not. For instance, String khmerCombiningVowel = new String(new byte[]{(byte) 0xe1,(byte) 0x9f,(byte) 0x80}, "UTF-8"); // unicode 17c0 represents a combining Khmer vowel sign. I have tried the "\\p{InCombiningDiacriticalMarks}" regex but it doesn't seem to apply to these particular combining characters. Or even if there is some comprehensive list of all Unicode combining character blocks I might be able
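The question is about Java, but the idea is easier to see in a short Python sketch: test the General Category (Mn, Mc, Me) rather than a single block such as Combining Diacritical Marks, which covers only U+0300 to U+036F. The helper name `is_combining_mark` is hypothetical. In Java, the analogous regex class would be `\p{M}`.

```python
import unicodedata

# General Category values that make up "Combining Mark (M)".
MARK_CATEGORIES = {"Mn", "Mc", "Me"}


def is_combining_mark(ch: str) -> bool:
    """Return True if the single character ch is a combining mark."""
    return unicodedata.category(ch) in MARK_CATEGORIES


# Same bytes as in the question: UTF-8 for U+17C0.
khmer_vowel = bytes([0xE1, 0x9F, 0x80]).decode("utf-8")
print(hex(ord(khmer_vowel)))           # 0x17c0
print(is_combining_mark(khmer_vowel))  # True
print(is_combining_mark("\u0301"))     # True
print(is_combining_mark("a"))          # False
```

Note that `unicodedata.combining()` returns the canonical combining class, which can be 0 for some marks, so it is not a reliable substitute for the category check.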

What's up with these Unicode combining characters and how can we filter them?

醉酒当歌 submitted on 2019-11-28 02:48:49
กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ These recently showed up in Facebook comment sections. How can we sanitize this? What's up with these Unicode characters? That's a character with a series of combining characters. Because the combining
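One way to sanitize input like the sample above, offered as a sketch rather than as the approach from the original answer: collapse any run of the same combining mark down to a single occurrence, so that legitimate Thai text keeps its marks. The function name `collapse_repeated_marks` is hypothetical.

```python
import unicodedata


def collapse_repeated_marks(text: str) -> str:
    """Collapse runs of an identical combining mark into one occurrence."""
    kept = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and kept and kept[-1] == ch:
            continue  # same mark repeated: drop the duplicate
        kept.append(ch)
    return "".join(kept)


# U+0E01 THAI CHARACTER KO KAI followed by twenty U+0E34 THAI CHARACTER SARA I.
stacked = "\u0e01" + "\u0e34" * 20
cleaned = collapse_repeated_marks(stacked)
print(len(stacked), len(cleaned))  # 21 2
```

This leaves different marks on the same base character untouched, which ordinary Thai words may need; a separate cap on the total number of marks per base character could be added if alternating marks are also abused.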

What's up with these Unicode combining characters and how can we filter them?

▼魔方 西西 submitted on 2019-11-26 18:46:45
Question: กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ These recently showed up in Facebook comment sections. How can we sanitize this? Answer 1: What's up with