How to protect against diacritics such as Zalgo text

Submitted by 狂风中的少年 on 2019-11-28 04:20:47
bobince

is there even a limit?!

Not intrinsically in Unicode. There is the concept of a 'Stream-Safe' format in UAX #15 that sets a limit of 30 combiners... Unicode strings in general are not guaranteed to be Stream-Safe, but this could certainly be taken as a sign that the Unicode Consortium doesn't intend to standardise new characters that would require a grapheme cluster longer than that.

30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hakṣhmalawarayaṁ at 1 base plus 8 combiners, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.

If you only care about common Western European languages you can probably bring that down to 2, so a reasonable compromise lies somewhere between those two limits.
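As a rough sketch of that idea (Python purely for illustration; `limit_combining_marks` and `max_marks` are made-up names, and treating every code point whose general category starts with "M" as a combiner is my own assumption, not something UAX #15 prescribes):

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 8) -> str:
    """Normalise to NFD, then drop any combining mark beyond max_marks in a row."""
    kept = []
    run = 0  # combining marks seen since the last non-mark character
    for ch in unicodedata.normalize("NFD", text):
        # Assumption: categories Mn, Mc and Me all count as "combiners".
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # excess mark: skip it
        else:
            run = 0
        kept.append(ch)
    return "".join(kept)
```

Note that the result is left in NFD; normalise it back to NFC afterwards if you store composed text. With `max_marks=2` you get the stricter Western European limit.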

I think I found a solution using NormalizationForm.FormC instead of NormalizationForm.FormD. According to MSDN:

[FormC] Indicates that a Unicode string is normalized using full canonical decomposition, followed by the replacement of sequences with their primary composites, if possible.

I take that to mean that it decomposes characters to their base form, then recomposes them based on a set of rules that remain consistent. I gather this is mainly useful for comparison purposes, but in my case it works perfectly. Characters like ü, é, and Ä are decomposed and recomposed accurately, while the bogus combining characters fail to recompose and so stay separate from their base character, where they can be stripped.
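A minimal sketch of the same approach, in Python rather than .NET (`unicodedata.normalize("NFC", ...)` stands in for `NormalizationForm.FormC`; `strip_leftover_marks` is a made-up name). It assumes that any mark still standing on its own after composition is unwanted, which holds for Western European text but would damage scripts such as Arabic, Hebrew or Thai that rely on marks with no precomposed form:

```python
import unicodedata


def strip_leftover_marks(text: str) -> str:
    """Compose to NFC, then drop every combining mark that did not recompose."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in composed
        # Assumption: a code point in category M* after NFC is "bogus".
        if not unicodedata.category(ch).startswith("M")
    )
```

A decomposed `u` + U+0308 comes back as the single character ü, while a stack of extra accents on top of it is discarded.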

Matas Vaitkevicius

Here's a regex that should fish out all the Zalgo including ones bypassed in 'normal' range.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})

The hardest bit is identifying them; once you have done that, there's a multitude of solutions.
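For instance, one possible way to use the pattern from Python (a sketch only; `has_zalgo` and `strip_zalgo` are made-up names, and every range is written with a plain ASCII hyphen, since an en dash inside a character class would be matched literally instead of forming a range). Be aware that `\u0e00-\u0eff` spans the whole Thai and Lao blocks, base letters included, so as written the pattern also matches ordinary Thai or Lao words:

```python
import re

# The character class from the regex above, split over several lines.
ZALGO_RUN = re.compile(
    r"[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F"
    r"\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED"
    r"\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37"
    r"\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,}"
)


def has_zalgo(text: str) -> bool:
    """True if the text contains a run of two or more listed marks."""
    return ZALGO_RUN.search(text) is not None


def strip_zalgo(text: str) -> str:
    """Remove every run of two or more listed marks; single marks are kept."""
    return ZALGO_RUN.sub("", text)
```

Because of the `{2,}` quantifier a lone accent survives, but a run of two or more is removed whole, including its first mark. Whether to strip, reject or merely flag such input is up to you.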

Hope this saves you some time.
