What's up with these Unicode combining characters and how can we filter them?

一整个雨季 2020-12-02 05:00

กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิ

4 Answers
  •  误落风尘
    2020-12-02 05:25

    OK, this one took me a while to figure out. I was under the impression that the combining characters used to produce zalgo were limited to the ranges listed in the wiki, so I expected the following regex to catch the freaks:

    ([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,})
    

    and it didn't work...

    The catch is that the list in the wiki does not cover the full range of combining characters.

    What gave me a hint is that "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16) returns "e49", which is not within any of the combining ranges above. U+0E49 is a tone mark in the Thai block (U+0E00-U+0E7F), not in a 'Private use' area.
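
The same check can be run over a whole string from a JavaScript console. This is only a minimal sketch: `describeCodePoints` is a made-up helper name, and it assumes an engine that supports Unicode property escapes (`\p{Mn}` with the `u` flag, ES2018 or later).

```javascript
// Hypothetical helper (not from the original answer): list each code point
// in a string with its hex value and whether Unicode classifies it as a
// nonspacing mark (general category Mn).
function describeCodePoints(text) {
  return Array.from(text, (ch) => ({
    hex: ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    isNonspacingMark: /\p{Mn}/u.test(ch),
  }));
}

// The block list from the first regex, written with plain ASCII hyphens.
const wikiRanges = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]/;

const sample = "\u0E01\u0E49\u0E49"; // Thai letter KO KAI followed by two U+0E49 marks
console.log(describeCodePoints(sample));
console.log(wikiRanges.test("\u0E49")); // false: U+0E49 is outside every listed block
```

The second log line shows why the first regex misses the sample: the mark is a nonspacing mark, but it sits outside the listed blocks.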

    In C# they fall under UnicodeCategory.NonSpacingMark, and the following script flushes them out:

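        using System.Globalization;
        using System.IO;
        using System.Linq;
        using NUnit.Framework;

        [TestFixture]
        public class ZalgoTests // fixture name is illustrative
        {
            [Test]
            public void IsZalgo()
            {
                // Unicode categories treated as stacking marks.
                var zalgo = new[] { UnicodeCategory.NonSpacingMark };

                File.Delete("IsModifyLike.html");
                File.AppendAllText("IsModifyLike.html", "<table>");
                for (var i = 0; i <= char.MaxValue; i++)
                {
                    var c = (char)i;
                    if (zalgo.Contains(char.GetUnicodeCategory(c)))
                    {
                        // One row per mark: hex code, the character, its category,
                        // and the mark stacked three times on "A".
                        File.AppendAllText("IsModifyLike.html",
                            string.Format("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n",
                                i.ToString("X"), c, char.GetUnicodeCategory(c), i));
                    }
                }
                File.AppendAllText("IsModifyLike.html", "</table>");
            }
        }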

    By looking at the generated table you should be able to see which ones stack. One range that is missing from the wiki is 06D6-06DC; another is 0730-0749.
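
If you would rather get the ranges as data than eyeball a table, roughly the same survey can be sketched in JavaScript. This is an assumption-laden sketch, not a port of the C# test: `nonspacingMarkRanges` and `hex` are made-up names, it relies on `\p{Mn}` (ES2018 Unicode property escapes), and the exact output depends on the Unicode version your engine ships.

```javascript
// Hypothetical helper: scan the Basic Multilingual Plane and collapse
// consecutive nonspacing marks (general category Mn) into [start, end] ranges.
function nonspacingMarkRanges() {
  const isMark = /^\p{Mn}$/u;
  const ranges = [];
  let start = null;
  // Run one step past U+FFFF so a range still open at the end gets closed.
  for (let cp = 0; cp <= 0x10000; cp++) {
    const hit = cp <= 0xFFFF && isMark.test(String.fromCharCode(cp));
    if (hit && start === null) {
      start = cp;
    } else if (!hit && start !== null) {
      ranges.push([start, cp - 1]);
      start = null;
    }
  }
  return ranges;
}

// Format a code point as four uppercase hex digits.
const hex = (n) => n.toString(16).toUpperCase().padStart(4, "0");

// Print each range in a form that can be pasted into a character class.
for (const [from, to] of nonspacingMarkRanges()) {
  console.log(from === to ? `\\u${hex(from)}` : `\\u${hex(from)}-\\u${hex(to)}`);
}
```

The printed list is a starting point for a character class; which of those marks you actually want to limit is still a judgment call.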

    UPDATE:

    Here's an updated regex that should fish out all the zalgo, including the ones that slipped past the 'normal' ranges.

    ([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})
    

    The hardest bit is identifying them; once you have done that, there is a multitude of solutions, including some good ones above.
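
For what it's worth, here is one possible filter, sketched in JavaScript. It swaps the hand-maintained range list for the `\p{Mn}` property escape (ES2018), which corresponds to the NonSpacingMark category used above. `limitCombiningMarks` and its `maxMarks` parameter are made-up names, and the default of 2 is an assumption chosen so that ordinary text such as a Thai vowel plus tone mark survives. One caveat about the range-based regex: `\u0e00-\u0eff` spans the whole Thai and Lao blocks, base letters included, so with `{2,}` it would also match plain Thai words.

```javascript
// Hypothetical helper: keep at most `maxMarks` consecutive nonspacing marks
// (general category Mn) and drop the rest of each run.
function limitCombiningMarks(text, maxMarks = 2) {
  // Capture the first `maxMarks` marks of a run, then match the excess marks.
  const run = new RegExp(`(\\p{Mn}{${maxMarks}})\\p{Mn}+`, "gu");
  return text.replace(run, "$1");
}

const zalgo = "\u0E01" + "\u0E49".repeat(20); // one letter carrying 20 stacked marks
console.log(limitCombiningMarks(zalgo).length); // 3: the letter plus two marks
console.log(limitCombiningMarks("\u0E17\u0E35\u0E48") === "\u0E17\u0E35\u0E48"); // true: normal Thai is untouched
```

Passing `maxMarks = 0` strips every run of marks, which is harsher and will damage legitimate accented or Thai text.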

    Hope this saves you some time.
