Can PHP detect 4-byte encoded UTF-8 characters?

隐瞒了意图╮ 2020-12-13 07:58

I am using tables with the utf8 charset on a MySQL 5.1 server, which does not support the utf8mb4 encoding in tables. When inserting 4-byte encoded UTF-8 characters like \"

2 answers
  •  半阙折子戏
    2020-12-13 08:22

    This should work:

    if (max(array_map('ord', str_split($string))) >= 240) 
    

    The rationale is that code points up to and including U+FFFF are encoded in at most three bytes, the longest form being 1110xxxx 10xxxxxx 10xxxxxx. Higher code points take the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 (0xF0) or higher. In valid UTF-8 no byte of a shorter sequence reaches that value, so any such byte in the string indicates a 4-byte sequence.
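
    As a rough illustration of that bit pattern, the sketch below tests each byte against 11110xxx with a mask instead of comparing against 240. The function name `hasFourByteUtf8` is made up for this example, and the code assumes the input is valid UTF-8.

    ```php
    <?php
    // Hypothetical helper (name is illustrative, not from the answer above).
    // Returns true if any byte matches the 4-byte lead pattern 11110xxx.
    // Assumes $string is valid UTF-8.
    function hasFourByteUtf8(string $string): bool
    {
        $length = strlen($string);
        for ($i = 0; $i < $length; $i++) {
            // 0xF8 keeps the top five bits; 0xF0 is 11110000
            if ((ord($string[$i]) & 0xF8) === 0xF0) {
                return true;
            }
        }
        return false;
    }

    var_dump(hasFourByteUtf8("caf\u{E9}"));  // 2-byte character only: false
    var_dump(hasFourByteUtf8("\u{20AC}"));   // 3-byte character: false
    var_dump(hasFourByteUtf8("\u{1F600}"));  // 4-byte character: true
    ```

    Unlike the one-liner, this version stops at the first match and simply returns false for an empty string.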

    If you want to remove the 4-byte characters, this will do:

    $string = preg_replace_callback('/./u', function (array $match) {
        // Drop any character whose UTF-8 encoding is 4 bytes long
        return strlen($match[0]) >= 4 ? '' : $match[0];
    }, $string);
    

    Though there may be a more elegant regex way to express high codepoints directly.
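
    One possible way to do that, as a sketch: with the `u` modifier, PCRE accepts code point escapes of the form `\x{...}`, so the range U+10000 to U+10FFFF (everything that needs four bytes) can be matched directly. The function names below are invented for this example, and valid UTF-8 input is assumed; on invalid input the `preg_*` functions signal an error instead of matching.

    ```php
    <?php
    // Hypothetical helpers (names are illustrative only).
    // Both assume $string is valid UTF-8.

    // Removes every code point above U+FFFF; returns null if PCRE reports an error.
    function stripSupplementary(string $string): ?string
    {
        return preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $string);
    }

    // Returns true if the string contains at least one code point above U+FFFF.
    function containsSupplementary(string $string): bool
    {
        return preg_match('/[\x{10000}-\x{10FFFF}]/u', $string) === 1;
    }

    var_dump(containsSupplementary("a\u{1F600}b"));        // true
    var_dump(stripSupplementary("a\u{1F600}b\u{20AC}"));   // 4-byte character removed, 3-byte one kept
    ```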
