Check if UTF-8 character requires maximum three bytes

Submitted by 拜拜、爱过 on 2020-01-03 06:32:28

Question


I need to save user input to a database column that uses the utf8_general_ci collation, whose utf8 character set allows at most three bytes per code point. If the input contains characters that use four bytes (for example, emojis), it is not saved to the column. I need to check that the input contains only characters that use at most three bytes. I know I could just change the column's character set to utf8mb4, but I don't want to do that.

So how can I do something like this:

if (maxThreeBytes("😄")) { // should print "fail" for this input
    echo "success";
} else {
    echo "fail";
}

More examples:

maxThreeBytes("a") => true
maxThreeBytes("ščřžý") => true
maxThreeBytes("test this") => true
maxThreeBytes("😄😄") => false
maxThreeBytes("hello 😄") => false
maxThreeBytes("test this") => true
maxThreeBytes("test 😭 this") => false

Answer 1:


Assuming that $str is UTF-8 encoded:

function maxThreeBytes($str) {
    // A lead byte of 0xF0 or higher followed by three more high bytes marks a four-byte sequence.
    return !preg_match('@[\\xf0-\\xff][\\x80-\\xff][\\x80-\\xff][\\x80-\\xff]@', $str);
}

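It checks whether the string contains a four-byte sequence of the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (in binary), which is how UTF-8 encodes code points from U+10000 to U+10FFFF. The byte ranges in the pattern are slightly broader than that form, which makes no difference for valid UTF-8 input.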
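
As an alternative sketch, the same check can be written against code points instead of raw bytes by using PCRE's /u modifier. The helper name fitsInThreeBytes is hypothetical and not part of the original answer; it assumes the input is meant to be UTF-8 and, as a design choice, treats invalid UTF-8 as a failure.

```php
<?php
// Hypothetical helper, not from the original answer.
// Returns true only if $str is valid UTF-8 and contains no code point
// above U+FFFF (i.e. nothing that needs four bytes in UTF-8).
function fitsInThreeBytes(string $str): bool
{
    // With /u, PCRE treats the subject as UTF-8 and matches whole code points.
    $result = preg_match('/[\x{10000}-\x{10FFFF}]/u', $str);

    // 0     => no supplementary-plane character found
    // 1     => at least one four-byte character found
    // false => matching failed, e.g. because $str is not valid UTF-8
    return $result === 0;
}
```

For example, fitsInThreeBytes("ščřžý") returns true, while fitsInThreeBytes("hello 😄") returns false.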




Answer 2:


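To clean the input with a UTF-8 to UTF-8 conversion (this strips tags and drops invalid byte sequences, but keeps valid four-byte characters):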

$input = iconv('UTF-8', 'UTF-8//IGNORE', trim(strip_tags($input)));

Or, using only a regex that keeps a whitelist of characters:

$input = preg_replace('/[^A-Za-z0-9:[:blank:]()+\-]/', '', $input);

This is not a full answer, just an example; other answers may follow. You will probably need to allow more symbols in the regex, so add the ones you need and experiment.
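
In the same spirit of cleaning the input instead of rejecting it, here is a minimal sketch that removes only the four-byte characters and leaves everything else alone. The function name stripFourByteCharacters is hypothetical, and returning an empty string for invalid UTF-8 is an assumption made for this example.

```php
<?php
// Hypothetical helper, not from the original answer.
// Removes every code point from U+10000 to U+10FFFF, which are exactly
// the characters that need four bytes in UTF-8.
function stripFourByteCharacters(string $input): string
{
    // /u makes PCRE match whole code points instead of single bytes.
    $cleaned = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $input);

    // preg_replace returns null on failure (for example, invalid UTF-8);
    // this sketch falls back to an empty string in that case.
    return $cleaned ?? '';
}
```

With this sketch, stripFourByteCharacters("hello 😄") returns "hello " and accented text such as "ščřžý" is returned unchanged.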



Source: https://stackoverflow.com/questions/53051684/check-if-utf-8-character-requires-maximum-three-bytes
