Question
I need to save user input to a database column that uses the utf8_general_ci collation. Its utf8 character set stores at most three bytes per code point, so if the input contains characters that need four bytes (emojis, for example), the input is not saved. I need to check that the input contains only characters that take at most three bytes. I know I could change the column to utf8mb4, but I don't want to do that.
So how can I do something like this:
if (maxThreeBytes("😄")) { // this input should end up in the "fail" branch
    echo "success";
} else {
    echo "fail";
}
More examples:
maxThreeBytes("a") => true
maxThreeBytes("ščřžý") => true
maxThreeBytes("test this") => true
maxThreeBytes("😄😄") => false
maxThreeBytes("hello 😄") => false
maxThreeBytes("test this") => true
maxThreeBytes("test 😭 this") => false
Answer 1:
Assuming that $str is UTF-8 encoded:
function maxThreeBytes($str) {
    // In valid UTF-8, a lead byte of 0xF0 or above starts a four-byte sequence.
    return !preg_match('@[\\xf0-\\xff][\\x80-\\xff][\\x80-\\xff][\\x80-\\xff]@', $str);
}
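It checks whether the string contains four consecutive bytes matching 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (binary), which is the UTF-8 encoding of code points U+10000 through U+10FFFF. The byte ranges in the pattern are slightly wider than that, which does no harm as long as the input is valid UTF-8.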
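The same check can also be written in terms of code points instead of raw bytes. The following is only a sketch under the same assumption (the input is meant to be UTF-8); the function name fitsInThreeByteUtf8 is made up for illustration. It relies on the /u pattern modifier, which makes PCRE treat the pattern and subject as UTF-8, and it rejects malformed input rather than letting it through.

```php
<?php
// Hypothetical helper name, not part of the original answer.
function fitsInThreeByteUtf8(string $str): bool
{
    // With /u, PCRE matches whole code points. U+10000..U+10FFFF are
    // exactly the code points whose UTF-8 encoding needs four bytes.
    $result = preg_match('/[\x{10000}-\x{10FFFF}]/u', $str);

    // preg_match() returns false on error, for example when the subject
    // is not valid UTF-8. Treat that as "do not store".
    if ($result === false) {
        return false;
    }

    // 0 means no four-byte character was found.
    return $result === 0;
}
```

With this sketch, fitsInThreeByteUtf8("ščřžý") is true and fitsInThreeByteUtf8("hello 😄") is false. Unlike the byte-level version, a string that is not valid UTF-8 is reported as false.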
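Answer 2:
To clean up the UTF-8 (this drops byte sequences that are not valid UTF-8; valid four-byte characters pass through unchanged):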
$input = iconv('UTF-8', 'UTF-8//IGNORE', trim(strip_tags($input)));
To strip unwanted characters with just a regex:
$input = preg_replace('/[^A-Za-z0-9:[:blank:]()+\-]/', '', $input);
This is not a full answer, just an example; wait for more comments. You might need more symbols in the regex, so add the ones you need and experiment.
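If removing the offending characters is acceptable instead of rejecting the input, a narrower replacement than a whitelist could target only the four-byte code points. This is a sketch, not a tested recommendation: the function name stripFourByteChars is invented here, and returning an empty string for malformed UTF-8 is an assumption about what the caller wants.

```php
<?php
// Hypothetical helper name, not part of the original answer.
function stripFourByteChars(string $input): string
{
    // Remove every code point from U+10000 to U+10FFFF, i.e. everything
    // that would need four bytes in UTF-8. Requires the /u modifier.
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $input);

    // preg_replace() returns null on error (e.g. malformed UTF-8).
    // Assumption: fall back to an empty string in that case.
    return $clean ?? '';
}
```

For example, stripFourByteChars("hello 😄") gives "hello " (with the trailing space), while accented letters such as "ščřžý" are left alone.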
Source: https://stackoverflow.com/questions/53051684/check-if-utf-8-character-requires-maximum-three-bytes