How to replace/remove 4(+)-byte characters from a UTF-8 string in PHP?

小蘑菇 2020-12-14 01:00

It seems like MySQL does not support characters with more than 3 bytes in its default UTF-8 charset.

So, in PHP, how can I get rid of all 4(-and-more)-byte characters?

7 answers
  • 2020-12-14 01:45

    Here's an example:

    <?php 
    
     mb_internal_encoding("UTF-8");
    
     //utf8 string,  13 bytes, 9 utf8 chars, 7 ASCII, 1 in latin1, 1 outside the BMP
     $str = "qué \xF0\x9D\x92\xB3 tal"; 
     $array = mbStringToArray($str);
     print "str: [$str]  strlen:" . strlen($str) . " chars:" . count($array) . "\n";
     $str1 = "";
     foreach($array as $c) {
       //  print "$c : " .  strlen($c)  ."\n";
       $str1 .= strlen($c)<=3? $c : '?';
     }
     print "[$str1]\n";
    
    
     function mbStringToArray ($str) {
        // Return an empty array, not false: count(false)/foreach(false) fail, and empty("0") is true
        if ($str === '') return array();
        $len = mb_strlen($str);
        $array = array();
        for ($i = 0; $i < $len; $i++) {
            $array[] = mb_substr($str, $i, 1);
        }
        return $array;
     }
    

    Or, a little more compact and efficient:

    <?php
    
     mb_internal_encoding("UTF-8");
    
     //utf8 string,  13 bytes, 9 utf8 chars, 7 ASCII, 1 in latin1, 1 outside the BMP
     $str = "qué \xF0\x9D\x92\xB3 tal";
     $str1 = trimOutsideBMP($str);
     print "original: [$str]\n";
     print "trimmed:  [$str1]\n";
    
    
     // Replaces non-BMP characters in the UTF-8 string by a '?' character 
     // Assumes UTF-8 default encoding ( if not sure, call first mb_internal_encoding("UTF-8"); )
     function trimOutsideBMP($str) {
        if (empty($str)) return $str;
        $len = mb_strlen($str);
        $str1 = '';
        for ($i = 0; $i < $len; $i++) {
            $c = mb_substr($str, $i, 1);
            $str1 .= strlen($c) <= 3 ? $c : '?';
        }
        return $str1;
     }
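
The character-by-character loop above can also be expressed as a single regular expression. Modern UTF-8 uses 4 bytes exactly for the code points U+10000 to U+10FFFF (everything outside the BMP), so matching that range with the `/u` modifier finds the same characters. This is only a sketch: the helper name `replaceOutsideBMP` and its `$replacement` parameter are made up for illustration and are not part of the answer above, and throwing on invalid UTF-8 is an assumption about how errors should be handled.

```php
<?php
// Hypothetical helper (name and signature are illustrative only).
// Replaces every code point outside the BMP (U+10000..U+10FFFF), i.e. every
// character that needs 4 bytes in UTF-8, with $replacement.
function replaceOutsideBMP(string $str, string $replacement = '?'): string
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $str);

    // preg_replace() returns null on error, e.g. when $str is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

$str = "qué \xF0\x9D\x92\xB3 tal";
print "[" . replaceOutsideBMP($str) . "]\n";      // 4-byte character replaced by '?'
print "[" . replaceOutsideBMP($str, '') . "]\n";  // 4-byte character removed
```

Passing an empty string as the replacement removes the characters instead of substituting them, which covers the "remove" half of the question. Unlike the `mb_substr` loop, this does not depend on `mb_internal_encoding()`.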
    