How to replace/remove 4(+)-byte characters from a UTF-8 string in PHP?


小蘑菇

It seems that MySQL's default UTF-8 charset does not support characters that take more than 3 bytes.

So, in PHP, how can I get rid of all 4(-and-more)-byte characters?

7 answers

面向向阳花

2020-12-14 01:22
Since 4-byte UTF-8 sequences always start with a lead byte in the range 0xF0-0xF7 (only 0xF0-0xF4 in valid UTF-8 as restricted by RFC 3629), the following should work:
```
$str = preg_replace('/[\xF0-\xF7].../s', '', $str);
```
Alternatively, you could use preg_replace in UTF-8 mode, but this will probably be slower:
```
$str = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
```
This works because 4-byte UTF-8 sequences are used for code points in the supplementary Unicode planes starting from 0x10000.
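
A quick way to check this is to look at the byte length of a few characters and run both patterns on the same input. This is only a sketch: the sample string and variable names are made up for illustration, it assumes the input is already valid UTF-8, and the `\u{...}` escapes need PHP 7.0 or later.

```php
<?php
// Hypothetical sample: ASCII, a 2-byte, a 3-byte and a 4-byte character.
$str = "a" . "\u{E9}" . "\u{20AC}" . "\u{1F600}" . "b";

// strlen() counts bytes, so it shows how many bytes each code point needs.
$lengths = [
    strlen("a"),          // U+0061
    strlen("\u{E9}"),     // U+00E9
    strlen("\u{20AC}"),   // U+20AC
    strlen("\u{1F600}"),  // U+1F600, supplementary plane
];

// Byte-oriented pattern: a lead byte 0xF0-0xF7 followed by any three bytes.
$byteMode = preg_replace('/[\xF0-\xF7].../s', '', $str);

// UTF-8 mode pattern: any code point from U+10000 to U+10FFFF.
$unicodeMode = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);

var_dump($lengths, $byteMode === $unicodeMode);
```

On valid input the two patterns should remove exactly the same characters. The byte-oriented one does no validation, so on malformed input it may swallow bytes that belong to neighbouring characters.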

太阳男子

2020-12-14 01:25
The function below replaces 3- and 4-byte characters in a UTF-8 string with '#':
```
function remove3and4bytesCharFromUtf8Str($str) {
    return preg_replace('/([\xF0-\xF7]...)|([\xE0-\xEF]..)/s', '#', $str);
}
```
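
As a rough illustration of what this does to real text, the sketch below repeats the function and runs it on a made-up sample string. Keep in mind that 3-byte sequences cover U+0800 through U+FFFF, so CJK text and symbols such as the euro sign are replaced as well, not just emoji.

```php
<?php
function remove3and4bytesCharFromUtf8Str($str) {
    return preg_replace('/([\xF0-\xF7]...)|([\xE0-\xEF]..)/s', '#', $str);
}

// Hypothetical sample: the 2-byte U+00E9 survives, while the 3-byte U+20AC
// and the 4-byte U+1F600 are each replaced by a single '#'.
$sample = "caf\u{E9} 5\u{20AC} \u{1F600}";
$result = remove3and4bytesCharFromUtf8Str($sample);

echo $result, PHP_EOL;
```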

死守一世寂寞

2020-12-14 01:25
Here is my implementation to filter out 4-byte characters:
```
$string = preg_replace_callback(
    '/./u',
    function (array $match) {
        return strlen($match[0]) >= 4 ? null : $match[0];
    },
    $string
);
```
You could tweak it and replace null (which removes the character) with a substitute string. You can also replace >= 4 with some other byte-length check.
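
That tweak might look like the following sketch. The function name `replace_long_chars` and its parameters are hypothetical and not part of the original answer; the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: replace every character encoded in $minBytes or more
// bytes with $substitute. Assumes $string is valid UTF-8; on invalid input
// preg_replace_callback() returns null when the /u modifier is used.
function replace_long_chars(string $string, string $substitute = '?', int $minBytes = 4): ?string
{
    return preg_replace_callback(
        '/./u',
        function (array $match) use ($substitute, $minBytes) {
            // strlen() counts bytes, so it gives the UTF-8 encoded length.
            return strlen($match[0]) >= $minBytes ? $substitute : $match[0];
        },
        $string
    );
}

// Replace 4-byte characters with '?'.
$withMarker = replace_long_chars("hi \u{1F600}!", '?');

// Remove everything that needs 3 or more bytes.
$upToTwoBytes = replace_long_chars("5\u{20AC} \u{1F600}", '', 3);
```

Passing the threshold as a parameter keeps the byte-length check in one place, so the same helper can serve both the "strip" and the "substitute" cases.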

再見小時候

2020-12-14 01:26
I came across this question while trying to solve my own issue (Facebook spits out certain emoticons as 4-byte characters, and Amazon Mechanical Turk does not accept 4-byte characters).

I ended up using this; it doesn't require the mbstring extension:
```
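function remove_4_byte($string) {
    // Split into individual UTF-8 characters (at every position between two characters).
    $char_array = preg_split('/(?<!^)(?!$)/u', $string);
    for ($x = 0; $x < count($char_array); $x++) {
        // strlen() counts bytes, so anything longer than 3 is a 4-byte character.
        if (strlen($char_array[$x]) > 3) {
            $char_array[$x] = "";
        }
    }
    // Separator first: the legacy (array, string) argument order was removed in PHP 8.0.
    return implode("", $char_array);
}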
```

粉色の甜心

2020-12-14 01:35
Another filter implementation, a more complex one.

It tries to transliterate each character to ASCII; if that fails, it inserts the Unicode replacement character to avoid XSS, e.g.: <a href='java\uFEFFscript:alert("XSS")'>
```
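$tr = preg_replace_callback('/([\x{10000}-\x{10FFFF}])/u', function ($m) {
    // Try to transliterate via ISO-8859-2, then convert the result back to UTF-8.
    $c = iconv('ISO-8859-2', 'UTF-8', iconv('utf-8', 'ISO-8859-2//TRANSLIT//IGNORE', $m[1]));
    if ($c == '') {
        // Nothing usable came back: fall back to U+FFFD (encoded as UTF-8 bytes).
        return "\xEF\xBF\xBD";
    }
    return $c;
}, $s);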
```

暗喜

2020-12-14 01:37
NOTE: you should not just strip these characters, but replace them with the replacement character U+FFFD to avoid Unicode-based attacks, mostly XSS:

http://unicode.org/reports/tr36/#Deletion_of_Noncharacters
```
$value = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);
```
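
To see why deletion can be risky, consider a filter that looks for the literal string `javascript:`. The sketch below uses a made-up input in which a 4-byte character splits that keyword; the input and variable names are illustrative only.

```php
<?php
// Hypothetical hostile input: a 4-byte character hides the keyword from a naive filter.
$value = "java\u{1F600}script:alert(1)";

// Deleting the character joins the two halves back into a working scheme.
$stripped = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $value);

// Replacing it with U+FFFD keeps the two halves apart.
$replaced = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);

var_dump(strpos($stripped, 'javascript:') === 0);     // keyword reassembled at offset 0
var_dump(strpos($replaced, 'javascript:') === false); // keyword still broken up
```

If the check for `javascript:` runs before the stripping step, the stripped version slips through; the replaced version never forms the keyword at all.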