Regex to detect Invalid UTF-8 String

挽巷 2020-11-29 02:13

In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8. But that's not a portable solution as it requires the mbstring extension to be compiled in a

4 answers
  •  一个人的身影
    2020-11-29 02:36

    You can use this PCRE regular expression to check a string for invalid UTF-8. If the regex matches, the string contains invalid byte sequences. It is portable because it works on raw bytes and does not rely on PCRE_UTF8 being compiled in.

    $regex = '/(
        [\xC0-\xC1] # Invalid UTF-8 Bytes
        | [\xF5-\xFF] # Invalid UTF-8 Bytes
        | \xE0[\x80-\x9F] # Overlong encoding of prior code point
        | \xF0[\x80-\x8F] # Overlong encoding of prior code point
        | [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
        | [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
        | [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
        | (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
        | (?

    We can test it by creating a few variations of text:

    // Overlong encoding of code point 0
    $text = chr(0xC0) . chr(0x80);
    var_dump(preg_match($regex, $text)); // int(1)
    // Invalid 5-byte sequence (lead byte 0xF8 is never valid in UTF-8)
    $text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
    var_dump(preg_match($regex, $text)); // int(1)
    // Invalid 6-byte sequence (lead byte 0xFC is never valid in UTF-8)
    $text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
    var_dump(preg_match($regex, $text)); // int(1)
    // Two-byte lead byte (0xD0) followed by a non-continuation byte
    $text = chr(0xD0) . chr(0x01);
    var_dump(preg_match($regex, $text)); // int(1)
    

    etc...
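
    Cases like these can also be cross-checked without any regex. What follows is a minimal sketch, not part of the original answer: is_valid_utf8() is a hypothetical helper name, and it assumes the well-formed byte ranges from the Unicode standard, so it also rejects UTF-16 surrogates and code points above U+10FFFF, which may be stricter than the regex above. It uses only core string functions, so it needs neither mbstring nor PCRE UTF-8 support.

    ```php
    <?php
    // Hypothetical helper (an assumption, not from the answer): returns true
    // when $s is well-formed UTF-8. Uses only strlen() and ord().
    function is_valid_utf8(string $s): bool
    {
        $len = strlen($s);
        for ($i = 0; $i < $len; ) {
            $b = ord($s[$i]);
            if ($b <= 0x7F) { // ASCII byte, nothing more to check
                $i++;
                continue;
            }
            // Pick the number of continuation bytes and the allowed range
            // of the first one (this rules out overlongs and surrogates).
            if ($b >= 0xC2 && $b <= 0xDF) {
                $need = 1; $lo = 0x80; $hi = 0xBF;
            } elseif ($b === 0xE0) {
                $need = 2; $lo = 0xA0; $hi = 0xBF;
            } elseif ($b === 0xED) {
                $need = 2; $lo = 0x80; $hi = 0x9F;
            } elseif ($b >= 0xE1 && $b <= 0xEF) {
                $need = 2; $lo = 0x80; $hi = 0xBF;
            } elseif ($b === 0xF0) {
                $need = 3; $lo = 0x90; $hi = 0xBF;
            } elseif ($b >= 0xF1 && $b <= 0xF3) {
                $need = 3; $lo = 0x80; $hi = 0xBF;
            } elseif ($b === 0xF4) {
                $need = 3; $lo = 0x80; $hi = 0x8F;
            } else {
                return false; // 0x80-0xC1 and 0xF5-0xFF cannot start a sequence
            }
            if ($i + $need >= $len) {
                return false; // sequence is cut off at the end of the string
            }
            for ($k = 1; $k <= $need; $k++) {
                $c = ord($s[$i + $k]);
                if ($c < $lo || $c > $hi) {
                    return false;
                }
                $lo = 0x80; $hi = 0xBF; // later continuation bytes use the full range
            }
            $i += $need + 1;
        }
        return true;
    }

    var_dump(is_valid_utf8(chr(0xC0) . chr(0x80))); // bool(false)
    var_dump(is_valid_utf8(chr(0xD0) . chr(0x01))); // bool(false)
    var_dump(is_valid_utf8("caf\xC3\xA9"));          // bool(true)
    ```

    If the regex and this function ever disagree on an input, that input is worth inspecting by hand before trusting either result.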

    In fact, since this matches invalid bytes, you could then use it in preg_replace to strip them:

    $clean = preg_replace($regex, '', $text); // Remove the invalid UTF-8 bytes matched by the regex
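
    A complementary way to clean a string is to keep only the well-formed sequences instead of deleting the malformed ones. The sketch below is an assumption-laden illustration rather than the answer's method: keep_valid_utf8() is a hypothetical name, and the pattern is a separate, commonly used byte-range expression for valid UTF-8, not the regex shown above. It runs in byte mode (no /u modifier), so it also does not depend on PCRE_UTF8.

    ```php
    <?php
    // Hypothetical helper (an assumption, not from the answer): returns $text
    // with every byte that is not part of a well-formed UTF-8 sequence dropped.
    function keep_valid_utf8(string $text): string
    {
        // The x modifier allows whitespace and # comments inside the pattern.
        $valid = '/
              [\x00-\x7F]                        # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequence
            | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte
            | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
            | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
            | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte
            | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
        /x';
        // Each match is one well-formed character; unmatched bytes are skipped.
        preg_match_all($valid, $text, $matches);
        return implode('', $matches[0]);
    }

    var_dump(keep_valid_utf8("a\xC0\x80b") === "ab");            // bool(true)
    var_dump(keep_valid_utf8("caf\xC3\xA9") === "caf\xC3\xA9"); // bool(true)
    ```

    This produces one match per character, so for very large inputs it may be slower than a single preg_replace call; it is meant as a readable sketch rather than a tuned implementation.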
    
