In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8. But that\'s not a portable solution as it requires the mbstring extension to be compiled in a
You can use this PCRE regular expression to check for a valid UTF-8 in a string. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it doesn't rely on PCRE_UTF8 to be compiled in.
$regex = '/(
[\xC0-\xC1] # Invalid UTF-8 Bytes
| [\xF5-\xFF] # Invalid UTF-8 Bytes
| \xE0[\x80-\x9F] # Overlong encoding of prior code point
| \xF0[\x80-\x8F] # Overlong encoding of prior code point
| [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
| [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
| [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
| (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?
We can test it by creating a few variations of text:
// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 5 byte encoding
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 6 byte encoding
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// High code-point without trailing characters
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)
etc...
In fact, since this matches invalid bytes, you could then use it in preg_replace to replace them away:
preg_replace($regex, '', $text); // Remove all invalid UTF-8 code-points