I need to validate some user input that is encoded in UTF-8. Many have recommended using the following code:
preg_match(\'/\\A(
[\\x09\\x0A\\x0D\\x20-\\
Given that there is still no explicit isUtf8() function in PHP, here's how UTF-8 can be accurately validated in PHP depending on your PHP version.
Easiest and most backwards compatible way to properly validate UTF-8 is still via regular expression using function such as:
function isValid($string)
{
return preg_match(
'/\A(?>
[\x00-\x7F]+ # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x',
$string
) === 1;
}
Note the two key differences to the regular expression offered by W3C. It uses once only subpattern and has a '+' quantifier after the first character class. The problem of PCRE crashing still persists, but most of it is caused by using repeating capturing subpattern. By turning the pattern to a once only pattern and capturing multiple single byte characters in single subpattern, it should prevent PCRE from quickly running out of stack (and causing a segfault). Unless you're validating strings with lots of multibyte characters (in the range of thousands), this regular expression should serve you well.
Another good alternative is using mb_check_encoding()
if you have the mbstring extension available. Validating UTF-8 can be done as simply as:
function isValid($string)
{
return mb_check_encoding($string, 'UTF-8') === true;
}
Note, however, that if you're using PHP version prior to 5.4.0, this function has some flaws in it's validation:
As the internet also lists numerous other ways to validate UTF-8, I will discuss some of them here. Note that the following should be avoided in most cases.
Use of mb_detect_encoding()
is sometimes seen to validate UTF-8. If you have at least PHP version 5.4.0, it does actually work with the strict parameter via:
function isValid($string)
{
return mb_detect_encoding($string, 'UTF-8', true) === 'UTF-8';
}
It is very important to understand that this does not work prior to 5.4.0. It's very flawed prior to that version, since it only checks for invalid sequences but allows overlong sequences and invalid code points. In addition, you should never use it for this purpose without the strict parameter set to true (it does not actually do validation without the strict parameter).
One nifty way to validate UTF-8 is via the use of 'u' flag in PCRE. Though poorly documented, it also validates the subject string. An example could be:
function isValid($string)
{
return preg_match('//u', $string) === 1;
}
Every string should match an empty pattern, but usage of the 'u' flag will only match on valid UTF-8 strings. However, unless you're using at least 5.5.10. The validation is flawed as follows:
Using the 'u' flag behavior does have one advantage though: It's the fastest of the discussed methods. If you need speed and you're running the latest and greatest PHP version, this validation method might be for you.
One additional way to validate for UTF-8 is via json_encode()
, which expects input strings to be in UTF-8. It does not work prior to 5.5.0, but after that, invalid sequences return false instead of a string. For example:
function isValid($string)
{
return json_encode($string) !== false;
}
I would not recommend on relying on this behavior to last, however. Previous PHP versions simply produced an error on invalid sequences, so there is no guarantee that the current behavior is final.