UTF-8 validation in PHP without using preg_match()

梦如初夏 2021-01-01 01:43

I need to validate some user input that is encoded in UTF-8. Many have recommended using the following code:

preg_match('/\A(
     [\x09\x0A\x0D\x20-\


        
5 Answers
  • 2021-01-01 02:13

    Here is a string-function based solution:

    http://www.php.net/manual/en/function.mb-detect-encoding.php#85294

    <?php
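    // Structural check only: validates lead bytes and continuation-byte counts.
    // Overlong forms, surrogates, and 5- and 6-byte sequences are still accepted.
    function is_utf8($str) {
        $c=0; $b=0;
        $bits=0;
        $len=strlen($str);
        for($i=0; $i<$len; $i++){
            $c=ord($str[$i]);
            if($c >= 128){ // non-ASCII byte: must be a valid lead byte
                if(($c >= 254)) return false;
                elseif($c >= 252) $bits=6;
                elseif($c >= 248) $bits=5;
                elseif($c >= 240) $bits=4;
                elseif($c >= 224) $bits=3;
                elseif($c >= 192) $bits=2;
                else return false; // stray continuation byte (128-191)
                if(($i+$bits) > $len) return false; // sequence runs past the end
                while($bits > 1){
                    $i++;
                    $b=ord($str[$i]);
                    if($b < 128 || $b > 191) return false; // not a continuation byte
                    $bits--;
                }
            }
        }
        return true;
    }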
    ?>
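
The loop above checks structure only. As a rough sketch of a stricter variant, the function below also restricts the second byte of each sequence, so overlong forms, surrogates, and code points above U+10FFFF are rejected. The name is_strict_utf8() is hypothetical and not part of the linked manual comment; the byte ranges are the same ones used by the well-formed UTF-8 table in RFC 3629.

```php
<?php
// Hypothetical helper: strict UTF-8 validation without preg_match().
function is_strict_utf8(string $str): bool
{
    $len = strlen($str);
    $i = 0;
    while ($i < $len) {
        $c = ord($str[$i]);
        if ($c < 0x80) {                      // ASCII
            $i++;
            continue;
        }
        // Choose the sequence length and the allowed range of the second byte.
        if ($c >= 0xC2 && $c <= 0xDF) {
            $n = 2; $lo = 0x80; $hi = 0xBF;
        } elseif ($c === 0xE0) {
            $n = 3; $lo = 0xA0; $hi = 0xBF;   // excludes overlong 3-byte forms
        } elseif ($c === 0xED) {
            $n = 3; $lo = 0x80; $hi = 0x9F;   // excludes surrogates
        } elseif ($c >= 0xE1 && $c <= 0xEF) {
            $n = 3; $lo = 0x80; $hi = 0xBF;
        } elseif ($c === 0xF0) {
            $n = 4; $lo = 0x90; $hi = 0xBF;   // excludes overlong 4-byte forms
        } elseif ($c >= 0xF1 && $c <= 0xF3) {
            $n = 4; $lo = 0x80; $hi = 0xBF;
        } elseif ($c === 0xF4) {
            $n = 4; $lo = 0x80; $hi = 0x8F;   // nothing above U+10FFFF
        } else {
            return false;                     // 0x80-0xC1 and 0xF5-0xFF never start a sequence
        }
        if ($i + $n > $len) {
            return false;                     // truncated sequence
        }
        $b = ord($str[$i + 1]);
        if ($b < $lo || $b > $hi) {
            return false;
        }
        // Any remaining bytes must be plain continuation bytes.
        for ($k = 2; $k < $n; $k++) {
            $b = ord($str[$i + $k]);
            if ($b < 0x80 || $b > 0xBF) {
                return false;
            }
        }
        $i += $n;
    }
    return true;
}
```

This trades a few extra comparisons per character for not depending on PCRE or mbstring at all.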
    
  • 2021-01-01 02:15

    You should be able to use iconv to check for validity. Just try to convert the string to UTF-16 and see whether you get an error.
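
A minimal sketch of that idea, assuming the iconv extension is loaded. The function name isValidViaIconv() is made up for illustration, and how strictly malformed input is rejected depends on the iconv library PHP was built against.

```php
<?php
// Hypothetical helper: treat a failed UTF-8 -> UTF-16 conversion as "invalid".
function isValidViaIconv(string $string): bool
{
    // iconv() returns false and raises a notice when it meets an illegal or
    // incomplete input sequence; the @ operator silences that notice.
    return @iconv('UTF-8', 'UTF-16', $string) !== false;
}
```

The converted string is thrown away, so this costs a full conversion per check.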

  • 2021-01-01 02:26

    You can always use the Multibyte String Functions:

    If you plan to use them a lot and might change the encoding at some point:

    1) First set the encoding you want to use in your config file

    /* Set internal character encoding to UTF-8 */
    mb_internal_encoding("UTF-8");
    

    2) Check the string

    if(mb_check_encoding($string))
    {
        // do something
    }
    

    Or, if you don't plan on changing it, you can always just put the encoding straight into the function:

    if(mb_check_encoding($string, 'UTF-8'))
    {
        // do something
    }
    
  • 2021-01-01 02:29

    Given that there is still no explicit isUtf8() function in PHP, here's how UTF-8 can be accurately validated in PHP depending on your PHP version.

    The easiest and most backwards-compatible way to properly validate UTF-8 is still a regular expression, using a function such as:

    function isValid($string)
    {
        return preg_match(
            '/\A(?>
                [\x00-\x7F]+                       # ASCII
              | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
              |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
              | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
              |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
              |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
              | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
              |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )*\z/x',
            $string
        ) === 1;
    }
    

    Note the two key differences from the regular expression offered by the W3C: it uses a once-only (atomic) subpattern, (?>...), and it has a '+' quantifier after the first character class. The problem of PCRE crashing still persists, but most of it is caused by repeating a capturing subpattern. Turning the group into a once-only subpattern and matching a whole run of single-byte characters in one pass of the subpattern should keep PCRE from quickly running out of stack (and segfaulting). Unless you're validating strings with thousands of multibyte characters, this regular expression should serve you well.

    Another good alternative is using mb_check_encoding() if you have the mbstring extension available. Validating UTF-8 can be done as simply as:

    function isValid($string)
    {
        return mb_check_encoding($string, 'UTF-8') === true;
    }
    

    Note, however, that if you're using a PHP version prior to 5.4.0, this function has some flaws in its validation:

    • Prior to 5.4.0, the function accepts code points beyond the allowed Unicode range. This means it also allows 5- and 6-byte UTF-8 characters.
    • Prior to 5.3.0, the function accepts surrogate code points as valid UTF-8 characters.
    • Prior to 5.2.5, the function does not work as intended and is completely unusable.
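
To see whether a given installation is affected by the cases listed above, a quick probe along these lines may help. The sample byte sequences and their labels are chosen here for illustration and are not from the original answer; the sketch assumes the mbstring extension is available. On a current PHP build only the first sample should be reported as valid.

```php
<?php
// Illustrative byte sequences, one per edge case mentioned in the list.
$samples = [
    'valid 2-byte (U+00E9)' => "\xC3\xA9",
    'surrogate U+D800'      => "\xED\xA0\x80",
    'beyond U+10FFFF'       => "\xF4\x90\x80\x80",
    '5-byte form'           => "\xF8\x88\x80\x80\x80",
];

$results = [];
foreach ($samples as $label => $bytes) {
    // Record and print the verdict of mb_check_encoding() for each sample.
    $results[$label] = mb_check_encoding($bytes, 'UTF-8');
    printf("%-24s %s\n", $label, $results[$label] ? 'valid' : 'invalid');
}
```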

    As the internet also lists numerous other ways to validate UTF-8, I will discuss some of them here. Note that the following should be avoided in most cases.

    Use of mb_detect_encoding() is sometimes seen to validate UTF-8. If you have at least PHP version 5.4.0, it does actually work with the strict parameter via:

    function isValid($string)
    {
        return mb_detect_encoding($string, 'UTF-8', true) === 'UTF-8';
    }
    

    It is very important to understand that this does not work prior to 5.4.0. It's very flawed prior to that version, since it only checks for invalid sequences but allows overlong sequences and invalid code points. In addition, you should never use it for this purpose without the strict parameter set to true (it does not actually do validation without the strict parameter).

    One nifty way to validate UTF-8 is to use the 'u' flag in PCRE. Though this is poorly documented, the flag also makes PCRE validate the subject string. An example could be:

    function isValid($string)
    {
        return preg_match('//u', $string) === 1;
    }
    

    Every string should match an empty pattern, but with the 'u' flag the match only succeeds on valid UTF-8 strings. However, unless you're using at least PHP 5.5.10, the validation is flawed as follows:

    • Prior to 5.5.10, it does not recognize 3- and 4-byte sequences as valid UTF-8. As this excludes most Unicode code points, it is a pretty major flaw.
    • Prior to 5.2.5, it also allows surrogates and code points beyond the allowed Unicode space (e.g. 5- and 6-byte characters).
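
When the 'u' check fails, preg_last_error() can tell a malformed subject apart from other PCRE failures. The wrapper below is only a sketch; the name utf8Status() and its return strings are invented for illustration.

```php
<?php
// Hypothetical wrapper around the '//u' check that reports why it failed.
function utf8Status(string $string): string
{
    $result = preg_match('//u', $string);
    if ($result === 1) {
        return 'valid';
    }
    // preg_match() returns false on error; PREG_BAD_UTF8_ERROR marks a
    // subject string that is not well-formed UTF-8.
    return preg_last_error() === PREG_BAD_UTF8_ERROR
        ? 'invalid UTF-8'
        : 'other PCRE error';
}
```

Checking the error code avoids mistaking an unrelated PCRE failure for bad input.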

    Using the 'u' flag behavior does have one advantage though: It's the fastest of the discussed methods. If you need speed and you're running the latest and greatest PHP version, this validation method might be for you.

    One additional way to validate for UTF-8 is via json_encode(), which expects input strings to be in UTF-8. It does not work prior to 5.5.0, but after that, invalid sequences return false instead of a string. For example:

    function isValid($string)
    {
        return json_encode($string) !== false;
    }
    

    I would not recommend relying on this behavior to last, however. Previous PHP versions simply produced an error on invalid sequences, so there is no guarantee that the current behavior is final.
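
If you do go this route, one way to be a little more explicit is to look at the error code instead of the return value. This is a sketch under the assumption that json_last_error() reports JSON_ERROR_UTF8 for malformed input; the function name isValidViaJson() is hypothetical.

```php
<?php
// Hypothetical helper: rely on the JSON error code rather than on the
// return value of json_encode() alone.
function isValidViaJson(string $string): bool
{
    json_encode($string);
    // json_last_error() reflects the most recent json_encode() call.
    return json_last_error() !== JSON_ERROR_UTF8;
}
```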

  • 2021-01-01 02:32

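    Have you tried ereg() instead of preg_match()? Perhaps it doesn't have that bug, in which case you wouldn't need a potentially buggy workaround. Note that ereg() was deprecated in PHP 5.3.0 and removed in PHP 7.0.0, so this only applies to older versions.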
