How to handle user input of invalid UTF-8 characters?

小鲜肉 2020-11-29 17:26

I'm looking for a general strategy or advice on how to handle invalid UTF-8 input from users.

Even though my webapp uses UTF-8, somehow some users enter invalid chara

9 Answers
  •  春和景丽
    2020-11-29 18:16

    The accept-charset="UTF-8" attribute is only a guideline for browsers; they are not forced to submit data that way, and crappy form-submission bots are a good example of clients that don't.

    What I usually do is ignore bad chars, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions (which only convert between ISO-8859-1 and UTF-8, and are deprecated as of PHP 8.2). If you use iconv(), you also have the option to transliterate bad chars.

    Here is an example using iconv():

    $str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
    $str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
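
    If you would rather detect bad input before deciding whether to strip it, a validity check is a small addition. The following is only a sketch: it assumes the mbstring extension is loaded, and is_valid_utf8() is a hypothetical helper name, not part of the code above.

    ```php
    <?php
    // Hypothetical helper: true when $str is well-formed UTF-8.
    // Assumes the mbstring extension is available.
    function is_valid_utf8(string $str): bool
    {
        return mb_check_encoding($str, 'UTF-8');
    }

    $good = "caf\u{E9}";   // "café", valid UTF-8
    $bad  = "abc\xFFdef";  // the byte 0xFF can never appear in UTF-8
    $cut  = "abc\xC3";     // multi-byte sequence truncated after its lead byte

    var_dump(is_valid_utf8($good)); // bool(true)
    var_dump(is_valid_utf8($bad));  // bool(false)
    var_dump(is_valid_utf8($cut));  // bool(false)
    ```

    Checking first lets you choose between rejecting the request and silently cleaning it.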
    

    If you want to display an error message to your users, I'd do it globally rather than per received value. Something like this would probably do just fine:

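    // Strip invalid byte sequences from a single string value.
    function utf8_clean($str)
    {
        return iconv('UTF-8', 'UTF-8//IGNORE', $str);
    }
    
    // Note: array_map() only handles flat values; nested arrays (e.g. ?a[]=1) need a recursive walk.
    $clean_GET = array_map('utf8_clean', $_GET);
    
    // If cleaning changed anything, at least one value contained invalid UTF-8.
    if (serialize($_GET) !== serialize($clean_GET))
    {
        $_GET = $clean_GET;
        $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
    }
    
    // $_GET is clean!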
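
    Query strings such as ?tags[]=a produce nested arrays in $_GET, which a flat array_map() passes straight to the cleaning function. A recursive variant could look like the sketch below. Note that it swaps iconv() for mbstring's mb_convert_encoding() with the substitute character set to 'none', which drops invalid bytes; utf8_clean_deep() is a hypothetical name, the mbstring extension is assumed, and array keys are left untouched.

    ```php
    <?php
    // Hypothetical helper: recursively drop invalid UTF-8 bytes from strings.
    // Assumes the mbstring extension is available. Array keys are not cleaned.
    function utf8_clean_deep($value)
    {
        if (is_array($value)) {
            return array_map('utf8_clean_deep', $value);
        }
        if (!is_string($value)) {
            return $value;
        }

        $previous = mb_substitute_character();  // remember the current setting
        mb_substitute_character('none');        // drop invalid bytes instead of replacing them
        $clean = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
        mb_substitute_character($previous);     // restore the setting

        return $clean;
    }

    // Stand-in for $_GET with one nested, partly invalid value.
    $input = array('q' => 'ok', 'tags' => array("a\xFFb", 'c'));
    $clean = utf8_clean_deep($input);

    if ($clean !== $input) {
        $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
    }
    ```

    Comparing the arrays with !== avoids the serialize() round trip while still detecting any change.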
    

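    You may also want to normalize newlines and strip control characters, visible or not. With $control set to true, the function below removes every control character, including tabs and newlines; otherwise it normalizes line endings to \n and keeps tabs and newlines: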

    function Clean($string, $control = true)
    {
        $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);
    
        if ($control === true)
        {
            return preg_replace('~\p{C}+~u', '', $string);
        }
    
        return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
    }
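
    To see what the two branches do, here is a small sketch that applies the same patterns directly to a sample string (the iconv() step is left out, and the variable names are only illustrative):

    ```php
    <?php
    // Sample with Windows (\r\n) and old Mac (\r) line endings, NUL, BEL and a tab.
    $input = "line1\r\nline2\rline3\x00\x07\tend";

    // Patterns from the $control !== true branch: normalize newlines, then drop
    // control characters other than tab and newline.
    $kept = preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $input);

    // Pattern from the $control === true branch: drop every control character.
    $stripped = preg_replace('~\p{C}+~u', '', $input);

    echo json_encode($kept), "\n";     // "line1\nline2\nline3\tend"
    echo json_encode($stripped), "\n"; // "line1line2line3end"
    ```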
    

    Code to convert from UTF-8 to Unicode codepoints:

    function Codepoint($char)
    {
        $result = null;
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
    
        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result = sprintf('U+%04X', $codepoint[1]);
        }
    
        return $result;
    }
    
    echo Codepoint('à'); // U+00E0
    echo Codepoint('ひ'); // U+3072
    

    This is probably faster than the alternatives, though I haven't tested it extensively.
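
    As an alternative to Codepoint() above, the same mapping can be sketched without iconv() and unpack() by using mb_str_split() and mb_ord(). This is a different technique, not the code above; it assumes PHP 7.4 or later with the mbstring extension, and codepoints() is a hypothetical name.

    ```php
    <?php
    // Hypothetical helper: list the code points of a valid UTF-8 string.
    // Assumes PHP >= 7.4 (mb_str_split) and the mbstring extension.
    function codepoints(string $str): array
    {
        $result = array();

        foreach (mb_str_split($str, 1, 'UTF-8') as $char) {
            $result[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
        }

        return $result;
    }

    echo implode(' ', codepoints("h\u{E0}\u{3072}")); // U+0068 U+00E0 U+3072
    ```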


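    Example that flags suspicious characters by replacing them with their code points:

    // U+FFFE (a noncharacter) at the start, U+FFFD (replacement character) at the end.
    // The "\u{...}" escapes require PHP 7.0 or later.
    $string = "\u{FFFE}hello world\u{FFFD}";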
    
    // U+FFFEhello worldU+FFFD
    echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);
    
    function Bad_Codepoint($string)
    {
        $result = array();
    
        foreach ((array) $string as $char)
        {
            $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));
    
            if (is_array($codepoint) && array_key_exists(1, $codepoint))
            {
                $result[] = sprintf('U+%04X', $codepoint[1]);
            }
        }
    
        return implode('', $result);
    }
    

    Is this what you were looking for?
