Regex to detect Invalid UTF-8 String

挽巷 2020-11-29 02:13

In PHP, we can use mb_check_encoding() to determine if a string is valid UTF-8. But that's not a portable solution as it requires the mbstring extension to be compiled in a

4 Answers
  •  抹茶落季
    2020-11-29 02:38

    The W3C has a page (titled Multilingual form encoding) that lists the following Perl regular expression which matches a valid UTF-8 string.

    (Note that this is the opposite of the regex given in another answer to this SO question; that one matches an invalid UTF-8 string.)

    #  Returns true if $field is UTF-8, and false otherwise.
    
    $field =~
      m/\A(
         [\x09\x0A\x0D\x20-\x7E]            # ASCII
       | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
       |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
       | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
       |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
       |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
       | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
       |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )*\z/x;
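
For readers who want to try the pattern outside Perl, here is a minimal sketch of the same expression ported to a Python bytes regex. The names VALID_UTF8 and is_valid_utf8 are made up for illustration and do not come from the W3C page. Like the original, the ASCII branch accepts only tab, LF, CR and printable characters, so input containing other control bytes (NUL, for example) is rejected even though it is technically well-formed UTF-8.

```python
import re

# Byte-level port of the W3C pattern quoted above (hypothetical names).
# Python's \Z matches only at the very end of the input, like Perl's \z.
VALID_UTF8 = re.compile(
    rb"""\A(?:
        [\x09\x0A\x0D\x20-\x7E]             # ASCII (tab, LF, CR, printable)
      | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )*\Z""",
    re.VERBOSE,
)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in `data` matches the pattern."""
    return VALID_UTF8.match(data) is not None


if __name__ == "__main__":
    print(is_valid_utf8("héllo €".encode("utf-8")))  # well-formed text
    print(is_valid_utf8(b"\xC0\xAF"))                # overlong encoding of "/"
    print(is_valid_utf8(b"\xED\xA0\x80"))            # encoded surrogate U+D800
```

In Python itself, calling data.decode("utf-8") inside a try/except UnicodeDecodeError is usually the simpler check. The regex port is mainly useful for seeing exactly which byte ranges the pattern accepts.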
    
