Check for invalid UTF8

忘掉有多难 2020-12-05 08:51

I am converting from UTF-8 to the actual value in hex. However, there are some invalid byte sequences that I need to catch. Is there a quick way to check if a characte

3 answers
  •  鱼传尺愫
    2020-12-05 09:15

    Good answer already, I'm just chipping in another take on this for fun.

    UTF-8 uses a general scheme by Prosser and Thompson to encode large numbers as sequences of single bytes. The scheme can actually represent 2^36 values, but for Unicode we only need 2^21. Here's how it works. Let N be the number you want to encode (e.g. a Unicode code point):

    • If N < 128, just one byte 0nnnnnnn. The highest bit is zero.
    • Otherwise, several bytes. The first byte starts with as many ones as there are bytes in the sequence, followed by a zero, and then the data bits; successive bytes start with 10 followed by six data bits. Examples:
    • 3 byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
    • 5 byte sequence: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.
    • 7 byte sequence: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.
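To make the bit layout above concrete, here is a minimal Python sketch of an encoder for the generalized scheme. The name `encode_general` is made up for illustration, and the function deliberately allows up to 7 bytes, so for large inputs it produces sequences that are not legal UTF-8.

```python
def encode_general(n: int) -> bytes:
    """Encode a non-negative integer with the generalized 1..7 byte scheme."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n < 0x80:
        return bytes([n])  # single byte 0nnnnnnn
    # Pick the smallest k >= 2 whose 5k + 1 data bits can hold n.
    for k in range(2, 8):
        if n < (1 << (5 * k + 1)):
            break
    else:
        raise ValueError("n does not fit in 36 bits")
    # Trailing bytes are 10xxxxxx; collect them from the low bits upwards.
    tail = []
    for _ in range(k - 1):
        tail.append(0x80 | (n & 0x3F))
        n >>= 6
    # Lead byte: k ones, a zero, then the remaining 7 - k data bits.
    lead = ((0xFF << (8 - k)) & 0xFF) | n
    return bytes([lead] + tail[::-1])


print(encode_general(0x20AC).hex())  # the euro sign, a 3-byte sequence
```

For values up to U+10FFFF (excluding surrogates) this should agree with Python's own `str.encode("utf-8")`.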

    A k-byte sequence holds 5k + 1 data bits when k > 1: the first byte carries 7 − k of them and each of the k − 1 following bytes carries six. That tells you how many bytes you need for a given N. For decoding, read one byte; if its top bit is zero, store its value as is; otherwise use the first byte to figure out how many bytes are in the sequence and process all of them.
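The decoding steps just described could look roughly like this in Python. `sequence_length` and `decode_one` are hypothetical helper names, and this sketch only checks the structure (a sensible lead byte followed by the right number of `10xxxxxx` bytes); it does not reject overlong encodings or values outside the Unicode range.

```python
def sequence_length(first: int) -> int:
    """Return the number of bytes in a sequence, judged by its first byte."""
    if first < 0x80:
        return 1  # 0nnnnnnn
    if first < 0xC0:
        raise ValueError("continuation byte 10xxxxxx cannot start a sequence")
    if first == 0xFF:
        raise ValueError("0xFF is not a lead byte in this scheme")
    k = 0
    while first & (0x80 >> k):  # count the leading one bits
        k += 1
    return k


def decode_one(data: bytes, pos: int = 0):
    """Decode one sequence starting at pos; return (value, next position)."""
    first = data[pos]
    k = sequence_length(first)
    if k == 1:
        return first, pos + 1
    if pos + k > len(data):
        raise ValueError("truncated sequence")
    value = first & (0x7F >> k)  # keep the 7 - k data bits of the lead byte
    for b in data[pos + 1 : pos + k]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected continuation byte 10xxxxxx")
        value = (value << 6) | (b & 0x3F)
    return value, pos + k


value, end = decode_one(b"\xe2\x82\xac")
print(hex(value), end)
```

Returning the next position lets a caller walk a whole buffer by calling `decode_one` in a loop until `pos` reaches `len(data)`.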

    Unicode as of today needs at most k = 4 bytes: code points stop at U+10FFFF, and a 4-byte sequence already holds 21 bits.
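Coming back to the question of catching invalid sequences: with k limited to 4, a validity check might be sketched as below. The function name `is_valid_utf8` is made up here, and the three extra rules it applies (no overlong encodings, no surrogates U+D800..U+DFFF, nothing above U+10FFFF) come from the UTF-8 definition in RFC 3629 rather than from the scheme described in this answer.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte string."""
    i, n = 0, len(data)
    while i < n:
        first = data[i]
        if first < 0x80:  # 0nnnnnnn, plain ASCII
            i += 1
            continue
        # k = sequence length, value = data bits of the lead byte,
        # minimum = smallest code point that really needs k bytes.
        if 0xC0 <= first < 0xE0:
            k, value, minimum = 2, first & 0x1F, 0x80
        elif 0xE0 <= first < 0xF0:
            k, value, minimum = 3, first & 0x0F, 0x800
        elif 0xF0 <= first < 0xF8:
            k, value, minimum = 4, first & 0x07, 0x10000
        else:
            return False  # stray continuation byte, or a lead byte for k > 4
        if i + k > n:
            return False  # sequence is cut off at the end of the input
        for b in data[i + 1 : i + k]:
            if b & 0xC0 != 0x80:
                return False  # not a 10xxxxxx continuation byte
            value = (value << 6) | (b & 0x3F)
        if value < minimum:
            return False  # overlong encoding
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return False  # outside Unicode, or a surrogate
        i += k
    return True


print(is_valid_utf8("h\u00e9llo".encode("utf-8")))
print(is_valid_utf8(b"\xc0\x80"))
```

If you are in Python anyway, `data.decode("utf-8")` in its default strict mode raises `UnicodeDecodeError` for the same kinds of input, so the hand-written loop is mainly useful as a template for other languages.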
