I am converting from UTF8 format to actual value in hex. However there are some invalid sequences of bytes that I need to catch. Is there a quick way to check if a characte
Good answer already, I'm just chipping in another take on this for fun.
UTF-8 uses a general scheme by Prosser and Thompson to encode large numbers as sequences of single bytes. This scheme can actually represent 2^36 values, but for Unicode we only need 2^21. Here's how it works. Let N be the number you want to encode (e.g. a Unicode codepoint):
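- If N < 128, use just one byte: 0nnnnnnn. The highest bit is zero.
- Otherwise, use several bytes. The first byte starts with as many ones as there are bytes in the sequence, followed by a zero and then the data bits; each subsequent byte starts with 10 followed by six data bits. Examples:
  - 3-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
  - 5-byte sequence: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.
  - 7-byte sequence: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.
A k-byte sequence fits 5k + 1 bits (when k > 1), so you can determine how many bytes you need given N. For decoding, read one byte; if its top bit is zero, store its value as is; otherwise use the first byte to figure out how many bytes are in the sequence and process all of them.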
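To make the scheme concrete, here is a minimal Python sketch of the general encoding and decoding just described. The names sequence_length, encode and decode are made up for illustration. It covers the full scheme up to 7 bytes rather than strict UTF-8, so it deliberately does not reject overlong forms or values outside the Unicode range.

```python
def sequence_length(n: int) -> int:
    """Number of bytes the general scheme needs for the value n."""
    if n < 0 or n >= 1 << 36:
        raise ValueError("value does not fit in a 7-byte sequence")
    if n < 0x80:
        return 1
    k = 2
    # A k-byte sequence (k > 1) holds 5k + 1 data bits.
    while n.bit_length() > 5 * k + 1:
        k += 1
    return k


def encode(n: int) -> bytes:
    """Encode n as a byte sequence (illustrative, up to 7 bytes)."""
    k = sequence_length(n)
    if k == 1:
        return bytes([n])
    out = []
    # Continuation bytes: 10 followed by six data bits, lowest bits last.
    for _ in range(k - 1):
        out.append(0x80 | (n & 0x3F))
        n >>= 6
    # First byte: k ones, a zero, then whatever data bits remain.
    lead_prefix = (0xFF << (8 - k)) & 0xFF
    out.append(lead_prefix | n)
    return bytes(reversed(out))


def decode(data: bytes) -> int:
    """Decode the single sequence at the start of data."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        return first  # top bit is zero: the byte is the value
    # Count the leading ones of the first byte to get the length k.
    k = 0
    while k < 8 and first & (0x80 >> k):
        k += 1
    if k == 1 or k == 8:
        raise ValueError("not a valid first byte")
    if len(data) < k:
        raise ValueError("truncated sequence")
    value = first & (0x7F >> k)  # data bits of the first byte
    for b in data[1:k]:
        if (b & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value
```

For instance, encode(0x20AC) produces the same three bytes as "€".encode("utf-8"), and decode turns them back into 0x20AC.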
For Unicode as of today we need at most k = 4 bytes, since codepoints stop at U+10FFFF, which fits in 21 bits.
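Since the question is about catching invalid byte sequences, here is a rough Python sketch of a strict decoder limited to 4 bytes. The function name decode_utf8 and the error messages are my own choices, not part of any library. On top of the structural checks (valid first byte, enough continuation bytes), it rejects overlong encodings, surrogate values and anything above U+10FFFF, which is what current UTF-8 requires.

```python
def decode_utf8(data: bytes) -> list:
    """Return the codepoints in data, raising ValueError on invalid UTF-8."""
    # Smallest value that really needs k bytes; anything lower is overlong.
    min_value = {2: 0x80, 3: 0x800, 4: 0x10000}
    codepoints = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            codepoints.append(first)
            i += 1
            continue
        if 0xC0 <= first <= 0xDF:
            k = 2
        elif 0xE0 <= first <= 0xEF:
            k = 3
        elif 0xF0 <= first <= 0xF7:
            k = 4
        else:
            # Stray continuation byte, or a 5+ byte lead not used by Unicode.
            raise ValueError(f"invalid first byte 0x{first:02X} at offset {i}")
        tail = data[i + 1:i + k]
        if len(tail) != k - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        value = first & (0x7F >> k)  # data bits of the first byte
        for b in tail:
            if (b & 0xC0) != 0x80:
                raise ValueError(f"bad continuation byte after offset {i}")
            value = (value << 6) | (b & 0x3F)
        if value < min_value[k]:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"surrogate codepoint at offset {i}")
        if value > 0x10FFFF:
            raise ValueError(f"codepoint above U+10FFFF at offset {i}")
        codepoints.append(value)
        i += k
    return codepoints
```

If you only need a yes/no answer and are already in Python, calling data.decode("utf-8") and catching UnicodeDecodeError performs the same kind of strict check.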