Can UTF-8 contain a zero byte?

日久生厌 2020-11-29 10:00

Can a UTF-8 string contain zero bytes? I'm going to send it over an ASCII plaintext protocol; should I encode it with something like base64?

3 answers
  •  [愿得一人]
    2020-11-29 10:34

    Yes, but only as the encoding of code point 0, NUL. No other Unicode code point is encoded in UTF-8 with a zero byte anywhere within it.

    The possible code points and their UTF-8 encodings are:

    Range              Encoding  Binary value
    -----------------  --------  --------------------------
    U+000000-U+00007f  0xxxxxxx  0xxxxxxx
    
    U+000080-U+0007ff  110yyyxx  00000yyy xxxxxxxx
                       10xxxxxx
    
    U+000800-U+00ffff  1110yyyy  yyyyyyyy xxxxxxxx
                       10yyyyxx
                       10xxxxxx
    
    U+010000-U+10ffff  11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                       10zzyyyy
                       10yyyyxx
                       10xxxxxx
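
    As a rough illustration of the table, here is a minimal sketch of an encoder that follows the bit layouts row by row. The function name utf8_encode is made up for this example, and rejecting the surrogate range U+D800-U+DFFF is an addition that the table does not show (UTF-8 does not encode surrogates). In real code you would normally just call str.encode("utf-8").

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by following the bit layouts in the table.

    Hypothetical helper for illustration only.
    """
    # Assumption not shown in the table: surrogates and out-of-range
    # values are rejected, as in standard UTF-8.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:
        # 0xxxxxxx: ASCII is stored as itself.
        return bytes([cp])
    if cp <= 0x7FF:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

    Every lead byte has 0xC0, 0xE0 or 0xF0 OR-ed in and every continuation byte has 0x80 OR-ed in, so no byte of a multibyte sequence can be zero.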
    

    You can see that all the non-zero ASCII characters are represented as themselves, while every byte of a multibyte sequence has its high bit set to 1.
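
    If you want to convince yourself, a brute-force check is cheap. The sketch below walks every code point that Python's built-in encoder accepts and records any whose UTF-8 form contains a zero byte, or whose multibyte form contains a byte without the high bit set. The function names are invented for this example.

```python
def code_points_with_zero_byte():
    """Return every code point whose UTF-8 encoding contains a 0x00 byte."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if 0 in chr(cp).encode("utf-8"):
            hits.append(cp)
    return hits


def multibyte_bytes_all_high():
    """Check that every byte of every multibyte sequence is >= 0x80."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        if any(b < 0x80 for b in chr(cp).encode("utf-8")):
            return False
    return True
```

    Calling code_points_with_zero_byte() should give back a list containing only 0, which matches the claim that NUL is the one code point that produces a zero byte.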

    Be careful, though, that your ASCII plaintext protocol handles bytes with the high bit set correctly, since every non-ASCII code point is encoded entirely with such bytes.
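
    If the protocol turns out not to be 8-bit clean, or the text may legitimately contain NUL, base64 as suggested in the question is one way around it. A minimal sketch using Python's standard base64 module follows; the names to_ascii_safe and from_ascii_safe are hypothetical, and the approach assumes the receiving side can be changed to decode the payload.

```python
import base64


def to_ascii_safe(text: str) -> str:
    """Encode text as UTF-8, then as base64, giving pure ASCII output."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_ascii_safe(payload: str) -> str:
    """Reverse to_ascii_safe; validate=True rejects non-base64 characters."""
    return base64.b64decode(payload, validate=True).decode("utf-8")
```

    The output uses only letters, digits, "+", "/" and "=", so it contains neither zero bytes nor high-bit bytes, at the cost of roughly a third more data on the wire.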
