UTF-8 Character Count

时光说笑 2021-01-23 10:17

I'm programming something that counts the number of UTF-8 characters in a file. I've already written the base code, but now I'm stuck in the part where the characters are su

4 answers
  •  情深已故
    2021-01-23 10:47

    See: https://en.wikipedia.org/wiki/UTF-8#Encoding

    Each UTF-8 sequence consists of one leading byte followed by zero to three continuation bytes. A continuation byte always starts with the bits 10, and a leading byte never does. You can use that to count only the leading byte of each UTF-8 sequence.

        if((b&0xC0) != 0x80) {
            count++;
        }
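
For context, the check above could be wrapped in a complete counting routine along these lines. This is only a sketch under the same assumption as the answer (the input is valid UTF-8). The names `utf8_count_buf` and `utf8_count_stream` and the 4096-byte chunk size are made up for illustration and are not from the question's code.

```c
#include <stddef.h>
#include <stdio.h>

/* Count the bytes in buf that are NOT continuation bytes (10xxxxxx).
 * For valid UTF-8 this equals the number of encoded code points.
 * Hypothetical helper name; assumes valid input, no error checking. */
size_t utf8_count_buf(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Apply the same test to a whole stream, reading it in chunks.
 * The test looks at one byte at a time, so a multi-byte sequence
 * split across two chunks is still counted exactly once. */
size_t utf8_count_stream(FILE *fp)
{
    unsigned char chunk[4096];
    size_t count = 0;
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        count += utf8_count_buf(chunk, n);
    }
    return count;
}
```

Open the file in binary mode (`fopen(path, "rb")`) before passing it to `utf8_count_stream`, so that no newline translation changes the bytes. The result is a count of code points, one per UTF-8 sequence.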
    

    Keep in mind that this will break if the file contains invalid UTF-8 sequences. Also, "UTF-8 characters" might mean different things. For example "
