UTF-8 Character Count

时光说笑 2021-01-23 10:17

I'm programming something that counts the number of UTF-8 characters in a file. I've already written the base code but now, I'm stuck in the part where the characters are su

4 answers

臣服心动 (original poster)

2021-01-23 10:40
You could look at the specification, RFC 3629: https://tools.ietf.org/html/rfc3629.

Section 3 contains this table:
```
   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
```
You could inspect the leading byte of each sequence to see how many bytes it spans, and from that count (or rebuild) the Unicode characters.
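
As a rough illustration of that idea, here is a minimal Python sketch that classifies each leading byte according to the table and steps over the whole sequence. The function names `utf8_sequence_length` and `count_utf8_characters` are made up for this example, and the code assumes the input is already well-formed UTF-8: it does not check continuation bytes, overlong encodings, or truncated sequences.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length implied by a leading byte (RFC 3629 table)."""
    if (lead & 0x80) == 0x00:  # 0xxxxxxx
        return 1
    if (lead & 0xE0) == 0xC0:  # 110xxxxx
        return 2
    if (lead & 0xF0) == 0xE0:  # 1110xxxx
        return 3
    if (lead & 0xF8) == 0xF0:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never occurs in UTF-8.
    raise ValueError(f"invalid UTF-8 leading byte: 0x{lead:02X}")


def count_utf8_characters(data: bytes) -> int:
    """Count code points by jumping from one leading byte to the next.

    Assumption: `data` is well-formed UTF-8; continuation bytes are skipped
    without being validated.
    """
    count = 0
    i = 0
    while i < len(data):
        i += utf8_sequence_length(data[i])
        count += 1
    return count
```

To apply this to a file, the contents would be read in binary mode (for example with `open(path, "rb").read()`, where `path` is whatever file you are counting) and passed to `count_utf8_characters`.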

A separate question is whether you count a base character and its accent (a combining mark, cf. https://en.wikipedia.org/wiki/Combining_character) as one character or as several.
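
To make the difference concrete, the following sketch uses Python's standard `unicodedata` module to skip code points whose general category is a mark (`Mn`, `Mc`, `Me`). The function name `count_without_marks` is invented for this example. This is only an approximation of "characters as a reader sees them"; full grapheme cluster segmentation (for instance emoji joined with zero-width joiners) needs more rules than this.

```python
import unicodedata


def count_without_marks(text: str) -> int:
    """Count code points, ignoring those in the Unicode mark categories.

    Assumption: treating every non-mark code point as one character is
    good enough; this is not full grapheme cluster segmentation.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e9"  # the single code point U+00E9

print(len(decomposed), count_without_marks(decomposed))
print(len(precomposed), count_without_marks(precomposed))
```

Both strings render the same way, but the decomposed form holds two code points while the precomposed form holds one, so a plain code point count and a mark-aware count can disagree on the same visible text.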