I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) an algorithm for computing a string's UTF-16 length directly from its UTF-8 bytes.
It's not a full algorithm, but if I understand correctly, the rules for each lead byte are as follows:
0xxxxxxx adds 2 bytes (1 UTF-16 code unit)
110xxxxx or 1110xxxx adds 2 bytes (1 UTF-16 code unit)
11110xxx adds 4 bytes (2 UTF-16 code units, a surrogate pair)
Continuation bytes (10xxxxxx) can be skipped
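The rules above can be sketched as a straightforward scalar loop. This is my own illustration, not tested against a real library; the function name is made up, and it assumes the input is already well-formed UTF-8 (no validation):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the UTF-16 code units needed for a
   well-formed UTF-8 string, without fully decoding codepoints.
   Multiply by 2 for the byte count. */
size_t utf16_units_from_utf8(const uint8_t *s, size_t len)
{
    size_t units = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = s[i];
        if ((b & 0xC0) == 0x80)        /* 10xxxxxx: continuation, skip */
            continue;
        if ((b & 0xF8) == 0xF0)        /* 11110xxx: 4-byte sequence -> surrogate pair */
            units += 2;
        else                           /* 0xxxxxxx, 110xxxxx, 1110xxxx -> 1 unit */
            units += 1;
    }
    return units;
}
```

For example, "abc" gives 3 units, a 2-byte sequence like U+00E9 (0xC3 0xA9) gives 1, and a 4-byte sequence like U+1F600 (0xF0 0x9F 0x98 0x80) gives 2.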
I'm not a C expert, but this looks easily vectorizable.
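One way to see why it vectorizes: the count reduces to (number of non-continuation bytes) + (number of 11110xxx lead bytes), both of which are bitwise tests with no cross-byte dependencies. Here is a hedged SWAR (SIMD-within-a-register) sketch of that idea, processing 8 bytes per iteration; it again assumes well-formed UTF-8, and a real SIMD version would use 16- or 32-byte registers and a hardware popcount instead:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable bit-counting fallback; compilers typically have a builtin. */
static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

size_t utf16_units_swar(const uint8_t *s, size_t len)
{
    const uint64_t hi = 0x8080808080808080ULL;
    size_t units = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        /* Continuation bytes 10xxxxxx: bit 7 set, bit 6 clear. */
        uint64_t cont = w & ~(w << 1) & hi;
        /* 4-byte leads 11110xxx: top four bits all set (true for
           valid UTF-8, where 0xF5..0xFF never appear). */
        uint64_t lead4 = w & (w << 1) & (w << 2) & (w << 3) & hi;
        units += 8 - popcount64(cont) + popcount64(lead4);
    }
    for (; i < len; i++) {           /* scalar tail */
        uint8_t b = s[i];
        if ((b & 0xC0) != 0x80)
            units += ((b & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```

The shifts bleed bits across byte boundaries, but only into positions that the 0x80 mask discards, so each byte is tested independently.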