I've seen some very clever code out there for converting between Unicode codepoints and UTF-8, so I was wondering if anybody has (or would enjoy devising) an algorithm for computing a string's UTF-16 length directly from its UTF-8 bytes.
It's not a full algorithm, but if I understand correctly, the rules for each lead byte are as follows:
0xxxxxxx adds 2 bytes (1 UTF-16 code unit)
110xxxxx or 1110xxxx adds 2 bytes (1 UTF-16 code unit)
11110xxx adds 4 bytes (2 UTF-16 code units, a surrogate pair)
Continuation bytes (10xxxxxx) can be skipped
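The rules above can be sketched as a straightforward scalar loop. This is my own illustration, not tested against a real library; the function name is made up, and it assumes the input is already well-formed UTF-8 (no validation):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the UTF-16 code units needed for a
   well-formed UTF-8 string, without fully decoding codepoints.
   Multiply by 2 for the byte count. */
size_t utf16_units_from_utf8(const uint8_t *s, size_t len)
{
    size_t units = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = s[i];
        if ((b & 0xC0) == 0x80)        /* 10xxxxxx: continuation, skip */
            continue;
        if ((b & 0xF8) == 0xF0)        /* 11110xxx: 4-byte sequence -> surrogate pair */
            units += 2;
        else                           /* 0xxxxxxx, 110xxxxx, 1110xxxx -> 1 unit */
            units += 1;
    }
    return units;
}
```

For example, "abc" gives 3 units, a 2-byte sequence like U+00E9 (0xC3 0xA9) gives 1, and a 4-byte sequence like U+1F600 (0xF0 0x9F 0x98 0x80) gives 2.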
I'm not a C expert, but this looks easily vectorizable.
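One way to see why it vectorizes: the count reduces to (number of non-continuation bytes) + (number of 11110xxx lead bytes), both of which are bitwise tests with no cross-byte dependencies. Here is a hedged SWAR (SIMD-within-a-register) sketch of that idea, processing 8 bytes per iteration; it again assumes well-formed UTF-8, and a real SIMD version would use 16- or 32-byte registers and a hardware popcount instead:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable bit-counting fallback; compilers typically have a builtin. */
static int popcount64(uint64_t x)
{
    int n = 0;
    while (x) { x &= x - 1; n++; }
    return n;
}

size_t utf16_units_swar(const uint8_t *s, size_t len)
{
    const uint64_t hi = 0x8080808080808080ULL;
    size_t units = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        /* Continuation bytes 10xxxxxx: bit 7 set, bit 6 clear. */
        uint64_t cont = w & ~(w << 1) & hi;
        /* 4-byte leads 11110xxx: top four bits all set (true for
           valid UTF-8, where 0xF5..0xFF never appear). */
        uint64_t lead4 = w & (w << 1) & (w << 2) & (w << 3) & hi;
        units += 8 - popcount64(cont) + popcount64(lead4);
    }
    for (; i < len; i++) {           /* scalar tail */
        uint8_t b = s[i];
        if ((b & 0xC0) != 0x80)
            units += ((b & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```

The shifts bleed bits across byte boundaries, but only into positions that the 0x80 mask discards, so each byte is tested independently.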