C library to convert Unicode code points to UTF-8?

I have to go through some text and write UTF-8 output according to the character patterns. I thought it'll be easy if I can work with the code points and get it converted to

4 answers

孤独总比滥情好

2020-12-15 00:26
Converting Unicode code points to UTF-8 is so trivial that making the call to a library probably takes more code than just doing it yourself:
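```
/* Assumed: c is the code point (unsigned), b is an unsigned char * output cursor. */
if (c<0x80) *b++=c;                              /* 1 byte: ASCII */
else if (c<0x800) *b++=192+c/64, *b++=128+c%64;  /* 2 bytes */
else if (c-0xd800u<0x800) goto error;            /* surrogates are not scalar values */
else if (c<0x10000) *b++=224+c/4096, *b++=128+c/64%64, *b++=128+c%64;  /* 3 bytes */
else if (c<0x110000) *b++=240+c/262144, *b++=128+c/4096%64, *b++=128+c/64%64, *b++=128+c%64;  /* 4 bytes */
else goto error;                                 /* beyond U+10FFFF */
```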
Also, doing it yourself means you can tune the API to the type of work you need (character-at-a-time, or long strings?). You can remove the error cases if you know your input is a valid Unicode scalar value.
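To make the API choice concrete, here is one possible way to wrap the same arithmetic as a character-at-a-time function and then build a string-level helper on top of it. This is only a sketch: the names `utf8_encode` and `utf8_encode_string`, the return conventions, and the use of `uint32_t` for code points are my assumptions, not part of the snippet above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point into b (room for 4 bytes needed).
   Returns the number of bytes written, or 0 if c is not a Unicode scalar value. */
size_t utf8_encode(uint32_t c, unsigned char *b)
{
    if (c < 0x80) {
        b[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        b[0] = (unsigned char)(0xC0 | (c >> 6));
        b[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c - 0xD800u < 0x800)
        return 0;                       /* surrogate code point */
    if (c < 0x10000) {
        b[0] = (unsigned char)(0xE0 | (c >> 12));
        b[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        b[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    if (c < 0x110000) {
        b[0] = (unsigned char)(0xF0 | (c >> 18));
        b[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        b[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        b[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;                           /* beyond U+10FFFF */
}

/* Hypothetical helper: encode n code points into dst (capacity cap bytes).
   Returns the number of bytes written, or (size_t)-1 if a code point is
   invalid or dst is too small. No terminating NUL is added. */
size_t utf8_encode_string(const uint32_t *src, size_t n,
                          unsigned char *dst, size_t cap)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char tmp[4];
        size_t k = utf8_encode(src[i], tmp);
        if (k == 0 || k > cap - used)
            return (size_t)-1;
        for (size_t j = 0; j < k; j++)
            dst[used++] = tmp[j];
    }
    return used;
}
```

If the input is already known to be valid, the two `return 0` error paths in `utf8_encode` can be dropped, as noted above.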

The other direction is a good bit harder to get correct. I recommend a finite automaton approach rather than the typical bit-arithmetic loops that sometimes decode invalid sequences as aliases for real characters (which is very dangerous and can lead to security problems).
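As a rough illustration of the state-machine idea, the sketch below decodes a single sequence while tracking how many continuation bytes are still expected and which byte range the next one may fall in. The function name `utf8_decode` and its return convention are my own assumptions; this is one possible shape for such a decoder, not a reference implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s[0..n).
   Stores the code point in *out and returns the number of bytes consumed,
   or 0 if the input is empty, truncated, or invalid. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp = 0;
    size_t needed = 0;                    /* continuation bytes still expected */
    unsigned lower = 0x80, upper = 0xBF;  /* allowed range for the next byte */

    for (size_t i = 0; i < n; i++) {
        unsigned char c = s[i];
        if (i == 0) {
            if (c < 0x80) {
                *out = c;
                return 1;
            } else if (c >= 0xC2 && c <= 0xDF) {
                needed = 1;
                cp = c & 0x1F;
            } else if (c >= 0xE0 && c <= 0xEF) {
                if (c == 0xE0) lower = 0xA0;  /* rejects overlong 3-byte forms */
                if (c == 0xED) upper = 0x9F;  /* rejects surrogates */
                needed = 2;
                cp = c & 0x0F;
            } else if (c >= 0xF0 && c <= 0xF4) {
                if (c == 0xF0) lower = 0x90;  /* rejects overlong 4-byte forms */
                if (c == 0xF4) upper = 0x8F;  /* rejects values above U+10FFFF */
                needed = 3;
                cp = c & 0x07;
            } else {
                return 0;  /* stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF */
            }
        } else {
            if (c < lower || c > upper)
                return 0;
            lower = 0x80;                 /* later bytes use the normal range */
            upper = 0xBF;
            cp = (cp << 6) | (c & 0x3Fu);
            if (--needed == 0) {
                *out = cp;
                return i + 1;
            }
        }
    }
    return 0;                             /* empty or truncated input */
}
```

Because the range check depends on the lead byte, overlong forms such as `C0 80` and encoded surrogates such as `ED A0 80` are rejected instead of being decoded as aliases for other characters.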

Even if you do end up going with a library, I think you should either try writing it yourself first or at least seriously study the UTF-8 specification before going further. A lot of bad design can come from treating UTF-8 as a black box when the whole point is that it's not a black box but was created to have very powerful properties, and too many programmers new to UTF-8 fail to see this until they've worked with it a lot themselves.

粉色の甜心

2020-12-15 00:29

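iconv could be used, I figure.

```
#include <iconv.h>
#include <wchar.h>

iconv_t cd;
char out[7];
wchar_t in = CODE_POINT_VALUE;
char *inptr = (char *)&in, *outptr = out;     /* iconv advances these pointers */
size_t inlen = sizeof(in), outlen = sizeof(out);

cd = iconv_open("utf-8", "wchar_t");          /* (iconv_t)-1 on failure */
iconv(cd, &inptr, &inlen, &outptr, &outlen);  /* (size_t)-1 on failure */
iconv_close(cd);
```

But I fear that wchar_t might not represent Unicode code points, but arbitrary values. EDIT: I guess you can do it by simply using a Unicode source:

```
uint16_t in = UNICODE_POINT_VALUE;
cd = iconv_open("utf-8", "ucs-2");
```

Note that a uint16_t UCS-2 source can only hold code points up to U+FFFF; covering the full range would need a 32-bit source with a UTF-32 encoding name.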
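For completeness, here is a sketch of the iconv route with error checking and a 32-bit source. The function name `codepoint_to_utf8_iconv` is hypothetical, and I am assuming the local iconv accepts the encoding names "UTF-8" and "UTF-32LE" (glibc and GNU libiconv do, as far as I know). The code point is serialized byte by byte in little-endian order so that the explicit "UTF-32LE" name is correct regardless of the host's byte order.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert one code point to UTF-8 using iconv.
   Returns the number of bytes written to out, or -1 on failure. */
int codepoint_to_utf8_iconv(uint32_t cp, char *out, size_t outsz)
{
    /* Little-endian byte image of the code point, independent of the host. */
    unsigned char in[4] = {
        (unsigned char)(cp & 0xFF),
        (unsigned char)((cp >> 8) & 0xFF),
        (unsigned char)((cp >> 16) & 0xFF),
        (unsigned char)((cp >> 24) & 0xFF)
    };
    char *inptr = (char *)in;
    char *outptr = out;
    size_t inleft = sizeof in;
    size_t outleft = outsz;

    iconv_t cd = iconv_open("UTF-8", "UTF-32LE");   /* (to, from) */
    if (cd == (iconv_t)-1)
        return -1;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1 || inleft != 0)
        return -1;                      /* invalid input or out too small */
    return (int)(outsz - outleft);
}
```

On some platforms this needs to be linked with `-liconv`. Opening and closing a conversion descriptor per code point is wasteful, so for long strings it would make more sense to open it once and feed the whole buffer.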

终归单人心

2020-12-15 00:37

Which platform? On Windows, you can use WideCharToMultiByte(CP_UTF8,...)

However, the source code point must first be encoded as UTF-16, which means you must be able to do that encoding yourself. For code points above U+FFFF, which need surrogate pairs, it's not trivial.
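The UTF-16 step can be sketched in portable C as follows. The function name `utf16_encode` and its return convention are assumptions made for illustration; the resulting units are what you would then hand to WideCharToMultiByte.

```c
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2),
   or 0 if cp is not a Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp - 0xD800u < 0x800 || cp > 0x10FFFF)
        return 0;                       /* surrogate or out of range */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00.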

My understanding is that you have some text in a given code page and you want to convert it to Unicode (UTF-16), right? A MultiByteToWideChar(codePage, sourceText,...) / WideCharToMultiByte(CP_UTF8, utf16Text,...) round trip will do the trick.

情歌与酒

2020-12-15 00:40

libiconv.
