Strings and character encoding in C++

轮回少年 2021-01-01 22:04

I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasona

3 answers

慢半拍i (original poster)

2021-01-01 22:44

The C++ standard does not specify which character encoding std::string, std::wstring, etc. must use. A common approach is to store Unicode in wide strings. Which types and encodings you should use depends on your requirements.

If you only need to pass data from A to B, choose std::string with UTF-8 encoding (don't introduce a new type, just use std::string). If you must work with the strings (extract, concatenate, sort, ...), choose std::wstring with UCS-2/UTF-16 (BMP only) on Windows and UCS-4/UTF-32 on Linux. The benefit is the fixed width: each character occupies exactly 2 bytes (or 4 for UCS-4), whereas a UTF-8 std::string uses 1 to 4 bytes per character, so length() returns the number of bytes rather than the number of characters.
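To make the length() point concrete, here is a minimal sketch that counts code points in a UTF-8 std::string by skipping continuation bytes. The helper name utf8_code_points is my own, and it assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that
// does NOT match that pattern starts a new code point.
// Assumes the input is valid UTF-8; malformed input is not detected.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "naïve" (written as "na\xC3\xAF" "ve"), length() gives 6 while utf8_code_points gives 5. Note that a code point is still not always what a user perceives as one character (combining marks, for example).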

For conversion, you can check whether sizeof(std::wstring::value_type) is 2 or 4 to choose between UCS-2 and UCS-4. I use the ICU library, but there may be simpler wrapper libraries.
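The sizeof check could be sketched roughly like this. The enum WideEncoding and the function wide_encoding are names I made up for illustration; they are not part of the standard library or ICU.

```cpp
#include <string>

// Hypothetical names, for illustration only.
enum class WideEncoding { Ucs2OrUtf16, Ucs4OrUtf32, Unknown };

// Picks the wide encoding from the size of wchar_t at compile time:
// 2 bytes (typical on Windows) or 4 bytes (typical on Linux).
constexpr WideEncoding wide_encoding() {
    return sizeof(std::wstring::value_type) == 2 ? WideEncoding::Ucs2OrUtf16
         : sizeof(std::wstring::value_type) == 4 ? WideEncoding::Ucs4OrUtf32
         : WideEncoding::Unknown;
}
```

Because the function is constexpr, the result can also be used in a static_assert or to select a conversion routine at compile time.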

Deriving from std::string is not recommended because basic_string is not designed for inheritance (it has no virtual destructor or other virtual members). If you really need your own type such as std::basic_string<my_char_type>, write a custom specialization for it (in practice this also means supplying a matching std::char_traits for my_char_type).

The C++11 standard (formerly C++0x) defines wstring_convert<> and wbuffer_convert<> to convert between a narrow and a wide character set (for example UTF-8 to UCS-2) using a std::codecvt facet. Visual Studio 2010 already implements them, as far as I know. Note that both were deprecated in C++17.
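A minimal sketch of a UTF-8 round trip with wstring_convert might look like the following. The function names utf8_to_wide and wide_to_utf8 are my own. Since these facilities were deprecated in C++17, newer compilers may emit deprecation warnings for this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 bytes -> wide string.
// std::codecvt_utf8<wchar_t> yields UCS-2 when wchar_t is 16 bits and
// UCS-4 when it is 32 bits. from_bytes throws std::range_error if the
// input cannot be converted.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Hypothetical helper: wide string -> UTF-8 bytes.
std::string wide_to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.to_bytes(wide);
}
```

With this, the 6-byte UTF-8 string "na\xC3\xAF" "ve" becomes a 5-element std::wstring, and converting it back gives the original bytes.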
