substr with characters instead of bytes

Question

Suppose I have a string s = "101870002PTäPO PVä #Person Tätigkeitsdarstellung 001100001&0111010101101870100092001000010"

When I call s.substr(30, 40), it returns " #Person Tätigkeitsdarstellung", which begins with a space. I guess it is counting bytes instead of characters.

The string should be 110 characters long, but s.length() and s.size() return 113 because of the 3 special characters.

Is there a way to avoid this leading space in the return value?

Thanks for your help!

Answer 1:

In UTF-8, the code point (character) ä consists of two code units, each of which is 1 byte. C++ has no support for treating strings as sequences of code points. Therefore, as far as the standard library is concerned, std::string("ä").size() is 2.
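To make the byte/code point difference concrete, here is a minimal sketch that counts code points in a UTF-8 std::string. The name utf8_length is a hypothetical helper, not a standard library function, and it assumes the input is valid UTF-8. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes `s` is valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte starts a new code point
        }
    }
    return count;
}
```

For a string holding the two bytes of ä, size() reports 2 while utf8_length reports 1. For the string in the question, it should come out three lower than size(), one for each ä.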

A simple approach is to use std::wstring. It uses a character type (wchar_t) that is at least as wide as the widest character set supported by the system. Therefore, if the system supports an encoding wide enough to represent any (non-composite) Unicode character with a single code unit, the string methods behave as you would expect. Currently UTF-32 is wide enough, and it is supported by (most?) Unix-like operating systems.

Note that on Windows, wchar_t is only 16 bits wide and holds UTF-16, not UTF-32. If you choose the wstring approach, port your program to Windows, and a user enters Unicode characters that need more than 2 bytes (more than one UTF-16 code unit), the assumption of one code unit per code point no longer holds.
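The effect can be shown portably with char16_t and char32_t strings, used here as stand-ins for a 16-bit and a 32-bit wchar_t (that mapping is an assumption for illustration only). The variable names are made up for this sketch.

```cpp
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane.
const std::u16string utf16_text = u"\U0001F600"; // stored as a surrogate pair
const std::u32string utf32_text = U"\U0001F600"; // stored as one code unit
```

Here utf16_text.size() is 2 and utf32_text.size() is 1, although both hold the same single code point.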

The wstring approach also does not take control characters or composite (combining) characters into consideration.
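As a small illustration of the composite-character caveat, the same visible ä can be written either as one precomposed code point or as a base letter followed by a combining mark. The variable names below are made up for this sketch.

```cpp
#include <string>

const std::u32string precomposed = U"\u00E4";  // ä as a single code point
const std::u32string decomposed  = U"a\u0308"; // a + combining diaeresis
```

Even with 32-bit code units, precomposed.size() is 1 and decomposed.size() is 2, and the two strings do not compare equal without normalization.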

Here's a little test program that converts a std::string containing the multi-byte UTF-8 character ä to a wstring:

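#include <codecvt>
#include <iostream>
#include <locale>
#include <string>
int main() {
    std::string foo("ä"); // read however you want
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    std::wstring wfoo = converter.from_bytes(foo); // UTF-8 bytes -> wide characters
    std::cout << foo.size() << '\n' << wfoo.size() << '\n'; // 2 and 1 on my system
}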

Unfortunately, libstdc++ has not implemented <codecvt> (introduced in C++11), at least as of GCC 4.8. If you can't require libc++, similar functionality is probably available in Boost.Locale.

Alternatively, if you wish to keep your code portable to systems that don't support UTF-32, you can keep using std::string and use an external library for iterating, counting, and similar operations. Here's one: http://utfcpp.sourceforge.net/ and another: http://site.icu-project.org/. I believe this is the recommended approach.
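To show the idea behind such libraries without depending on one, here is a rough sketch of a substr that counts code points instead of bytes. utf8_substr is a hypothetical helper, not part of utfcpp or ICU, and it assumes the input is valid UTF-8. It also does nothing about combining characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of utfcpp or ICU. Assumes `s` is valid UTF-8.
// `pos` and `len` are measured in code points, like std::string::substr but
// without throwing: an out-of-range `pos` yields an empty string.
std::string utf8_substr(const std::string& s, std::size_t pos, std::size_t len) {
    auto is_continuation = [](unsigned char c) { return (c & 0xC0) == 0x80; };

    // Advance from byte offset `from` past `n` code points.
    auto advance = [&](std::size_t from, std::size_t n) {
        std::size_t i = from;
        while (i < s.size() && n > 0) {
            ++i;  // step past the lead byte
            while (i < s.size() && is_continuation(s[i])) {
                ++i;  // and past its continuation bytes
            }
            --n;
        }
        return i;
    };

    const std::size_t begin = advance(0, pos);
    const std::size_t end = advance(begin, len);
    return s.substr(begin, end - begin);
}
```

With s = "PVä #Person" stored as UTF-8, utf8_substr(s, 4, 7) returns "#Person", whereas s.substr(4, 7) would start one byte early and return " #Perso", which is the kind of stray leading space described in the question.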

Source: https://stackoverflow.com/questions/25116276/substr-with-characters-instead-of-bytes

Tags

c++

string

substring

special-characters

substr