How to get the accurate length of a std::string?

Question

I am trimming a long std::string to fit it in a text container using this code.

std::string AppDelegate::getTrimmedStringWithRange(std::string text, int range)
{
    if (text.length() > range)
    {
        std::string str(text,0,range-3);
        return str.append("...");
    }
    return text;
}

But for other languages, such as Hindi ("हिन्दी"), the length reported by std::string is wrong.

My question is: how can I retrieve the accurate length of a std::string in all cases?

Thanks

Answer 1:

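Assuming you're using UTF-8, you can convert your string to a simple (hah!) fixed-width Unicode encoding, UTF-32, and count the code points. I grabbed this example from Rosetta Code. Note that <codecvt> and std::wstring_convert are deprecated since C++17.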

#include <codecvt>
#include <iostream>
#include <locale>   // std::wstring_convert
#include <string>

int main()
{
    // UTF-8 bytes for U+007A, U+00DF, U+6C34, U+1D10B
    std::string utf8 = "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b";
    std::cout << "Byte length: " << utf8.size() << '\n';
    // Decode to UTF-32 so that each element is exactly one code point
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::cout << "Character length: " << conv.from_bytes(utf8).size() << '\n';
}

Answer 2:

The length of std::string is not "wrong"; you've simply misunderstood what it means. A std::string stores bytes, not "characters" in your chosen encoding. It gleefully has no knowledge of that layer. As such, the length of std::string is the number of bytes it contains.

To count such "characters", you will need a library that supports analysis of your chosen encoding, whatever that is.

Only if your chosen encoding uses exactly one byte per character, as plain ASCII does, can you just count the bytes and be done with it.

Answer 3:

As explained in the comments, length() returns the number of bytes in your string, which is encoded in UTF-8. In this multibyte encoding, non-ASCII characters are encoded in 2 to 4 bytes, so the byte length of your UTF-8 string will be larger than the actual number of Unicode characters.
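To make the byte counts above concrete, here is a minimal sketch of how the first byte of a UTF-8 sequence announces the sequence length. The helper name utf8_sequence_length is made up for this illustration, and the function does not fully validate its input.

```cpp
#include <cstddef>

// Hypothetical helper (not from the answer): number of bytes in the UTF-8
// sequence that starts with `lead`. Returns 0 for continuation bytes
// (10xxxxxx) and for 0xF8..0xFF. It does not reject every invalid lead
// byte (0xC0, for example, is still reported as 2).
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: includes Devanagari
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx: beyond U+FFFF
    return 0;                             // not a lead byte
}
```

Each Devanagari code point falls in the three-byte range, which is why a short Hindi word reports a byte length three times its code point count.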

Solution 1

If you have many long strings, you can keep them in UTF-8. The UTF-8 encoding makes it relatively easy to spot the additional bytes of multibyte characters: they all have the form 10xxxxxx in binary. So count the number of such continuation bytes and subtract it from the string length.

// Fragment: assumes <algorithm>, <iostream>, <string>, using namespace std, and a UTF-8 string s
cout << "Bytes: " << s.length() << endl;
cout << "Unicode length " << (s.length() - count_if(s.begin(), s.end(), [](char c)->bool { return (c & 0xC0) == 0x80; })) << endl;
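The same continuation-byte test can be applied to the trimming function from the question. The following is only a sketch under the assumption that the input is valid UTF-8; the names trim_utf8 and is_continuation are invented for this example.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (10xxxxxx).
static bool is_continuation(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: if `text` has more than `max_points` code points,
// keep the first (max_points - 3) of them and append "...".
// Assumes `text` is valid UTF-8.
std::string trim_utf8(const std::string& text, std::size_t max_points)
{
    const std::size_t keep = max_points > 3 ? max_points - 3 : 0;
    std::size_t points = 0;  // code points seen so far
    std::size_t cut = 0;     // byte offset where the kept prefix ends

    for (std::size_t i = 0; i < text.size(); ++i) {
        if (is_continuation(text[i])) continue;  // still inside a code point
        if (points == keep) cut = i;             // a new code point starts here
        ++points;
    }
    if (points <= max_points) return text;       // already short enough
    return text.substr(0, cut) + "...";
}
```

Because the cut always lands on a lead byte, the result stays valid UTF-8. It still counts code points rather than what a reader perceives as characters: Devanagari combines several code points into one visible cluster, so a cut can fall inside such a cluster. Like the original function, it returns a bare "..." when max_points is smaller than 3.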

Solution 2

If more processing is needed than just counting the length, consider using wstring_convert::from_bytes() from the standard library to convert your string into a wstring. The length of the wstring should be what you expect.

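// Fragment: assumes <codecvt>, <locale>, <iostream>, <string>, using namespace std, and a UTF-8 string s
wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cv;
wstring w = cv.from_bytes(s);
cout << "Unicode length " << w.length() << endl;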

Caution: on Linux, wstring is based on a 32-bit wchar_t, and one such wide character can hold any Unicode code point, so this works well. On Windows, however, wchar_t is only 16 bits, so some characters still require a multi-word encoding. Fortunately, all the Hindi characters are in the range U+0000 to U+D7FF, which can be encoded in one 16-bit word, so it should be fine there as well.
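To illustrate the 16-bit caveat, here is a small sketch that counts how many 16-bit units a sequence of code points would occupy in UTF-16. The function names are invented for this example, and the input is assumed to contain only valid code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: UTF-16 code units needed for one code point.
// Code points above U+FFFF need a surrogate pair, i.e. two 16-bit units.
std::size_t utf16_units(char32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}

// Hypothetical helper: total UTF-16 code units for a UTF-32 string.
std::size_t utf16_length(const std::u32string& s)
{
    std::size_t n = 0;
    for (char32_t cp : s) n += utf16_units(cp);
    return n;
}
```

For Devanagari text both counts agree, since every code point fits in one unit. They diverge only for code points such as U+1D10B, which take two units.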

Source: https://stackoverflow.com/questions/31652407/how-to-get-the-accurate-length-of-a-stdstring

Tags

c++

string

std