libxml2 xmlChar * to std::wstring

Submitted by 为君一笑 on 2019-12-03 20:34:37

xmlStrlen() returns the number of UTF-8 code units (bytes) in the xmlChar* string. That will generally not equal the number of wchar_t code units needed once the data is converted, so do not use xmlStrlen() to size your wchar_t string. Instead, call std::mbstowcs() once with a null destination to get the required length, allocate the memory, and call std::mbstowcs() again to fill it. You also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (the locale is process-wide, so changing it may not be a good idea, especially if multiple threads are involved). For example:

#include <clocale>
#include <cstdlib>
#include <string>
#include <libxml/xmlstring.h>

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { std::abort(); } //provided string was null

    std::wstring wideString;

    if (xmlStrlen(xmlString) > 0)
    {
        //copy the name: the pointer returned by setlocale() may be invalidated by the next call
        std::string origLocale = std::setlocale(LC_CTYPE, NULL);
        if (!std::setlocale(LC_CTYPE, "en_US.UTF-8")) { std::abort(); } //UTF-8 locale not available

        const char *src = (const char*) xmlString;
        size_t wcharLength = std::mbstowcs(NULL, src, 0); //excludes null terminator
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            std::mbstowcs(&wideString[0], src, wcharLength);
        }

        std::setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { std::abort(); } //mbstowcs failed
    }

    return wideString;
}

A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead, so you do not have to deal with locales (be aware that both were deprecated in C++17):

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{    
    if (!xmlString) { abort(); } //provided string was null
    try
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return conv.from_bytes((const char*)xmlString);
    }
    catch(const std::range_error& e)
    {
        abort(); //wstring_convert failed
    }
}

An alternative option is to use an actual Unicode library, such as ICU or iconv, to handle Unicode conversions.

There are some problems in the code in the question, besides the fact that you are using wchar_t and std::wstring, which is a bad idea unless you're making calls to the Windows API.

  1. xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (a.k.a. bytes) in a string. It does not count the number of characters. This is all stuff in the documentation.

  2. Counting characters will not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted isn't the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, making it 100% useless for portable code.

  3. The mbstowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!

  4. This code will leak memory if the std::wstring constructor throws an exception.
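To make points 1 and 2 concrete, here is a minimal sketch that counts the same string three ways. The function names are made up for illustration, and the code assumes well-formed, null-terminated UTF-8; it performs no validation.

```cpp
#include <cstddef>
#include <cstring>

// Number of UTF-8 code units (bytes). This is the quantity xmlStrlen() reports.
std::size_t utf8_code_units(const char *s)
{
    return std::strlen(s);
}

// Number of Unicode code points: count every byte that is not a
// continuation byte (continuation bytes have the form 10xxxxxx).
std::size_t utf8_code_points(const char *s)
{
    std::size_t n = 0;
    for (; *s; ++s)
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++n;
    return n;
}

// Number of UTF-16 code units (what a 16-bit wchar_t string needs):
// a lead byte of 0xF0 or above starts a 4-byte sequence, i.e. a code point
// above U+FFFF, which takes a surrogate pair.
std::size_t utf16_code_units(const char *s)
{
    std::size_t n = 0;
    for (; *s; ++s) {
        unsigned char c = static_cast<unsigned char>(*s);
        if ((c & 0xC0) != 0x80)
            n += (c >= 0xF0) ? 2 : 1;
    }
    return n;
}
```

For the string "h\xC3\xA9\xF0\x9F\x98\x80" (h, é, and U+1F600) the three functions return 7, 3 and 4 respectively, so neither the byte count nor the character count is the right buffer size on a platform with a 16-bit wchar_t.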

My recommendations:

  1. Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).

  2. If you need UTF-32, then use std::u32string. Remember that wstring has a platform-dependent encoding: it could be a variable-length encoding (Windows) or fixed-length (Linux, OS X).

  3. If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:

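    #include <windows.h>
    #include <climits>
    #include <cstring>
    #include <stdexcept>
    #include <string>

    std::wstring utf8_to_wstring(const char *utf8)
    {
        size_t utf8len = std::strlen(utf8);
        if (utf8len == 0)
            return std::wstring(); // MultiByteToWideChar() rejects a zero length
        if (utf8len > static_cast<size_t>(INT_MAX))
            throw std::length_error("UTF-8 input too long");
        // First call: ask how many wchar_t units are needed
        int wclen = MultiByteToWideChar(
            CP_UTF8, 0, utf8, static_cast<int>(utf8len), NULL, 0);
        if (wclen == 0)
            throw std::runtime_error("MultiByteToWideChar failed");
        // Second call: convert straight into the string's own buffer,
        // so there is no raw array to leak if something throws
        std::wstring wstr(wclen, L'\0');
        MultiByteToWideChar(
            CP_UTF8, 0, utf8, static_cast<int>(utf8len), &wstr[0], wclen);
        return wstr;
    }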
    
  4. If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv().
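For recommendation 4, a rough sketch of the iconv() route might look like the following. The function name is hypothetical, and the code assumes a glibc-style iconv() whose input parameter is char ** (some older implementations declare it const char **) and an implementation that accepts the "WCHAR_T" encoding name.

```cpp
#include <iconv.h>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

std::wstring utf8_to_wstring_iconv(const char *utf8)
{
    // iconv_open() takes the target encoding first, then the source
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::size_t inLeft = std::strlen(utf8);
    // One wchar_t per input byte is always enough: a UTF-8 sequence never
    // produces more wchar_t units than it has bytes.
    std::wstring out(inLeft, L'\0');
    char *inPtr = const_cast<char *>(utf8);
    char *outPtr = reinterpret_cast<char *>(&out[0]);
    std::size_t outLeft = out.size() * sizeof(wchar_t);

    std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed (invalid or incomplete UTF-8?)");

    // Trim the part of the buffer that iconv() did not fill
    out.resize(out.size() - outLeft / sizeof(wchar_t));
    return out;
}
```

A real program that converts many strings would probably open the conversion descriptor once and reuse it rather than calling iconv_open() per string.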

Remember: You probably don't want wchar_t or std::wstring. What wchar_t does portably isn't useful, and making it useful isn't portable. C'est la vie.
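Going back to recommendation 2, if what you actually want is one element per code point, a hand-rolled UTF-8 to std::u32string decoder is short enough to sketch. The function name is made up for illustration, and this is a sketch rather than a replacement for a tested Unicode library; it rejects truncated, overlong, surrogate and out-of-range sequences by throwing std::range_error.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

std::u32string utf8_to_u32string(const std::string &in)
{
    // Smallest code point that legitimately needs 0, 1, 2 or 3 continuation bytes
    static const char32_t minValue[] = {0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra; // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::range_error("invalid UTF-8 lead byte");

        if (i + extra >= in.size())
            throw std::range_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::range_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minValue[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::range_error("overlong or out-of-range UTF-8 sequence");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Since xmlChar is a typedef for unsigned char, an xmlChar* string can be passed in as std::string(reinterpret_cast<const char*>(xmlString)).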
