XMLCh to wchar_t and vice versa

Posted by 好久不见 on 2021-01-24 12:20:36

Question


My config:

  • Compiler: gnu gcc 4.8.2
  • I compile with C++11
  • platform/OS: Linux 64bit Ubuntu 14.04.1 LTS

I want to feed a method with a wchar_t* and use it in the many Xerces library methods that need an XMLCh*, but I don't know how to translate one to the other. It's easy with char*, but I need wide characters. Under Windows I could simply cast one to the other, but that doesn't work on my Linux machine, so somehow I have to translate the wchar_t* to an XMLCh* manually.

I link against libxerces-c-3.1.so, which uses XMLCh* exclusively. XMLCh can hold wide characters, but I don't know how to produce one from a wchar_t*, or how to get a wchar_t* back from an XMLCh*.

I developed this but it doesn't work (here I return a std::wstring, which is easier to clean up than a raw pointer):

static inline std::wstring XMLCh2W(const XMLCh* tagname)
{
    XMLSize_t len1 = XMLString::stringLen(tagname);
    XMLSize_t outLen = len1 * 4;
    XMLByte utf8[outLen + 1];          // GCC variable-length array extension
    XMLSize_t charsEaten = 0;
    XMLTransService::Codes failReason; // Ok | UnsupportedEncoding | InternalFailure | SupportFilesNotFound
    XMLTranscoder* transcoder = XMLPlatformUtils::fgTransService->makeNewTranscoderFor("UTF-8", failReason, 16*1024);

    XMLSize_t utf8Len = transcoder->transcodeTo(tagname, len1, utf8, outLen, charsEaten, XMLTranscoder::UnRep_Throw); // or UnRep_RepChar
    delete transcoder;

    utf8[utf8Len] = 0;
    // I'm not sure this is actually ok to do: the buffer holds UTF-8 bytes, not wchar_t.
    std::wstring wstr((wchar_t*)utf8);
    return wstr;
}

Answer 1:


No, you can't do that under GCC, because GCC defines wchar_t as a 32-bit type holding UTF-32/UCS-4 (the difference doesn't matter for practical purposes), while Xerces-C defines XMLCh as a 16-bit UTF-16 code unit.

The best I've found is to use the C++11 support for UTF-16 strings:

  • char16_t and XMLCh are equivalent, though not implicitly convertible; you still need to cast between them. But at least a cast is cheap compared to transcoding.
  • std::basic_string<char16_t> is the equivalent string type.
  • Use literals of the form u"str" and u's'.

Unfortunately, VC++ doesn't support the C++11 UTF-16 literals, though wchar_t literals are UTF-16 encoded. So I end up with something like this in a header:

#if defined _MSC_VER
#define U16S(x) L##x
typedef wchar_t my_u16_char_t;
typedef std::wstring my_u16_string_t;
typedef std::wstringstream my_u16_sstream_t;
inline const XMLCh* XmlString(const my_u16_char_t* s) { return s; }
inline const XMLCh* XmlString(const my_u16_string_t& s) { return s.c_str(); }
#elif defined __linux
#define U16S(x) u##x
typedef char16_t my_u16_char_t;
typedef std::basic_string<my_u16_char_t> my_u16_string_t;
typedef std::basic_stringstream<my_u16_char_t> my_u16_sstream_t;
inline const XMLCh* XmlString(const my_u16_char_t* s) { return reinterpret_cast<const XMLCh*>(s); }
inline const XMLCh* XmlString(const my_u16_string_t& s) { return XmlString(s.c_str()); }
#endif

It is, IMO, rather a mess, but not one I can see getting sorted out until VC++ supports C++11 Unicode literals, allowing Xerces to be rewritten in terms of char16_t directly.




Answer 2:


XMLCh is defined as wchar_t (on Windows) or uint16_t (on Linux), and it is encoded as UTF-16.

Unfortunately, gcc 4.8.2 does not support std::wstring_convert for converting between Unicode encodings, but you can use Boost's boost::locale::conv::utf_to_utf() to convert to and from XMLCh.

#include <boost/locale.hpp>

static inline std::wstring XMLCh2W(const XMLCh* xmlchstr)
{
    std::wstring wstr = boost::locale::conv::utf_to_utf<wchar_t>(xmlchstr);
    return wstr;
}

static inline std::basic_string<XMLCh> W2XMLCh(const std::wstring& wstr)
{
    std::basic_string<XMLCh> xmlstr = boost::locale::conv::utf_to_utf<XMLCh>(wstr);
    return xmlstr;
}

If you want a raw wchar_t* or XMLCh*, use the c_str() method, as below.

const wchar_t* wcharPointer = wstr.c_str();
const XMLCh* xmlchPointer = xmlstr.c_str();



Answer 3:


I recently dealt with this issue, and now that Visual Studio 2015 supports Unicode character and string literals, it is pretty easy to handle in a cross-platform way. I use the following macro and static_assert to guarantee correctness:

#define CONST_XMLCH(s) reinterpret_cast<const ::XMLCh*>(u ## s)

static_assert(sizeof(::XMLCh) == sizeof(char16_t), 
    "XMLCh is not sized correctly for UTF-16.");

Example of usage:

const XMLCh* features = CONST_XMLCH("Core");
auto impl = DOMImplementationRegistry::getDOMImplementation(features);

This works because Xerces defines an XMLCh to be 16 bits wide and to hold a UTF-16 string value, which perfectly matches the definition given by the standard for a string literal prefixed by u. The compiler doesn't know this, and won't implicitly convert between char16_t* and XMLCh*, but you can get around that with a reinterpret_cast. And if for whatever reason you try to compile Xerces on a platform where the sizes don't match up, the static_assert will fail and draw attention to the problem.



Source: https://stackoverflow.com/questions/25839725/xmlch-to-wchar-t-and-vice-versa
