Unable to extract Unicode symbols from C++ std::string

偶尔善良 posted on 2021-01-28 11:22:06

Question


I want to read a C++ std::string and pass it to a function that analyses it and extracts the Unicode symbols and plain ASCII symbols from it.

I searched many tutorials online, and all of them said that standard C++ does not fully support Unicode. Many of them recommended using the ICU C++ library.

This is my C++ program for understanding the basics of the above. It reads the raw string, converts it to an ICU UnicodeString, and prints it:

#include <iostream>
#include <string>
#include "unicode/unistr.h"

int main()
{
    std::string s="Hello☺";
    // at this point s contains a line of text
    // which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
      ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
}

Expected Output:

Hello☺

Actual Output:

Hello?

Please tell me what I am doing wrong. Alternative or simpler approaches are also welcome.

Thanks

Update 1 (Older): The working code is as follows:

#include <clocale>   // setlocale
#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"

void f(const std::string & s)
{
  std::wcout << "Inside called function" << std::endl;
  constexpr char locale_name[] = "";
  setlocale( LC_ALL, locale_name );
  std::locale::global(std::locale(locale_name));
  std::ios_base::sync_with_stdio(false);
  std::wcin.imbue(std::locale());
  std::wcout.imbue(std::locale());

  // at this point s contains a line of text which may be ANSI or UTF-8 encoded

  // convert std::string to ICU's UnicodeString
  icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

  // convert UnicodeString to std::wstring
  std::wstring ws;
  for (int i = 0; i < ucs.length(); ++i)
    ws += static_cast<wchar_t>(ucs[i]);

  std::wcout << ws << std::endl;
}

int main()
{
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "Inside main function" << std::endl;

    std::string s=u8"hello☺";
    // at this point s contains a line of text which may be ANSI or UTF-8 encoded

    // convert std::string to ICU's UnicodeString
    icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s.c_str()));

    // convert UnicodeString to std::wstring
    std::wstring ws;
    for (int i = 0; i < ucs.length(); ++i)
      ws += static_cast<wchar_t>(ucs[i]);

    std::wcout << ws << std::endl;
    std::wcout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}

Now the expected output and the actual output are the same, i.e.:

Inside main function
hello☺
--------------------------------
Inside called function
hello☺

Update 2 (Latest): The code in Update 1 does not work for symbols such as 😆 that lie outside the Basic Multilingual Plane and therefore take two UTF-16 code units. The working code for all Unicode symbols is as follows. Special thanks to @Botje for his solution. I wish I could give it more than one tick! :)

#include <clocale>   // setlocale
#include <iostream>
#include <string>
#include <locale>
#include "unicode/unistr.h"
#include "unicode/ustream.h"

void f(const std::u32string & s)
{
  std::wcout << "INSIDE CALLED FUNCTION:" << std::endl;

  icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
  std::cout << "Unicode string is: " << ustr << std::endl;

  std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;

  std::cout << "Individual characters of the string are:" << std::endl;
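  // char32At() takes a UTF-16 code unit offset, so step by whole code points
  for(int32_t i = 0; i < ustr.length(); i = ustr.moveIndex32(i, 1))
    std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;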

  std::cout << "--------------------------------" << std::endl;
}

int main()
{
    std::cout << "--------------------------------" << std::endl;
    constexpr char locale_name[] = "";
    setlocale( LC_ALL, locale_name );
    std::locale::global(std::locale(locale_name));
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(std::locale());
    std::wcout.imbue(std::locale());

    std::wcout << "INSIDE MAIN FUNCTION:" << std::endl;

    std::u32string s=U"hello☺😆";

    icu::UnicodeString ustr = icu::UnicodeString::fromUTF32(reinterpret_cast<const UChar32 *>(s.c_str()), s.size());
    std::cout << "Unicode string is: " << ustr << std::endl;

    std::cout << "Size of unicode string = " << ustr.countChar32() << std::endl;

    std::cout << "Individual characters of the string are:" << std::endl;
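    // char32At() takes a UTF-16 code unit offset, so step by whole code points
    for(int32_t i = 0; i < ustr.length(); i = ustr.moveIndex32(i, 1))
      std::cout << icu::UnicodeString(ustr.char32At(i)) << std::endl;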

    std::cout << "--------------------------------" << std::endl;
    f(s);
    return 0;
}

Now the expected output and the actual output are the same, i.e.:

--------------------------------
INSIDE MAIN FUNCTION:
Unicode string is: hello☺😆
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
😆
--------------------------------
INSIDE CALLED FUNCTION:
Unicode string is: hello☺😆
Size of unicode string = 7
Individual characters of the string are:
h
e
l
l
o
☺
😆
--------------------------------
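
Why ☺ survived the loop in Update 1 while 😆 did not can be seen by encoding both as UTF-16 by hand. The following is a minimal sketch in standard C++ only; to_utf16 is a hypothetical helper written for this illustration, not an ICU function, and it assumes its argument is a valid code point outside the surrogate range.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 (hypothetical helper, not part of ICU).
// Assumes cp is a valid code point and not itself a surrogate (U+D800..U+DFFF).
std::u16string to_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single code unit is enough.
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: split the value into a surrogate pair.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

With this helper, to_utf16(0x263A) (☺) has size 1, while to_utf16(0x1F606) (😆) has size 2 and holds 0xD83D followed by 0xDE06. Copying those two units one at a time into a wider wchar_t yields two lone surrogates rather than one character, which is also why countChar32() reports 7 for "hello☺😆" even though the string holds 8 UTF-16 code units.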


Answer 1:


There are a number of stumbling blocks to get this right:

  • First, your source file (and the smiley face in it) should be encoded as UTF-8. The smiley face should consist of the literal bytes 0xE2 0x98 0xBA.
  • You should mark the string as containing UTF-8 data using the u8 prefix: u8"Hello☺"
  • Next, the documentation of icu::UnicodeString remarks that it stores Unicode as UTF-16. In this case you are lucky, as U+263A fits in one UTF-16 code unit. Other emoji might not! You should either convert the string to UTF-32, or be very careful and use the char32At function.
  • Finally, the encoding used by wcout should be configured with imbue to match the encoding expected by your environment. See the answers to this question.
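
Since the question also asks for a simpler alternative, here is a minimal sketch that splits a UTF-8 std::string into code points using only the standard library. decode_utf8 is a hypothetical helper, not something from ICU or the original answer. It assumes the input is meant to be UTF-8, substitutes U+FFFD for malformed or truncated sequences, and does not reject overlong forms or encoded surrogates, so it is not a validating decoder.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into code points (hypothetical helper for illustration).
// Malformed or truncated sequences become U+FFFD; overlong encodings and
// encoded surrogates are NOT rejected.
std::u32string decode_utf8(const std::string& bytes)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = replacement;
        std::size_t len = 1;
        // The lead byte determines the sequence length and the first payload bits.
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        // Otherwise: stray continuation or invalid lead byte, cp stays U+FFFD.

        if (i + len > bytes.size()) {   // sequence cut off at end of input
            out += replacement;
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);                    // append 6 payload bits
        }
        if (ok) { out += cp; i += len; }
        else    { out += replacement; i += 1; }  // resync on the next byte
    }
    return out;
}
```

For example, decode_utf8("Hello\xE2\x98\xBA") yields six code points, the last being U+263A, and the four bytes 0xF0 0x9F 0x98 0x86 decode to the single code point U+1F606. Code points with a value below 0x80 are the plain ASCII symbols. This only covers splitting into code points; printing, normalization, or grouping into user-perceived characters are still jobs for a library such as ICU.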


Source: https://stackoverflow.com/questions/60092291/unable-to-extract-unicode-symbols-from-c-stdstring
