C / C++ UTF-8 upper/lower case conversions

只谈情不闲聊 submitted on 2019-11-28 23:20:24

Lowercase sharp s: ß; uppercase sharp s: ẞ. Did you use the uppercase version in your assert? It seems glibc 2.14 follows pre-Unicode 5.1 rules, which have no uppercase version of sharp s, while the libc on the other machine uses Unicode 5.1, where ẞ = U+1E9E ...

Maybe someone will find this useful (perhaps for tests).

With this you could make a simple converter :) No additional libs :)

http://pastebin.com/fuw4Uizk

1482 letters

Example

Ь <> ь
Э <> э
Ю <> ю
Я <> я
Ѡ <> ѡ
Ѣ <> ѣ
Ѥ <> ѥ
Ѧ <> ѧ
Ѩ <> ѩ
Ѫ <> ѫ
Ѭ <> ѭ
Ѯ <> ѯ
Ѱ <> ѱ
Ѳ <> ѳ
Ѵ <> ѵ
Ѷ <> ѷ
Ѹ <> ѹ
Ѻ <> ѻ
Ѽ <> ѽ
Ѿ <> ѿ
Ҁ <> ҁ
Ҋ <> ҋ
Ҍ <> ҍ
Ҏ <> ҏ
Ґ <> ґ
Ғ <> ғ
Ҕ <> ҕ
Җ <> җ
Ҙ <> ҙ
Қ <> қ
Ҝ <> ҝ
Ҟ <> ҟ
Ҡ <> ҡ
Ң <> ң
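A table like this could drive a small converter along the following lines. This is only a sketch under stated assumptions: the names kToLower, decode_utf8, encode_utf8 and to_lower_utf8 are hypothetical, the table holds just a handful of the pairs listed above, it only lowercases, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical excerpt of the upper -> lower table; a real converter would
// load every pair from the full list.
static const std::unordered_map<char32_t, char32_t> kToLower = {
    {0x042C, 0x044C},  // Ь -> ь
    {0x042D, 0x044D},  // Э -> э
    {0x042E, 0x044E},  // Ю -> ю
    {0x042F, 0x044F},  // Я -> я
    {0x0490, 0x0491},  // Ґ -> ґ
};

// Decode one code point starting at s[i] and advance i past it.
// Assumes well-formed UTF-8; no validation is performed.
char32_t decode_utf8(const std::string& s, std::size_t& i) {
  unsigned char b = static_cast<unsigned char>(s[i++]);
  if (b < 0x80) return b;
  int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
  char32_t cp = b & (0x3F >> extra);
  while (extra-- > 0 && i < s.size()) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  }
  return cp;
}

// Append code point cp to out as UTF-8.
void encode_utf8(char32_t cp, std::string& out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Lowercase a UTF-8 string using only the table; unknown code points pass through.
std::string to_lower_utf8(const std::string& s) {
  std::string out;
  for (std::size_t i = 0; i < s.size();) {
    char32_t cp = decode_utf8(s, i);
    auto it = kToLower.find(cp);
    encode_utf8(it != kToLower.end() ? it->second : cp, out);
  }
  return out;
}
```

An uppercasing variant would use the same loop with the pairs reversed. Because the lookup is per code point, it cannot express one-to-many mappings such as ß to SS.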

The following C++11 code works for me (disregarding for a moment the question of how the sharp s should be translated; it's left unchanged. It's slowly being phased out from German anyway).

Optimizations and uppercasing the first letter only are left as an exercise.

Edit: As pointed out, <codecvt> has been deprecated (as of C++17). It should remain in the standard, however, until a suitable replacement is defined. See the question "Deprecated header <codecvt> replacement".

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

std::locale const utf8("en_US.UTF-8");

// Convert UTF-8 byte string to wstring
std::wstring to_wstring(std::string const& s) {
  std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
  return conv.from_bytes(s);
}

// Convert wstring to UTF-8 byte string
std::string to_string(std::wstring const& s) {
  std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
  return conv.to_bytes(s);
}

// Converts a UTF-8 encoded string to upper case
std::string tou(std::string const& s) {
  auto ss = to_wstring(s);
  for (auto& c : ss) {
    c = std::toupper(c, utf8);
  }
  return to_string(ss);
}

void test_utf8(std::ostream& os) {
  os << tou("foo" ) << std::endl;
  os << tou("#foo") << std::endl;
  os << tou("ßfoo") << std::endl;
  os << tou("Éfoo") << std::endl;
}    

int main() {
  test_utf8(std::cout);
}
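The "first letter only" exercise mentioned above might look roughly like this. It is a sketch, not part of the original answer: capitalize_first is a hypothetical helper that works on an already decoded wide string and takes the locale as a parameter, so it can be combined with the to_wstring/to_string helpers and the utf8 locale from the listing.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: uppercases only the first wide character of s,
// using the ctype facet of the given locale. The rest is left untouched.
std::wstring capitalize_first(std::wstring s, std::locale const& loc) {
  if (!s.empty()) {
    s[0] = std::toupper(s[0], loc);
  }
  return s;
}
```

With the listing above, a call such as to_string(capitalize_first(to_wstring(s), utf8)) would apply it to a UTF-8 byte string. The result for non-ASCII letters still depends on what the chosen locale defines.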
unwind

What do you expect the upper-case version of the German ß character to be, for that test case?

In other words, your basic assumptions are wrong.

Note that the Wikipedia article linked in the comment states:

Sharp s is nearly unique among the letters of the Latin alphabet in that it has no traditional upper case form (one of the few other examples is kra, ĸ, which was used in Greenlandic). This is because it never occurs initially in German text, and traditional German printing (which used blackletter) never used all-caps. When using all-caps, the current spelling rules require the replacement of ß with SS.[1] However, in 2010 its use became mandatory in official documentation when writing geographical names in all-caps.[2]

So the basic test case, with the sharp s occurring as an initial, violates the rules of German. I still think I have a point, in that the original poster's premise is wrong: strings cannot in general be freely converted between upper and lower case, for all languages.

The issue is that the locales on which the assert does not fire are compliant, while the locales on which it does fire are non-compliant.

Technical Report N897 requires this in B.1.2 [LC_CTYPE Rationale]:

As the LC_CTYPE character classes are based on the C Standard character-class definition, the category does not support multicharacter elements. For instance, the German character ß is traditionally classified as a lowercase letter. There is no corresponding uppercase letter; in proper capitalization of German text the ß will be replaced by SS; i.e., by two characters. This kind of conversion is outside the scope of the toupper and tolower keywords.

This Technical Report was published on December 25, 2001. But according to https://en.wikipedia.org/wiki/Capital_%E1%BA%9E:

In 2010, the use of the capital ẞ became mandatory in official documentation in Germany when writing geographical names in all-caps

But the topic has not been revisited by the standards committee, so technically, independent of what the German government says, the standardized behavior of toupper should be to make no changes to the ß character.
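Since the two-character replacement is outside the scope of toupper, it has to be done by hand if you want it. A minimal sketch, assuming a hypothetical helper named upper_german that works on a wide string and expands ß itself while delegating everything else to std::towupper:

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: uppercases a wide string, expanding ß (U+00DF) to "SS".
// All other characters go through std::towupper, so the result for non-ASCII
// letters depends on the current C locale.
std::wstring upper_german(const std::wstring& in) {
  std::wstring out;
  out.reserve(in.size());
  for (wchar_t c : in) {
    if (c == L'\u00DF') {
      out += L"SS";  // one character becomes two
    } else {
      out += static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(c)));
    }
  }
  return out;
}
```

Note that the output can be longer than the input, which is exactly why a character-by-character interface like toupper cannot express this conversion.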

The reason this works inconsistently across machines is setlocale:

Installs the specified system locale or its portion as the new C locale

So it is the non-compliant system locale, en_US.utf8, that is instructing toupper to modify the ß character. Unfortunately, the specialization ctype<char>::classic_table is not available on ctype<wchar_t>, so you cannot modify the behavior. That leaves you with two options:

  1. Create a const map<wchar_t, wchar_t> for conversion from every possible lowercase wchar_t to the corresponding uppercase wchar_t
  2. Add a check for L'ß' like this:

    int ret = wcrtomb(buf, wChar == L'ß' ? L'ẞ' : towupper(wChar), &state);
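The first option could be sketched as follows. This is a hybrid rather than a complete table: kUpperOverrides and to_upper_char are hypothetical names, the map holds only the one pair discussed here, and anything not in the map falls back to std::towupper instead of being listed explicitly.

```cpp
#include <cwctype>
#include <map>

// Hypothetical override table; add any pair for which the system locale's
// answer is not the one you want.
static const std::map<wchar_t, wchar_t> kUpperOverrides = {
    {L'\u00DF', L'\u1E9E'},  // ß -> ẞ
};

// Look the character up in the table first, then fall back to towupper.
wchar_t to_upper_char(wchar_t c) {
  auto it = kUpperOverrides.find(c);
  if (it != kUpperOverrides.end()) {
    return it->second;
  }
  return static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(c)));
}
```

Keeping the exceptions in a table makes the result for those characters independent of the installed locale, at the cost of maintaining the table yourself.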
    

Live Example
