How to uppercase/lowercase UTF-8 characters in C++?

野性不改 2021-02-19 09:59

Let's imagine I have a UTF-8 encoded std::string containing the following:

óó

and I'd like to convert it to the following:

4 Answers

梦毁少年i (OP)

2021-02-19 10:25
These case-insensitive features are definitely needed in search facilities.

I have the same need as described above. UTF-8 is pretty smooth in most ways, but the upper and lower case situation is not that great. It looks like it fell off the to-do list, even though it used to be one of the major topics on such lists. I patched the IBM keyboard driver in 1984, before IBM shipped it but when copies were already available. I also patched DisplayWrite 1 and 3 (the PC-DOS word processor) before IBM wanted to ship in Europe, and did an awful lot of conversion between PC-DOS (CP850) or Windows (CP1252) and the national EBCDIC code pages of IBM 3270 mainframe terminal systems. They all had this case-sensitivity topic on the to-do list. In all national ASCII versions and in the CP1252 Windows tables (but not in PC-DOS CP850), upper and lower case sit a fixed 0x20 apart: 0x40-0x5F versus 0x60-0x7F.

What to do about it?

tolower() and toupper() do not work on multi-byte UTF-8 characters, that is, on anything outside US-ASCII, because they only look at one byte at a time. A string-level solution does work, though, and there are solutions for just about everything else.
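To make the one-byte limitation concrete, here is a minimal sketch, assuming the default "C" locale. The function name UpperBytewise is made up for this illustration. It runs toupper() over every byte and counts how many bytes actually changed.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: apply toupper() to each byte in place and
   return the number of bytes that changed. */
size_t UpperBytewise(unsigned char *s)
{
    size_t changed = 0;
    for (; *s; s++) {
        unsigned char u = (unsigned char)toupper(*s);
        if (u != *s) {
            *s = u;
            changed++;
        }
    }
    return changed;
}
```

For the input "aó" (bytes 61 C3 B3) only the ASCII 'a' is changed. The two bytes of ó are each outside the a-z range, so toupper() leaves them alone.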

Western Europeans are lucky

Unicode puts Latin-1 (ISO 8859-1, which CP1252 matches for the letters at 0xC0-0xFF) right after ASCII as its first additional table, the Latin-1 Supplement block, as is. In UTF-8 all of these letters share the first byte 0xC3, so the second byte can be shifted by 0x20 just like regular US-ASCII. Code sample below.
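The arithmetic behind that shift can be sketched as follows. EncodeUtf8TwoByte is a hypothetical helper written for this example. It implements the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx) for code points U+0080 to U+07FF.

```c
/* Hypothetical helper: encode a code point in U+0080..U+07FF
   as two UTF-8 bytes (110xxxxx 10xxxxxx). */
void EncodeUtf8TwoByte(unsigned cp, unsigned char out[2])
{
    out[0] = (unsigned char)(0xc0 | (cp >> 6));    /* top 5 bits */
    out[1] = (unsigned char)(0x80 | (cp & 0x3f));  /* low 6 bits */
}
```

Ó (U+00D3) encodes to C3 93 and ó (U+00F3) to C3 B3. The 0x20 difference between the code points lands entirely in the second byte, because both letters fall under the same first byte 0xC3.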

Greeks, Russians, Icelanders and Eastern Europeans are not that lucky

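The D with stroke, Đ/đ (U+0110/U+0111), is not in CP1252 at all. It lives in Latin Extended-A, where upper and lower case differ by 1 rather than 0x20. The Icelandic eth, Ð/ð (U+00D0/U+00F0, the th sound of the word "the"), looks similar but is part of Latin-1 and does shift by 0x20.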

For the Greeks, Russians and Eastern Europeans, the 8-bit charsets (CP1253, CP1251 and CP1257) could have been used directly, the way Latin-1 was. Then plain shifting would also have worked. Instead the tables are laid out so that, in UTF-8, a case pair is not always 0x20 apart in the second byte and can even sit under a different first byte (much like the PC-DOS 8-bit ASCII tables).

There is only one working solution, the same as for PC-DOS ASCII: make translation tables. I will do it by next Christmas (when I will need it badly), but here is a hint on how to do it if someone else is in a hurry.

How to do solutions for the Greeks, Russians, Icelanders and Eastern Europeans

Make a separate table in the code for each first byte that UTF-8 uses for Eastern European, Greek and Cyrillic letters. Index each table by the second byte, and store at every uppercase position the second byte of the matching lowercase letter. Then make another set of tables for the other direction. Where the other case sits under a different first byte (for example Cyrillic Р is D0 A0 and р is D1 80), the table entry has to hold both bytes.

Then work out which first byte belongs to which table. That way the code can select the right table, read the right position, and get the upper or lower case character it needs. Finally, modify the letter case functions below (the ones I made for Latin-1) to use the tables instead of shifting by 0x20 for those first bytes where tables are required. It will run smoothly, and modern computers have no problem with the memory and processing power involved.
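The table idea above can be sketched as follows for one first byte, 0xD0 (Cyrillic U+0400 to U+043F), lowercase direction only. This is a sketch under assumptions, not a complete solution: the names Utf8Pair, g_lowerD0, InitLowerD0 and StrToLwrUtf8Cyrillic are made up here, and the table is filled by a loop from the code point ranges rather than typed in by hand. Each entry holds both bytes, because some lowercase letters move to the first byte 0xD1.

```c
/* The two UTF-8 bytes of one lowercase letter. */
typedef struct { unsigned char lead, trail; } Utf8Pair;

/* Hypothetical table for first byte 0xD0, indexed by (second byte - 0x80). */
static Utf8Pair g_lowerD0[0x40];
static int g_lowerD0Ready = 0;

/* Fill the table from the code point ranges instead of typing 64 entries. */
static void InitLowerD0(void)
{
    unsigned i;
    for (i = 0; i < 0x40; i++) {
        unsigned cp = 0x400 + i;           /* code point of D0 (80+i) */
        if (cp <= 0x40f)
            cp += 0x50;                    /* U+0400..040F -> U+0450..045F */
        else if (cp <= 0x42f)
            cp += 0x20;                    /* U+0410..042F -> U+0430..044F */
        /* U+0430..043F is already lowercase and maps to itself */
        g_lowerD0[i].lead  = (unsigned char)(0xc0 | (cp >> 6));
        g_lowerD0[i].trail = (unsigned char)(0x80 | (cp & 0x3f));
    }
    g_lowerD0Ready = 1;
}

/* Lowercase, in place, every two-byte sequence that starts with 0xD0.
   All other bytes are left untouched. */
unsigned char *StrToLwrUtf8Cyrillic(unsigned char *pString)
{
    unsigned char *p;
    if (!pString)
        return pString;
    if (!g_lowerD0Ready)
        InitLowerD0();
    for (p = pString; *p; p++) {
        if (p[0] == 0xd0 && p[1] >= 0x80 && p[1] <= 0xbf) {
            Utf8Pair e = g_lowerD0[p[1] - 0x80];
            p[0] = e.lead;
            p[1] = e.trail;
            p++;                           /* skip the second byte */
        }
    }
    return pString;
}
```

With this, ПРИ (D0 9F, D0 A0, D0 98) becomes при (D0 BF, D1 80, D0 B8). The uppercase direction and the other first bytes would need their own tables built the same way.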

UTF8 letter case related functions Latin1 samples

This works, I believe, though I have only tested it briefly. It only covers the US-ASCII and Latin-1 parts of UTF-8.
```
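#include <stddef.h>  /* size_t */
#include <string.h>  /* strlen */

/* Lowercase in place: ASCII A-Z and the Latin-1 letters U+00C0..U+00DE
   (UTF-8: 0xC3 0x80..0x9E), leaving the multiplication sign U+00D7 alone. */
unsigned char *StrToLwrUft8Latin1(unsigned char *pString)
{
    int iAfterC3 = 0;  /* nonzero when the previous byte was the lead byte 0xC3 */
    if (pString) {
        unsigned char *p;
        for (p = pString; *p; p++) {
            if (iAfterC3) {
                /* second byte of a 0xC3 sequence: À..Þ -> à..þ, but not × */
                if (*p >= 0x80 && *p <= 0x9e && *p != 0x97)
                    *p += 0x20;
                iAfterC3 = 0;
            } else if (*p >= 'A' && *p <= 'Z') {
                *p += 0x20;  /* only letters, not '@' or '[' .. '_' */
            } else if (*p == 0xc3) {
                iAfterC3 = 1;
            }
        }
    }
    return pString;
}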
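/* Uppercase in place: ASCII a-z and the Latin-1 letters U+00E0..U+00FE
   (UTF-8: 0xC3 0xA0..0xBE), leaving the division sign U+00F7 alone.
   ß and ÿ have no uppercase form inside Latin-1 and are not changed. */
unsigned char *StrToUprUft8Latin1(unsigned char *pString)
{
    int iAfterC3 = 0;  /* nonzero when the previous byte was the lead byte 0xC3 */
    if (pString) {
        unsigned char *p;
        for (p = pString; *p; p++) {
            if (iAfterC3) {
                /* second byte of a 0xC3 sequence: à..þ -> À..Þ, but not ÷ */
                if (*p >= 0xa0 && *p <= 0xbe && *p != 0xb7)
                    *p -= 0x20;
                iAfterC3 = 0;
            } else if (*p >= 'a' && *p <= 'z') {
                *p -= 0x20;  /* only letters, not '`' or '{' .. '~' */
            } else if (*p == 0xc3) {
                iAfterC3 = 1;
            }
        }
    }
    return pString;
}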
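/* Helper (not in the original post): fold one byte to lower case.
   *pAfterC3 remembers whether the previous byte was the lead byte 0xC3. */
static unsigned char FoldByteLatin1(unsigned char c, int *pAfterC3)
{
    if (*pAfterC3) {
        *pAfterC3 = 0;
        if (c >= 0x80 && c <= 0x9e && c != 0x97)
            c += 0x20;  /* À..Þ -> à..þ, but not × */
    } else if (c == 0xc3) {
        *pAfterC3 = 1;
    } else if (c >= 'A' && c <= 'Z') {
        c += 0x20;
    }
    return c;
}
/* Case-insensitive compare of at most ztCount bytes. Bytes are compared
   unmasked, so 0xC3 is no longer mistaken for 'C', and an empty string
   no longer compares equal to a non-empty one. */
int StrnCiCmpLatin1(const char *s1, const char *s2, size_t ztCount)
{
    int iAfterC3a = 0, iAfterC3b = 0;
    if (s1 && s2) {
        for (; ztCount--; s1++, s2++) {
            unsigned char c1 = FoldByteLatin1((unsigned char)*s1, &iAfterC3a);
            unsigned char c2 = FoldByteLatin1((unsigned char)*s2, &iAfterC3b);
            if (c1 != c2 || !c1)
                return (int)c1 - (int)c2;
        }
    }
    return 0;
}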
int StrCiCmpLatin1(const char *s1, const char *s2)
{
    return StrnCiCmpLatin1(s1, s2, (size_t)(-1));
}
char *StrCiStrLatin1(const char *s1, const char *s2)
{
    if (s1 && *s1 && s2 && *s2) {
        char *p = (char *)s1;
        size_t len = strlen(s2);
        while (*p) {
            if (StrnCiCmpLatin1(p, s2, len) == 0)
                return p;
            p++;
        }
    }
    return (0);
}
```