C: reading non-ASCII characters

星月不相逢 2021-01-13 09:58

I am parsing a file that involves characters such as æ ø å. If we assume I have stored a line of the text file as follows<

3 answers

粉色の甜心 (original poster)

2021-01-13 10:08

Let's say you use UTF-8.

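You need to understand how UTF-8 works: each character is encoded as 1 to 4 bytes. The high bits of the first byte give the length of the sequence, and every following byte has the form 10xxxxxx. ASCII characters take one byte; æ, ø and å take two bytes each.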
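To make the byte layout concrete, here is a minimal sketch of how the first byte of a sequence maps to its length. The names `utf8_seq_len` and `first_char_bytes` are hypothetical helpers invented for this illustration, and the check only looks at the bit pattern (it does not reject overlong forms or out-of-range lead bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a UTF-8 lead byte,
   or 0 if the byte cannot start a sequence. Bit pattern check only. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* 10xxxxxx or 11111xxx */
}

/* Hypothetical helper: number of bytes occupied by the first character
   of s. The cast avoids sign extension where plain char is signed. */
size_t first_char_bytes(const char *s)
{
    assert(s != NULL);
    return utf8_seq_len((unsigned char)s[0]);
}
```

For example, `first_char_bytes("\xC3\xA6")` (the UTF-8 bytes of æ) gives 2, while `first_char_bytes("a")` gives 1.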

Here's a small function that should do what you want:

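int nbChars(const char *str) {
    int len = 0;
    int i = 0;
    int charSize = 0; // Bytes still expected for the current char

    if (!str)
        return -1;
    while (str[i])
    {
        // Work on an unsigned copy: plain char may be signed, which
        // makes shifts and comparisons on bytes >= 0x80 unreliable
        unsigned char c = (unsigned char)str[i];

        if (charSize == 0)
        {
            ++len;
            if ((c & 0x80) == 0x00) // 0xxxxxxx: ascii char
                charSize = 1;
            else if ((c & 0xE0) == 0xC0) // 110xxxxx
                charSize = 2;
            else if ((c & 0xF0) == 0xE0) // 1110xxxx
                charSize = 3;
            else if ((c & 0xF8) == 0xF0) // 11110xxx
                charSize = 4;
            else
                return -1; // stray continuation byte or invalid lead byte
        }
        else if ((c & 0xC0) != 0x80) // continuation bytes must be 10xxxxxx
            return -1;
        --charSize;
        ++i;
    }
    if (charSize != 0)
        return -1; // string ended in the middle of a char
    return len;
}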

It returns the number of characters (code points), or -1 if the string is not valid UTF-8.

(By "not valid" I mean that the byte format is wrong. I don't check whether each code point actually exists.)

EDIT: As stated in the comment section, this code doesn't handle decomposed Unicode: a base letter followed by a combining mark is counted as two characters.
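As a rough illustration of that limitation, the sketch below counts code points by skipping continuation bytes. `count_code_points` is a hypothetical helper written for this example, not the function from the answer, and it assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a string that is assumed to
   be valid UTF-8, by counting every byte that is not a continuation
   byte (10xxxxxx). */
size_t count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

With this helper, the precomposed å (U+00E5, bytes `"\xC3\xA5"`) counts as 1, while the decomposed form, `a` followed by U+030A COMBINING RING ABOVE (bytes `"a\xCC\x8A"`), counts as 2 even though both render as the same letter. Treating them as equal would require Unicode normalization, which is beyond what a byte-level counter can do.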
