C: reading non-ASCII characters

星月不相逢 2021-01-13 09:58

I am parsing a file that involves characters such as æ ø å. If we assume I have stored a line of the text file as follows<

3 answers
  •  粉色の甜心
    2021-01-13 10:08

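    Let's assume the file is encoded in UTF-8.

    You need to understand how UTF-8 works: a character takes 1 to 4 bytes, and the high bits of its first byte give the length. 0xxxxxxx is a single-byte ASCII character, 110xxxxx starts a 2-byte sequence, 1110xxxx a 3-byte one and 11110xxx a 4-byte one. Every following byte has the form 10xxxxxx. The characters æ, ø and å each take 2 bytes.

    Here's a small function that should do what you want: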

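    int nbChars(const char *str) {
        int len = 0;
        int i = 0;
        int charSize = 0; // Bytes still expected for the current char

        if (!str)
            return -1;
        while (str[i])
        {
            // Work on an unsigned value so the bit masks behave the same
            // whether plain char is signed or unsigned
            unsigned char c = (unsigned char)str[i];

            if (charSize == 0)
            {
                ++len;
                if ((c & 0x80) == 0x00) // 0xxxxxxx: ascii char
                    charSize = 1;
                else if ((c & 0xE0) == 0xC0) // 110xxxxx
                    charSize = 2;
                else if ((c & 0xF0) == 0xE0) // 1110xxxx
                    charSize = 3;
                else if ((c & 0xF8) == 0xF0) // 11110xxx
                    charSize = 4;
                else
                    return -1; // stray continuation byte or invalid lead byte
            }
            else if ((c & 0xC0) != 0x80) // continuation bytes must be 10xxxxxx
                return -1;
            --charSize;
            ++i;
        }
        if (charSize != 0)
            return -1; // string ended in the middle of a character
        return len;
    }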
    

    It returns the number of characters (code points), or -1 if the string is not valid UTF-8.

    (By invalid I mean that the byte format is wrong. I don't check whether the encoded code point actually exists.)
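If you also want to reject byte sequences that are well formed but do not encode a real code point, one option is to decode each sequence and check its value. The sketch below is only an illustration of that idea: `utf8_decode` and `utf8_count_strict` are hypothetical names, not part of the function above, and the checks cover overlong encodings, UTF-16 surrogates and values above U+10FFFF only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one UTF-8 sequence starting at s, stores
 * the code point in *cp and returns the number of bytes consumed.
 * Returns 0 if the sequence is malformed, overlong, a surrogate,
 * or above U+10FFFF. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const uint32_t min_cp[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t c;

    if ((p[0] & 0x80) == 0x00)      { n = 1; c = p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { n = 2; c = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; c = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; c = p[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead */

    for (i = 1; i < n; ++i) {
        if ((p[i] & 0xC0) != 0x80)  /* also stops at the terminating NUL */
            return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min_cp[n])              return 0; /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0; /* UTF-16 surrogate */
    if (c > 0x10FFFF)               return 0; /* outside Unicode */

    *cp = c;
    return n;
}

/* Counts code points in a NUL-terminated string.
 * Returns -1 if str is NULL or any sequence is rejected by utf8_decode. */
int utf8_count_strict(const char *str)
{
    int len = 0;
    uint32_t cp;
    size_t n;

    if (!str)
        return -1;
    while (*str) {
        n = utf8_decode(str, &cp);
        if (n == 0)
            return -1;
        str += n;
        ++len;
    }
    return len;
}
```

With this sketch, `utf8_count_strict("\xC3\xA6\xC3\xB8\xC3\xA5")` (the bytes of æøå) gives 3, while the overlong sequence `"\xC0\x80"` gives -1. It still does not tell you whether a code point is actually assigned in Unicode; that would need a character database.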

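    EDIT: As stated in the comment section, this code doesn't handle decomposed Unicode: a character stored as a base letter plus a combining mark, such as "a" followed by U+030A for å, is counted as two.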
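For the decomposed case, a rough workaround is to leave combining marks out of the count. The sketch below is an approximation under loud assumptions: it only knows the Combining Diacritical Marks block (U+0300 to U+036F, which is enough for the ring in å), it assumes the input is already valid UTF-8, and `is_combining_mark` and `nbCharsVisible` are hypothetical names. Proper handling would follow the Unicode grapheme cluster rules, usually through a library.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if the bytes at p encode a code point in
 * U+0300..U+036F. In UTF-8 that range is CC 80..CC BF and CD 80..CD AF.
 * p[0] must not be the terminating NUL, so reading p[1] is safe. */
int is_combining_mark(const unsigned char *p)
{
    if (p[0] == 0xCC)
        return p[1] >= 0x80 && p[1] <= 0xBF;
    if (p[0] == 0xCD)
        return p[1] >= 0x80 && p[1] <= 0xAF;
    return 0;
}

/* Counts code points but skips combining marks from that one block, so
 * "a" + U+030A counts as one character, like the precomposed form.
 * Assumes str is valid UTF-8; returns -1 only for a NULL pointer. */
int nbCharsVisible(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;
    int len = 0;

    if (!str)
        return -1;
    while (*p) {
        if ((*p & 0xC0) == 0x80) {  /* continuation byte: not a new char */
            ++p;
            continue;
        }
        if (!is_combining_mark(p))  /* lead byte of a non-combining char */
            ++len;
        ++p;
    }
    return len;
}
```

Here `nbCharsVisible("a\xCC\x8A")` (decomposed å) and `nbCharsVisible("\xC3\xA5")` (precomposed å) both give 1. Combining marks outside that block are still counted as separate characters.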
