I am parsing a file that involves characters such as æ ø å. If we assume I have stored a line of the text file as follows
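Let's assume the file is encoded in UTF-8.
You need to understand how UTF-8 works: each character (code point) is stored in 1 to 4 bytes, the high bits of the first byte tell you how many, and every following byte has the form 10xxxxxx.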
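To make the byte layout concrete, here is a minimal sketch that maps a first byte to the length of its sequence. The name `utf8SeqLen` is made up for this illustration and is not part of any library.

```c
/* Hypothetical helper (name chosen for this example): returns how many bytes
   a UTF-8 sequence occupies, judging only by its first byte.
   Returns 0 if the byte cannot start a sequence. */
int utf8SeqLen(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;                            /* 10xxxxxx (continuation) or invalid */
}
```

For instance, æ (U+00E6) is stored as the two bytes 0xC3 0xA6: the first byte matches the 110xxxxx pattern, and the second is a continuation byte, which cannot start a sequence on its own.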
Here's a small function that should do what you want:
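int nbChars(const char *str) {
    int len = 0;
    int i = 0;
    int charSize = 0; // Bytes left to read in the current character

    if (!str)
        return -1;
    while (str[i])
    {
        unsigned char c = (unsigned char)str[i]; // avoid surprises with signed char

        if (charSize == 0)
        {
            ++len;
            if ((c & 0x80) == 0x00)      // 0xxxxxxx: ASCII char
                charSize = 1;
            else if ((c & 0xE0) == 0xC0) // 110xxxxx: 2-byte char
                charSize = 2;
            else if ((c & 0xF0) == 0xE0) // 1110xxxx: 3-byte char
                charSize = 3;
            else if ((c & 0xF8) == 0xF0) // 11110xxx: 4-byte char
                charSize = 4;
            else
                return -1; // stray continuation byte or invalid first byte
        }
        else if ((c & 0xC0) != 0x80) // following bytes must be 10xxxxxx
            return -1;
        --charSize;
        ++i;
    }
    if (charSize != 0)
        return -1; // string ended in the middle of a character
    return len;
}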
It returns the number of characters (code points), or -1 if the string is not valid UTF-8.
(By invalid UTF-8 I mean that the byte format is wrong. I don't check whether the character actually exists.)
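If the input is already known to be well-formed UTF-8, a shorter variant is possible: count every byte that is not a continuation byte. This is a different technique from the function above, with no validation at all, and `countCodePoints` is a name invented for this sketch.

```c
#include <stddef.h>

/* Hypothetical helper (not the function from the answer): counts code points
   by counting the bytes that do NOT look like 10xxxxxx.
   Assumption: the input is valid UTF-8; malformed input is not detected. */
size_t countCodePoints(const char *str)
{
    size_t len = 0;

    if (!str)
        return 0;
    for (; *str; ++str)
    {
        if (((unsigned char)*str & 0xC0) != 0x80) /* first byte of a code point */
            ++len;
    }
    return len;
}
```

With the string "\xc3\xa6\xc3\xb8\xc3\xa5" (æøå in UTF-8), `strlen` reports 6 bytes while this function reports 3 characters.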
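EDIT: As stated in the comment section, this code doesn't handle decomposed Unicode: a character such as å written as 'a' followed by the combining ring U+030A is counted as two characters.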
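As a rough illustration of what handling the decomposed case involves, the sketch below skips code points in the Combining Diacritical Marks block (U+0300 to U+036F) while counting. It is only a partial workaround under loud assumptions: the input must already be valid UTF-8, only that one block is recognized, and `countWithoutCombining` is a made-up name. Proper grapheme segmentation follows Unicode Standard Annex #29 and usually calls for a library such as ICU.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in valid UTF-8, ignoring
   U+0300..U+036F so that 'a' + combining ring counts as one character.
   Assumption: input is well-formed; other combining ranges are NOT handled. */
size_t countWithoutCombining(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;
    size_t len = 0;

    if (!p)
        return 0;
    while (*p)
    {
        if ((*p & 0xC0) == 0x80) /* continuation byte of an earlier code point */
        {
            ++p;
            continue;
        }
        /* The whole block fits in two-byte sequences (U+0080..U+07FF). */
        if ((*p & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80)
        {
            unsigned cp = ((unsigned)(*p & 0x1F) << 6) | (unsigned)(p[1] & 0x3F);

            if (cp >= 0x0300 && cp <= 0x036F)
            {
                p += 2; /* combining mark: belongs to the previous character */
                continue;
            }
        }
        ++len;
        ++p;
    }
    return len;
}
```

Both the precomposed form "\xc3\xa5" and the decomposed form "a\xcc\x8a" of å then count as one character.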