How to compare multibyte characters in C


Question


I'm trying to parse text and find certain characters in it. I use the code below. It works with plain ASCII characters like abcdef, but it does not work with öçşğüı, and GCC gives compilation warnings. What should I do to make it work with öçşğüı?

Code:

#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main()
{
    char * text = "öçşğü";
    int i=0;

    text = strdup(text);

    while (text[i])
    {       
        if(text[i] == 'ö')
        {
            printf("ö \n");
        }

        i++;
    }

    return 0;
}

Warnings:

warning: multi-character character constant [-Wmultichar]
warning: comparison is always false due to limited range of data type [-Wtype-limits]

There are 10 entries when I print each char inside the while loop:

printf("%d : %p \n", i, text[i]);

Output:

0 : 0xffffffc3 
1 : 0xffffffb6 
2 : 0xffffffc3 
3 : 0xffffffa7 
4 : 0xffffffc5 
5 : 0xffffff9f 
6 : 0xffffffc4 
7 : 0xffffff9f 
8 : 0xffffffc3 
9 : 0xffffffbc 

and strlen is 10.

But if I use abcde:

0 : 0x61 
1 : 0x62 
2 : 0x63 
3 : 0x64 
4 : 0x65 

and strlen is 5.


If I use wchar_t for text, the output is:

0 : 0xa7c3b6c3 
1 : 0x9fc49fc5 
2 : 0xbcc3 

and strlen is 10, wcslen is 3.


Answer 1:


To go through each of the characters in the string, you can use mblen. You also need to set the correct locale (matching the encoding of the multibyte string) so that mblen can parse the multibyte string correctly.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    char * text = "öçşğü";
    int i=0, char_len;

    setlocale(LC_CTYPE, "en_US.utf8");

    while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
    {
        /* &text[i] contains multibyte character of length char_len */
        if(memcmp(&text[i], "ö", char_len) == 0)
        {
            printf("ö \n");
        }

        i += char_len;
    }

    return 0;
}

There are two ways to represent strings: as multibyte strings (sequences of 8-bit bytes) or as wide strings (the size of a wide character depends on the platform). A multibyte representation has the advantage that it fits in a char * (the usual C string, as in your code), but the disadvantage that a single character may span multiple bytes. A wide string is represented as wchar_t *; its advantage is that one wchar_t is one character (however, as @anatolyg pointed out, this assumption can still break on platforms where wchar_t cannot represent every possible character).

Have you looked at your source file in a hex editor? The string "öçşğü" is actually stored in memory as the multibyte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by the zero terminator. You see 5 characters only because the string is rendered correctly by your UTF-8 aware viewer/browser. That is why strlen(text) returns 10 for it, whereas the code above loops only 5 times.
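You can see the same bytes without a hex editor by dumping the string from C; a minimal sketch (the exact bytes assume the source file is saved as UTF-8):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *text = "öçşğü";

    /* print each byte of the string as hex */
    for (size_t i = 0; i < strlen(text); ++i)
        printf("%02x ", (unsigned char)text[i]);
    printf("\n");   /* expected: c3 b6 c3 a7 c5 9f c4 9f c3 bc */

    return 0;
}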

If you use a wide-character string instead, it can be done as explained by @WillBriggs.
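Alternatively, you can convert the multibyte string to a wide string at run time and compare wide characters directly. A minimal sketch, assuming a UTF-8 locale such as en_US.utf8 is installed:

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    const char *text = "öçşğü";          /* multibyte (UTF-8) string */
    wchar_t wide[32];

    setlocale(LC_CTYPE, "en_US.utf8");   /* assumes this locale exists */

    /* convert the whole multibyte string to wide characters */
    size_t n = mbstowcs(wide, text, sizeof wide / sizeof wide[0]);
    if (n == (size_t)-1)
        return 1;                        /* invalid multibyte sequence */

    for (size_t i = 0; wide[i]; ++i)
        if (wide[i] == L'ö')
            printf("found ö at character %zu\n", i);

    return 0;
}

mbstowcs returns (size_t)-1 if the string is not valid in the current locale, so the conversion also doubles as a validity check.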




Answer 2:


There is no standard way to embed non-ASCII characters directly in your source file.

Instead, the C11 standard lets you write them as Unicode code points (universal character names):

#include <wchar.h>

int main(void)
{
    wchar_t text[] = L"\u00f6\u00e7\u015f\u0131\u011f";

    // Print whole string (%ls takes a wide string)
    wprintf(L"%ls\n", text);

    // Test individual characters
    for (size_t i = 0; text[i]; ++i)
    {
        if (text[i] == L'\u00f6')
        {
            // whatever...
        }
    }

    return 0;
}

If you are on Windows, you face an extra problem: the Windows console can't print Unicode characters by default. You need to do the following:

  • Change the console to use a TrueType monospaced font which includes glyphs for the characters you are trying to print. (I used "DejaVu Sans Mono" for this example)
  • In the source code, call _setmode(1, _O_WTEXT);, which needs #include <io.h> (for _setmode) and #include <fcntl.h> (for _O_WTEXT).

To restore normal text output afterwards, call _setmode(1, _O_TEXT);.

Of course, if you are outputting to a file or to a Win32 API function then you don't need to do those steps.
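Putting those steps together, here is a minimal Windows-only sketch (it assumes the MSVC/UCRT runtime, where _setmode and _fileno come from <io.h> and the _O_* flags from <fcntl.h>):

#include <wchar.h>
#include <io.h>      /* _setmode, _fileno */
#include <fcntl.h>   /* _O_WTEXT, _O_TEXT */

int main(void)
{
    /* switch stdout to wide (UTF-16) text mode */
    _setmode(_fileno(stdout), _O_WTEXT);

    wchar_t text[] = L"\u00f6\u00e7\u015f\u0131\u011f";
    wprintf(L"%ls\n", text);

    /* restore normal (narrow) text mode afterwards */
    _setmode(_fileno(stdout), _O_TEXT);
    return 0;
}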




Answer 3:


See the Wikipedia article on UTF-8 (https://en.wikipedia.org/wiki/UTF-8); in particular, it has a table with the bit patterns.

Here's another way to scan/convert a UTF-8 string into code points [not exact, just an example; refer to the wiki]:

// utf8scan -- convert utf8 to codepoints (example)

char inpbuf[1000];              // UTF-8 input (must be filled in, e.g. with fgets)

typedef union {
    char utf8[4];
    unsigned int code;
} codepoint_t;

codepoint_t outbuf[1000];

// unidecode -- decode utf8 char into codepoint
// RETURNS: updated rhs pointer
char *
unidecode(codepoint_t *lhs,char *rhs)
{
    int idx;
    int chr;

    idx = 0;
    lhs->utf8[idx++] = *rhs++;

    for (;  ;  ++rhs, ++idx) {
        chr = *rhs;

        // end of string
        if (chr == 0)
            break;

        // start of new ascii char
        if ((chr & 0x80) == 0)
            break;

        // start of new unicode char
        if (chr & 0x40)
            break;

        lhs->utf8[idx] = chr;
    }

    return rhs;
}

// main -- main program
int
main(void)
{
    char *rhs;
    codepoint_t *lhs;

    rhs = inpbuf;
    lhs = outbuf;

    for (;  *rhs != 0;  ++lhs) {
        lhs->code = 0;

        // ascii char
        if ((*rhs & 0x80) == 0)
            lhs->utf8[0] = *rhs++;

        // get/skip unicode char
        else
            rhs = unidecode(lhs,rhs);
    }

    // add EOS
    lhs->code = 0;

    return 0;
}
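For contrast, here is a sketch of a decoder that actually computes the numeric code points by following the bit patterns from the Wikipedia table; it assumes well-formed UTF-8 and does no validation:

#include <stdio.h>

/* decode one UTF-8 sequence starting at s into a code point;
   returns a pointer just past the sequence (assumes valid UTF-8) */
static const char *decode_utf8(const char *s, unsigned int *cp)
{
    unsigned char c = (unsigned char)*s++;

    if (c < 0x80) {                      /* 1 byte:  0xxxxxxx */
        *cp = c;
    } else if ((c & 0xE0) == 0xC0) {     /* 2 bytes: 110xxxxx 10xxxxxx */
        *cp = (c & 0x1F) << 6;
        *cp |= (unsigned char)*s++ & 0x3F;
    } else if ((c & 0xF0) == 0xE0) {     /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = (c & 0x0F) << 12;
        *cp |= ((unsigned char)*s++ & 0x3F) << 6;
        *cp |= (unsigned char)*s++ & 0x3F;
    } else {                             /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        *cp = (c & 0x07) << 18;
        *cp |= ((unsigned char)*s++ & 0x3F) << 12;
        *cp |= ((unsigned char)*s++ & 0x3F) << 6;
        *cp |= (unsigned char)*s++ & 0x3F;
    }
    return s;
}

int main(void)
{
    const char *text = "öçşğü";          /* assumes UTF-8 source encoding */
    unsigned int cp;

    while (*text) {
        text = decode_utf8(text, &cp);
        printf("U+%04X\n", cp);          /* ö prints as U+00F6 */
    }
    return 0;
}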



Answer 4:


The best way to handle wide characters is as, well, wide characters.

wchar_t myWord[] = L"Something";

This will do it:

#include <stdio.h>
#include <wchar.h>   /* wprintf */

int main()
{
    wchar_t * text = L"öçşğü";
    int i = 0;

    while (text[i])
    {
        if (text[i] == L'ö')
        {
            wprintf(L"ö \n");
        }

        i++;
    }

    return 0;
}

If you're in Visual Studio, like me, remember that the console window doesn't handle Unicode well. You can redirect the output to a file, examine the file, and see the ö.



Source: https://stackoverflow.com/questions/33737803/how-to-compare-multibyte-characters-in-c
