Question
I'm trying to parse text and find certain characters in it. I use the code below. It works with normal characters like abcdef, but it does not work with öçşğüı. GCC gives compilation warnings. What should I do to make it work with öçşğüı?
Code:
#include <stdio.h>
#include <ctype.h>
#include <string.h>

int main()
{
    char * text = "öçşğü";
    int i = 0;

    text = strdup(text);
    while (text[i])
    {
        if (text[i] == 'ö')
        {
            printf("ö \n");
        }
        i++;
    }
    return 0;
}
Warnings:
warning: multi-character character constant [-Wmultichar]
warning: comparison is always false due to limited range of data type [-Wtype-limits]
I get 10 values when I print each char in the while loop with
printf("%d : %p \n", i, text[i]);
Output:
0 : 0xffffffc3
1 : 0xffffffb6
2 : 0xffffffc3
3 : 0xffffffa7
4 : 0xffffffc5
5 : 0xffffff9f
6 : 0xffffffc4
7 : 0xffffff9f
8 : 0xffffffc3
9 : 0xffffffbc
and strlen is 10.
But if I use abcde:
0 : 0x61
1 : 0x62
2 : 0x63
3 : 0x64
4 : 0x65
and strlen is 5.
If I use wchar_t for text, the output is:
0 : 0xa7c3b6c3
1 : 0x9fc49fc5
2 : 0xbcc3
and strlen is 10, while wcslen is 3.
Answer 1:
To go through each of the characters in the string, you can use mblen. You also need to set the correct locale (matching the encoding of the multi-byte string), so that mblen can correctly parse the multi-byte string.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    char * text = "öçşğü";
    int i = 0, char_len;

    setlocale(LC_CTYPE, "en_US.utf8");
    while ((char_len = mblen(&text[i], MB_CUR_MAX)) > 0)
    {
        /* &text[i] contains a multibyte character of length char_len */
        if (memcmp(&text[i], "ö", char_len) == 0)
        {
            printf("ö \n");
        }
        i += char_len;
    }
    return 0;
}
There are two types of string representation: multi-byte (sequences of 8-bit bytes) and wide (size depends on the platform). The multi-byte representation has the advantage that it can be stored in a char * (a usual C string, as in your code), but the disadvantage that a single character may span multiple bytes. A wide string is represented using wchar_t *. wchar_t has the advantage that one wchar_t is one character (however, as @anatolyg pointed out, this assumption can still go wrong on platforms where wchar_t cannot represent all possible characters).
Have you looked at your source code with a hex editor? The string "öçşğü" is actually stored in memory as the multi-byte sequence c3 b6 c3 a7 c5 9f c4 9f c3 bc (UTF-8 encoding), followed by a terminating zero. You see 5 characters only because the string is rendered correctly by your UTF-8 aware viewer/browser. It is then easy to see why strlen(text) returns 10 for this, whereas the above code loops only 5 times.
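You can see those bytes for yourself with a small dump loop (my own illustration, assuming the source file is saved as UTF-8):

#include <stdio.h>

int main()
{
    const unsigned char *p = (const unsigned char *) "öçşğü";

    /* Print each byte of the string in hex. */
    while (*p)
        printf("%02x ", *p++);       /* c3 b6 c3 a7 c5 9f c4 9f c3 bc */
    printf("\n");
    return 0;
}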
If you use a wide string instead, it can be done as explained by @WillBriggs.
Answer 2:
There is no standard governing how non-ASCII characters embedded directly in your source file are encoded.
Instead, the C11 standard specifies that you can write them as Unicode code points (universal character names):
// needs #include <wchar.h> for wprintf
wchar_t text[] = L"\u00f6\u00e7\u015f\u0131\u011f";

// Print whole string (%ls is the conversion for wide strings)
wprintf(L"%ls\n", text);

// Test individual characters
for (size_t i = 0; text[i]; ++i)
{
    if (text[i] == L'\u00f6')
    {
        // whatever...
    }
}
If you are in Windows then you face an extra problem that the Windows console can't print Unicode characters by default. You need to do the following:
- Change the console to use a TrueType monospaced font which includes glyphs for the characters you are trying to print. (I used "DejaVu Sans Mono" for this example)
- In the source code, call _setmode(1, _O_WTEXT);, which needs #include <fcntl.h> (and #include <io.h> for the _setmode declaration).
To restore normal text afterwards you can call _setmode(1, _O_TEXT);.
Of course, if you are outputting to a file or to a Win32 API function then you don't need to do those steps.
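Putting those steps together, a minimal sketch (my own, assuming an MSVC toolchain, where _setmode is declared in <io.h> and the _O_* constants in <fcntl.h>):

#include <stdio.h>
#include <wchar.h>
#ifdef _WIN32
#include <io.h>      /* _setmode */
#include <fcntl.h>   /* _O_WTEXT, _O_TEXT */
#endif

int main(void)
{
#ifdef _WIN32
    _setmode(_fileno(stdout), _O_WTEXT);   /* switch stdout to wide-character mode */
#endif

    wprintf(L"%ls\n", L"\u00f6\u00e7\u015f\u0131\u011f");

#ifdef _WIN32
    _setmode(_fileno(stdout), _O_TEXT);    /* restore normal (narrow) mode */
#endif
    return 0;
}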
Answer 3:
See the wiki here: https://en.wikipedia.org/wiki/UTF-8 In particular, there is a table with the bit patterns.
Here's another way to scan a UTF-8 string, collecting the bytes of each character [not exact, just an example--refer to the wiki]:
// utf8scan -- group the UTF-8 bytes of each character (example)
// note: "code" holds the raw UTF-8 bytes packed into an unsigned int,
// not a decoded Unicode scalar value

char inpbuf[1000];                      // input: assumed to hold a UTF-8 string
char uni[8];

typedef union {
    char utf8[4];
    unsigned int code;
} codepoint_t;

codepoint_t outbuf[1000];

// unidecode -- copy the bytes of one multi-byte UTF-8 character into lhs
// RETURNS: updated rhs pointer
char *
unidecode(codepoint_t *lhs, char *rhs)
{
    int idx;
    int chr;

    idx = 0;
    lhs->utf8[idx++] = *rhs++;

    for (;  ; ++rhs, ++idx) {
        chr = *rhs;

        // end of string
        if (chr == 0)
            break;

        // start of new ascii char
        if ((chr & 0x80) == 0)
            break;

        // start of new multi-byte char
        if (chr & 0x40)
            break;

        // continuation byte -- keep it
        lhs->utf8[idx] = chr;
    }

    return rhs;
}

// main -- main program
int
main(void)
{
    char *rhs;
    codepoint_t *lhs;

    rhs = inpbuf;
    lhs = outbuf;

    for (;  *rhs != 0;  ++lhs) {
        lhs->code = 0;

        // ascii char
        if ((*rhs & 0x80) == 0)
            lhs->utf8[0] = *rhs++;

        // get/skip multi-byte char
        else
            rhs = unidecode(lhs, rhs);
    }

    // add EOS
    lhs->code = 0;

    return 0;
}
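For comparison (my own addition, not part of the answer), here is a minimal sketch that decodes each UTF-8 sequence into an actual Unicode code point, following the bit patterns from the wiki table; it assumes the source file is saved as UTF-8 and skips validation of continuation bytes:

#include <stdio.h>

/* Decode one UTF-8 sequence starting at s into a code point.
   Returns the number of bytes consumed (0 on an invalid lead byte). */
static int utf8_to_codepoint(const unsigned char *s, unsigned int *cp)
{
    if (s[0] < 0x80) {                      /* 0xxxxxxx */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {            /* 110xxxxx 10xxxxxx */
        *cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
        *cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {            /* 11110xxx + 3 continuation bytes */
        *cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12)
            | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    return 0;   /* invalid lead byte */
}

int main(void)
{
    const unsigned char *p = (const unsigned char *) "öçşğü";
    unsigned int cp;
    int n;

    while (*p && (n = utf8_to_codepoint(p, &cp)) > 0) {
        printf("U+%04X\n", cp);     /* U+00F6, U+00E7, U+015F, U+011F, U+00FC */
        p += n;
    }
    return 0;
}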
Answer 4:
The best way to handle wide characters is as, well, wide characters.
wchar_t myWord[] = L"Something";
This will do it:
#include <stdio.h>
#include <ctype.h>
#include <string.h>
#include <wchar.h>   /* needed for wprintf */

int main()
{
    wchar_t * text = L"öçşğü";
    int i = 0;

    while (text[i])
    {
        if (text[i] == L'ö')
        {
            wprintf(L"ö \n");
        }
        i++;
    }
    return 0;
}
If you're in Visual Studio, like me, remember that the console window doesn't handle Unicode well. You can redirect the output to a file, examine the file, and see the ö.
Source: https://stackoverflow.com/questions/33737803/how-to-compare-multibyte-characters-in-c