Handling special characters in C (UTF-8 encoding)

天涯浪人 2020-12-07 21:17

I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like

4 answers
  • 2020-12-07 21:55

    First things first:

    1. Read in the buffer
    2. Use libiconv or similar to convert the UTF-8 bytes to wchar_t, then work with the wide-character handling functions such as wprintf()
    3. Use the wide character functions in C! Most file/output handling functions have a wide-character variant

    Ensure that your terminal can handle UTF-8 output. Setting up the correct locale and using the locale data can automate a lot of the file opening and conversion for you, depending on what you are doing.
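
    As a rough sketch of steps 1 and 2 above: this version assumes the standard library's mbsrtowcs() is used in place of libiconv, and that the program's locale matches the encoding of the bytes that were read. The helper name to_wide() is made up for illustration and is not part of any library.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from the original answer): convert a
   NUL-terminated multibyte string to wide characters using the
   current locale. Returns the number of wide characters written
   (excluding the terminator), or (size_t)-1 on failure. */
size_t to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    mbstate_t state;
    const char *p = src;
    size_t n;

    if (dst_len == 0)
        return (size_t)-1;                 /* no room even for the terminator */

    memset(&state, 0, sizeof state);       /* start in the initial conversion state */
    n = mbsrtowcs(dst, &p, dst_len - 1, &state);
    if (n == (size_t)-1)
        return n;                          /* bytes are not valid in this locale */

    dst[n] = L'\0';                        /* terminate even if the output was truncated */
    return n;
}
```

    For non-ASCII input to convert correctly, call setlocale(LC_ALL, "") once at startup so that the conversion follows the user's locale (for example a UTF-8 one). The result can then be passed to wide-character functions such as wprintf().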

    Remember that the width of a code point in UTF-8 is variable (one to four bytes). This means you can't just seek to an arbitrary byte offset and begin reading as you can with ASCII, because you might land in the middle of a code point. Good libraries can recover from this in some cases by moving to the nearest code point boundary.
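
    To illustrate the seeking problem: in UTF-8, every byte after the first byte of a multi-byte sequence has the bit pattern 10xxxxxx, so code can tell when it has landed mid-sequence and skip ahead. This is only a minimal sketch; the function names are invented for this example, and it does not validate the input.

```c
#include <stddef.h>

/* Hypothetical helper: a byte of the form 10xxxxxx is a continuation
   byte, so it can never be the first byte of a code point. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Starting at byte offset pos, skip forward to the first byte that
   begins a code point. Returns len if no such byte is found. */
size_t utf8_next_boundary(const char *buf, size_t len, size_t pos)
{
    while (pos < len && is_continuation((unsigned char)buf[pos]))
        pos++;
    return pos;
}
```

    For the bytes 61 c3 a9 62 ("a", "é", "b"), an offset of 2 points into the middle of "é", and the function moves it to 3, the start of "b".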

    Here is some code (not mine) that demonstrates some usage of UTF-8 file reading and wide character handling in C.

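    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* "ccs=UTF-8" is a non-standard fopen() extension (Microsoft CRT;
           glibc accepts a similar ",ccs=" suffix), so this is not portable. */
        FILE *f = fopen("data.txt", "r, ccs=UTF-8");
        if (!f)
            return 1;

        /* Read one wide character at a time and print its value in hex. */
        for (wint_t c; (c = fgetwc(f)) != WEOF;)
            printf("%04X\n", (unsigned int)c);

        fclose(f);
        return 0;
    }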
    

    Links

    1. libiconv
    2. Locale data in C/GNU libc
    3. Some handy info
    4. Another good Unicode/UTF-8 in C resource
  • 2020-12-07 22:01

    Probably your text file is ISO-8859-1 encoded but your terminal is UTF-8. This kind of mismatch is a standard problem when dealing with byte-oriented text handling; other C programs (such as the standard ‘cat’ and ‘more’ commands) will do the same thing and it isn't generally considered an error or something that needs to be fixed.

    If you want to operate on a Unicode character level instead of bytes, that's fine, but you'll need to use wchar_t as your character type instead of char throughout your program, and provide switches for the user to specify what the incoming file encoding actually is. (Whilst it is sometimes possible to guess, it's not very reliable.)
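
    If the user does tell the program that the input is ISO-8859-1 and the terminal expects UTF-8, the conversion itself is simple, because every ISO-8859-1 byte value is also its Unicode code point (U+0000 to U+00FF). A minimal sketch follows; the function name is invented for this example, and other source encodings would need a real conversion library.

```c
#include <stddef.h>

/* Hypothetical helper: convert n bytes of ISO-8859-1 text to UTF-8.
   out must have room for 2 * n bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, j = 0;

    for (i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[j++] = in[i];                                 /* ASCII is unchanged */
        } else {
            out[j++] = (unsigned char)(0xC0 | (in[i] >> 6));   /* lead byte: 110000xx */
            out[j++] = (unsigned char)(0x80 | (in[i] & 0x3F)); /* continuation: 10xxxxxx */
        }
    }
    return j;
}
```

    For example, the ISO-8859-1 byte e9 ("é") becomes the two UTF-8 bytes c3 a9.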

  • 2020-12-07 22:06

    Make sure you're not accidentally dropping any bytes; some UTF-8 characters are more than one byte in length (that's sort of the point), and you need to keep them all.
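
    One way bytes can get lost (an assumption about the program, since the question's code isn't shown) is reading into a fixed-size buffer so that a chunk ends partway through a multi-byte sequence. The sketch below reports how many bytes at the end of a buffer belong to an incomplete sequence, so they can be carried over to the next read. The function names are made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a UTF-8 lead byte,
   or 0 if the byte is a continuation byte or not a valid lead byte. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Returns the number of trailing bytes in buf that form an incomplete
   UTF-8 sequence (0 if the buffer ends on a code point boundary). */
size_t utf8_incomplete_tail(const char *buf, size_t len)
{
    size_t back = 0;

    /* Walk back over at most three trailing continuation bytes. */
    while (back < len && back < 3 &&
           ((unsigned char)buf[len - 1 - back] & 0xC0) == 0x80)
        back++;

    if (back == len)
        return 0;              /* empty buffer or no lead byte in view */

    /* The lead byte says how long the sequence should be. */
    if (utf8_seq_len((unsigned char)buf[len - 1 - back]) > back + 1)
        return back + 1;       /* sequence is cut off by the end of the buffer */
    return 0;
}
```

    A buffer ending in 61 c3 has one incomplete trailing byte, while one ending in 61 c3 a9 has none.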

    It can be useful to print the contents of the buffer as hex, so you can inspect which bytes are actually read:

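    #include <stdio.h>

    /* Print each byte of the buffer as two hex digits. */
    static void print_buffer(const char *buffer, size_t length)
    {
      size_t i;

      for(i = 0; i < length; i++)
        /* Go through unsigned char first so that bytes >= 0x80 are not
           sign-extended (which would print as e.g. ffffffc3). */
        printf("%02x ", (unsigned int) (unsigned char) buffer[i]);
      putchar('\n');
    }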
    

    You can do this after loading a very short file, containing just a few characters.

    Also make sure the terminal is set to the proper encoding, so it interprets your characters as UTF-8.

  • 2020-12-07 22:11

    I don't know if it will help, but if you're sure that the terminal and the input file use the same encoding, you can try calling setlocale():

    #include <locale.h>
    …
    setlocale(LC_CTYPE, "");
    