How to read a Unicode (UTF-8) / binary file line by line

醉酒成梦 2020-12-16 15:56

Hi programmers,

I want to read, line by line, a Unicode (UTF-8) text file created by Notepad. I don't want to display the Unicode string on the screen, I w

6 answers
  • 2020-12-16 16:28

    This article presents an encoding and a decoding routine and explains how Unicode is encoded as UTF-8:

    http://www.codeguru.com/cpp/misc/misc/multi-lingualsupport/article.php/c10451/

    It can easily be adapted to C. Simply encode your ANSI string as UTF-8, or decode the UTF-8 string, and then do a byte-wise compare.

    EDIT: After the OP said that it is too hard to rewrite the function from C++, here is a template:

    What is needed:
    + Free the allocated memory (or wait till the process ends or ignore it)
    + Add the 4 byte functions
    + Tell me that short and int are not guaranteed to be 2 and 4 bytes long (I know, but C is really stupid!) and finally
    + Find some other errors

    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>
    
    #define         MASKBITS                0x3F
    #define         MASKBYTE                0x80
    #define         MASK2BYTES              0xC0
    #define         MASK3BYTES              0xE0
    #define         MASK4BYTES              0xF0
    #define         MASK5BYTES              0xF8
    #define         MASK6BYTES              0xFC
    
    // Not implemented in this template (see the list above)
    char* UTF8Encode4BytesUnicode(const unsigned int* input);
    
    // Encodes a zero-terminated array of 16-bit units as UTF-8.
    // Surrogate pairs are not combined. The caller frees the result.
    char* UTF8Encode2BytesUnicode(unsigned short* input)
    {
       int size = 0,
           cindex = 0;
       while (input[size] != 0)
         size++;
       // Reserve enough space: at most 3 bytes per 16-bit unit, plus the terminator
       char* result = (char*) malloc(size * 3 + 1);
       if (result == NULL)
         return NULL;
       for (int i=0; i<size; i++)
       {
          // 0xxxxxxx
          if(input[i] < 0x80)
          {
             result[cindex++] = ((char) input[i]);
          }
          // 110xxxxx 10xxxxxx
          else if(input[i] < 0x800)
          {
             result[cindex++] = ((char)(MASK2BYTES | (input[i] >> 6)));
             result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
          }
          // 1110xxxx 10xxxxxx 10xxxxxx (everything else that fits in 16 bits)
          else
          {
             result[cindex++] = ((char)(MASK3BYTES | (input[i] >> 12)));
             result[cindex++] = ((char)(MASKBYTE | ((input[i] >> 6) & MASKBITS)));
             result[cindex++] = ((char)(MASKBYTE | (input[i] & MASKBITS)));
          }
       }
       result[cindex] = '\0';
       return result;
    }
    
    // Decodes 1- to 3-byte UTF-8 sequences. Continuation bytes are not
    // validated. The caller frees the result.
    wchar_t* UTF8Decode2BytesUnicode(char* input)
    {
      // Work on unsigned bytes: plain char may be signed, which breaks the tests below
      const unsigned char* in = (const unsigned char*) input;
      int size = (int) strlen(input);
      wchar_t* result = (wchar_t*) malloc((size + 1) * sizeof(wchar_t));
      if (result == NULL)
        return NULL;
      int rindex = 0,
          windex = 0;
      while (rindex < size)
      {
          wchar_t ch;
    
          // 1110xxxx 10xxxxxx 10xxxxxx
          if((in[rindex] & MASK4BYTES) == MASK3BYTES && rindex + 2 < size)
          {
             ch = (wchar_t)(((in[rindex] & 0x0F) << 12)
                   | ((in[rindex+1] & MASKBITS) << 6)
                   | (in[rindex+2] & MASKBITS));
             rindex += 3;
          }
          // 110xxxxx 10xxxxxx
          else if((in[rindex] & MASK3BYTES) == MASK2BYTES && rindex + 1 < size)
          {
             ch = (wchar_t)(((in[rindex] & 0x1F) << 6) | (in[rindex+1] & MASKBITS));
             rindex += 2;
          }
          // 0xxxxxxx
          else if(in[rindex] < MASKBYTE)
          {
             ch = in[rindex];
             rindex += 1;
          }
          // 4-byte sequences, stray or truncated bytes: emit U+FFFD and move on
          else
          {
             ch = 0xFFFD;
             rindex += 1;
          }
    
          result[windex++] = ch;
       }
       result[windex] = 0;
       return result;
    }
    
    char* getUnicodeToUTF8(wchar_t* myString) {
      int size = sizeof(wchar_t);
      if (size == 1)
        return (char*) myString;
      else if (size == 2)
        return UTF8Encode2BytesUnicode((unsigned short*) myString);
      else
        return UTF8Encode4BytesUnicode((const unsigned int*) myString);
    }
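
    The template above calls UTF8Encode4BytesUnicode but leaves it unwritten. Below is a minimal sketch of one way to fill it in, assuming the input is a zero-terminated array of 32-bit code points (which is what wchar_t holds on most non-Windows platforms). The helper name utf8_put is hypothetical, and replacing invalid values with U+FFFD is a choice made for this sketch, not something the answer specifies.

    ```c
    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical helper: writes one code point as UTF-8 into out (which
       must have room for 4 bytes) and returns the number of bytes written. */
    static size_t utf8_put(uint32_t cp, unsigned char *out)
    {
        /* Surrogates and values above U+10FFFF are not valid code points;
           this sketch substitutes U+FFFD for them. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;

        if (cp < 0x80) {                 /* 0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        }
        if (cp < 0x800) {                /* 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {              /* 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }

    /* Encodes a zero-terminated array of code points as UTF-8.
       Returns NULL if allocation fails; the caller frees the result. */
    char *UTF8Encode4BytesUnicode(const unsigned int *input)
    {
        size_t n = 0, o = 0;
        while (input[n] != 0)
            n++;
        unsigned char *result = malloc(n * 4 + 1);  /* worst case: 4 bytes each */
        if (result == NULL)
            return NULL;
        for (size_t i = 0; i < n; i++)
            o += utf8_put(input[i], result + o);
        result[o] = '\0';
        return (char *)result;
    }
    ```

    Allocating four bytes per code point wastes some memory on mostly-ASCII text, but it avoids a second pass to compute the exact size.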
    
  • 2020-12-16 16:34

    I know I am bad... but you don't even take the BOM into consideration! Most examples here will fail.

    EDIT:

    Byte order marks are a few bytes at the beginning of a file that can be used to identify its encoding. Some editors add them, and many times they just break things in fabulous ways (I remember fighting a PHP headers problem for several minutes because of this issue).

    Some RTFM:
    http://en.wikipedia.org/wiki/Byte_order_mark
    http://blogs.msdn.com/oldnewthing/archive/2004/03/24/95235.aspx
    What is XML BOM and how do I detect it?
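
    To make this concrete, here is a minimal sketch that classifies the first bytes of a buffer using the byte sequences listed in the Wikipedia article above. The names bom_kind and detect_bom are hypothetical, and only the UTF-8, UTF-16 and UTF-32 marks are checked.

    ```c
    #include <stddef.h>

    /* Hypothetical names for this sketch. */
    enum bom_kind {
        BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE
    };

    /* Reports which BOM, if any, starts buf. *bom_len receives the number
       of bytes the caller should skip (0 when there is no BOM). */
    enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
    {
        /* UTF-32 is tested first because FF FE 00 00 also begins with the
           UTF-16LE mark FF FE. A UTF-16LE file whose first character is
           U+0000 would be misclassified; this sketch accepts that. */
        if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00) {
            *bom_len = 4;
            return BOM_UTF32LE;
        }
        if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF) {
            *bom_len = 4;
            return BOM_UTF32BE;
        }
        if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
            *bom_len = 3;
            return BOM_UTF8;
        }
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
            *bom_len = 2;
            return BOM_UTF16LE;
        }
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
            *bom_len = 2;
            return BOM_UTF16BE;
        }
        *bom_len = 0;
        return BOM_NONE;
    }
    ```

    A caller would read the first four bytes of the file into a buffer, call detect_bom, and start decoding at buf + bom_len.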

  • 2020-12-16 16:43

    A nice property of UTF-8 is that you do not need to decode in order to compare it. The order returned from strcmp will be the same whether you decode it first or not. So just read it as raw bytes and run strcmp.
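
    A minimal sketch of what that looks like in practice. The function names utf8_order and line_equals are hypothetical. Note that the order you get is code point order, not locale collation, and that equality is byte equality: two strings that differ only in Unicode normalization (precomposed é versus e followed by a combining accent) compare as different.

    ```c
    #include <string.h>

    /* Sign of the byte-wise comparison of two UTF-8 strings: -1, 0 or 1.
       strcmp compares bytes as unsigned char, so the result matches the
       order of the code points the strings encode. */
    int utf8_order(const char *a, const char *b)
    {
        int r = strcmp(a, b);
        return (r > 0) - (r < 0);
    }

    /* Returns 1 if a raw line read from a UTF-8 file has exactly the bytes
       of key, ignoring a trailing "\n" or "\r\n" on the line. */
    int line_equals(const char *line, const char *key)
    {
        size_t len = strcspn(line, "\r\n");   /* length without the terminator */
        return strlen(key) == len && memcmp(line, key, len) == 0;
    }
    ```

    For example, utf8_order("z", "\xC3\xA9") is negative because U+007A sorts before U+00E9, and the same holds for the raw bytes 7A and C3 A9.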

  • 2020-12-16 16:44

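    The C runtime can decode UTF-8 encoded files for you if you use Visual Studio 2005 and up. A stream opened with a ccs= flag delivers wchar_t data, so read it with fgetws() rather than fgets(). Change your code like this: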

    infile = fopen(inname, "r, ccs=UTF-8");
    
  • 2020-12-16 16:46

    Just to settle the BOM argument, here is a hex dump of a file saved from Notepad:

     [paul@paul-es5 tests]$ od -t x1 /mnt/hgfs/cdrive/test.txt
     0000000 ef bb bf 61 0d 0a 62 0d 0a 63
     0000012
    

    with a BOM (ef bb bf) at the start.

    Personally I don't think there should be a BOM (since UTF-8 is a byte format), but that's not the point.

  • 2020-12-16 16:48

    I found a solution to my problem, and I would like to share it with anyone interested in reading a UTF-8 file in C99.

    #include <stdio.h>
    #include <string.h>
    
    void ReadUTF8(FILE* fp)
    {
        unsigned char iobuf[255] = {0};
        while( fgets((char*)iobuf, sizeof(iobuf), fp) )
        {
                // strip the line terminator ("\n", or "\r\n" if the CR was not translated)
                iobuf[strcspn((char *)iobuf, "\r\n")] = 0;
                size_t len = strlen((char *)iobuf);
                printf("(%zu) \"%s\"  ", len, iobuf);
                // "Yes" marks an empty line
                if( len == 0 )
                    printf("Yes\n");
                else
                    printf("No\n");
        }
    }
    
    void ReadUTF16BE(FILE* fp)
    {
        (void)fp; // not implemented
    }
    
    void ReadUTF16LE(FILE* fp)
    {
        (void)fp; // not implemented
    }
    
    int main(void)
    {
        FILE* fp = fopen("test_utf8.txt", "r");
        if( fp == NULL )
            return 1;
    
        // see http://en.wikipedia.org/wiki/Byte-order_mark for an explanation of the BOM
        // encoding
        unsigned char b[3] = {0};
        size_t n = fread(b, 1, 2, fp);
        if( n == 2 && b[0] == 0xEF && b[1] == 0xBB
            && fread(b + 2, 1, 1, fp) == 1 && b[2] == 0xBF )
        {
            // UTF-8 BOM (EF BB BF) consumed
            ReadUTF8(fp);
        }
        else if( n == 2 && b[0] == 0xFE && b[1] == 0xFF )
        {
            ReadUTF16BE(fp);
        }
        else if( n == 2 && b[0] == 0xFF && b[1] == 0xFE )
        {
            // the UTF-16LE BOM is FF FE (FF FE 00 00 would be UTF-32LE)
            ReadUTF16LE(fp);
        }
        else
        {
            // we don't know what kind of file it is, so assume it's standard
            // ASCII or UTF-8 with no BOM
            rewind(fp);
            ReadUTF8(fp);
        }
    
        fclose(fp);
        return 0;
    }
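
    The points raised in this thread (skip the BOM, then compare raw bytes) can be combined into a small helper. This is only a sketch under stated assumptions: the names skip_utf8_bom and count_matching_lines are hypothetical, the stream is assumed to be positioned at the start of the file, and lines longer than the buffer are not handled.

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Skips a UTF-8 BOM (EF BB BF) if the stream starts with one, otherwise
       rewinds to the start. Assumes fp is positioned at the beginning of
       the file. Returns 1 if a BOM was skipped, 0 if not. */
    int skip_utf8_bom(FILE *fp)
    {
        unsigned char b[3];
        size_t n = fread(b, 1, sizeof b, fp);
        if (n == 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return 1;
        rewind(fp);
        return 0;
    }

    /* Counts the lines whose bytes equal key (a UTF-8 string), ignoring the
       line terminator. Limitation of this sketch: a line longer than the
       buffer is read in several pieces and each piece is compared on its own. */
    long count_matching_lines(FILE *fp, const char *key)
    {
        char line[1024];
        long hits = 0;

        skip_utf8_bom(fp);
        while (fgets(line, sizeof line, fp) != NULL) {
            line[strcspn(line, "\r\n")] = '\0';   /* drop "\n" or "\r\n" */
            if (strcmp(line, key) == 0)
                hits++;
        }
        return hits;
    }
    ```

    Stripping both "\r" and "\n" keeps the comparison working whether or not the C library translated the CRLF line endings that Notepad writes.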
    