How to replace/ignore invalid Unicode/UTF8 characters � from C stdio.h getline()?

旧时难觅i 2021-01-03 08:36

In Python, the open function accepts an errors='ignore' option:

open( '/filepath.txt',

3 answers
  •  温柔的废话
    2021-01-03 09:00

    You are confusing what you see with what is really going on. The getline function does not do any replacement of characters. [Note 1]

    You are seeing a replacement character (U+FFFD) because your console outputs that character when it is asked to render an invalid UTF-8 code. Most consoles will do that if they are in UTF-8 mode; that is, the current locale is UTF-8.

    Also, saying that a file contains the "characters Føö»BÃ¥r" is at best imprecise. A file does not really contain characters. It contains byte sequences which may be interpreted as characters -- for example, by a console or other user presentation software which renders them into glyphs -- according to some encoding. Different encodings produce different results; in this particular case, you have a file which was created by software using the Windows-1252 encoding (or, roughly equivalently, ISO 8859-15), and you are rendering it on a console using UTF-8.

    What that means is that the data read by getline contains an invalid UTF-8 sequence, but it (probably) does not contain the replacement character code. Based on the character string you present, it contains the byte \xbb, which is a guillemet (») in Windows code page 1252.
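
    One way to check this for yourself is to look at the raw bytes rather than at what the console renders. The following is a minimal sketch, not taken from the question: hex_dump and contains_replacement_char are hypothetical helper names, and the sample bytes mentioned afterwards are made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>   /* snprintf */
#include <string.h>  /* memcmp */

/* Hypothetical helper: returns 1 if the buffer contains the UTF-8 encoding
 * of U+FFFD (the bytes EF BF BD), 0 otherwise. */
int contains_replacement_char(const char *buf, size_t len) {
  static const unsigned char fffd[3] = {0xEF, 0xBF, 0xBD};
  for (size_t i = 0; i + 3 <= len; ++i)
    if (memcmp(buf + i, fffd, 3) == 0) return 1;
  return 0;
}

/* Hypothetical helper: writes the bytes of 'buf' into 'out' as two-digit hex
 * values separated by spaces. Stops early if 'out' is too small.
 * Returns the number of characters written, not counting the NUL. */
size_t hex_dump(const char *buf, size_t len, char *out, size_t outsize) {
  size_t used = 0;
  if (outsize == 0) return 0;
  out[0] = '\0';
  for (size_t i = 0; i < len; ++i) {
    /* Reserve room for a separator, two hex digits and the terminating NUL. */
    if (used + 4 > outsize) break;
    used += (size_t)snprintf(out + used, outsize - used, i ? " %02x" : "%02x",
                             (unsigned)(unsigned char)buf[i]);
  }
  return used;
}
```

    For the three bytes 'a', 0xBB, 'z', hex_dump produces "61 bb 7a" and contains_replacement_char returns 0: the stray 0xBB is in the data, but the encoding of the replacement character is not.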

    Finding all the invalid UTF-8 sequences in a string read by getline (or any other C library function which reads files) requires scanning the string, but not for a particular code sequence. Rather, you need to decode UTF-8 sequences one at a time, looking for the ones which are not valid. That's not a simple task, but the mbtowc function can help (if you have enabled a UTF-8 locale). According to its manpage, mbtowc returns the number of bytes in a valid "multibyte sequence" (which is UTF-8 in a UTF-8 locale), 0 if the sequence is the NUL character, or -1 to indicate an invalid or incomplete sequence. In the scan, you should pass over the bytes of each valid sequence, remove/ignore the single byte starting an invalid sequence, and then continue the scan until you reach the end of the string.
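
    Since mbtowc only decodes UTF-8 when a UTF-8 locale is active, the program has to select one first. The following is a minimal sketch for POSIX systems; enable_utf8_locale is a hypothetical helper name, and nl_langinfo(CODESET) is a POSIX facility rather than part of standard C.

```c
#include <langinfo.h>  /* nl_langinfo, CODESET (POSIX, not standard C) */
#include <locale.h>    /* setlocale */
#include <string.h>    /* strcmp */

/* Hypothetical helper: adopts the character-encoding settings from the
 * environment (LC_ALL, LC_CTYPE, LANG) and reports whether the resulting
 * locale uses UTF-8. Returns 1 for UTF-8, 0 otherwise. */
int enable_utf8_locale(void) {
  if (setlocale(LC_CTYPE, "") == NULL) return 0;
  const char *codeset = nl_langinfo(CODESET);
  /* Assumption: the codeset is spelled "UTF-8" (as on glibc and macOS);
   * "utf8" is accepted too in case another system spells it that way. */
  return strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0;
}
```

    Call it once at program start, before the first mbtowc call. If it returns 0, mbtowc will not be decoding UTF-8 and the scan described above will not behave as intended.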

    Here's some lightly-tested example code (in C):

    #include <stdlib.h>  /* mbtowc */
    #include <string.h>  /* memmove */
    
    /* Removes in place any invalid UTF-8 sequences from at most 'len' characters of the
     * string pointed to by 's'. (If a NUL byte is encountered, conversion stops.)
     * If the length of the converted string is less than 'len', a NUL byte is
     * inserted.
     * Returns the length of the possibly modified string (with a maximum of 'len'),
     * not including the NUL terminator (if any).
     * Requires that a UTF-8 locale be active; since there is no way to test for
     * this condition, no attempt is made to do so. If the current locale is not UTF-8,
     * behaviour is undefined.
     */
    size_t remove_bad_utf8(char* s, size_t len) {
      char* in = s;
      /* Skip over the initial correct sequence. Avoid relying on mbtowc returning
       * zero if n is 0, since Posix is not clear whether mbtowc returns 0 or -1.
       */
      int seqlen;
      while (len && (seqlen = mbtowc(NULL, in, len)) > 0) { len -= seqlen; in += seqlen; }
      char* out = in;
    
      if (len && seqlen < 0) {
        ++in;
        --len;
        /* If we find an invalid sequence, we need to start shifting correct sequences.  */
        for (; len; in += seqlen, len -= seqlen) {
          seqlen = mbtowc(NULL, in, len);
          if (seqlen > 0) {
            /* Shift the valid sequence (if one was found) */
            memmove(out, in, seqlen);
            out += seqlen;
          }
          else if (seqlen < 0) seqlen = 1;
          else /* (seqlen == 0) */ break;
        }
        /* Terminate without advancing 'out', so the NUL is not counted in the returned length. */
        *out = 0;
      }
      return out - s;
    }
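
    If enabling a UTF-8 locale is not an option, the same scan can be done with a hand-written check in place of mbtowc. The following is a minimal sketch of that substitution, not the function above: utf8_seq_len and strip_invalid_utf8 are hypothetical names, and the byte ranges follow the well-formed sequence table in the Unicode standard.

```c
#include <stddef.h>
#include <string.h>  /* memmove */

/* Returns the length (1 to 4) of the well-formed UTF-8 sequence starting at p,
 * or 0 if the (at most n) bytes at p do not begin one. */
static size_t utf8_seq_len(const unsigned char *p, size_t n) {
  if (n == 0) return 0;
  unsigned char b = p[0];
  unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for the second byte */
  size_t need;                         /* number of continuation bytes */
  if (b < 0x80) return 1;              /* ASCII */
  else if (b >= 0xC2 && b <= 0xDF) need = 1;
  else if (b >= 0xE0 && b <= 0xEF) {
    need = 2;
    if (b == 0xE0) lo = 0xA0;          /* reject overlong forms */
    if (b == 0xED) hi = 0x9F;          /* reject UTF-16 surrogates */
  } else if (b >= 0xF0 && b <= 0xF4) {
    need = 3;
    if (b == 0xF0) lo = 0x90;          /* reject overlong forms */
    if (b == 0xF4) hi = 0x8F;          /* reject code points above U+10FFFF */
  } else return 0;                     /* 0x80-0xC1 and 0xF5-0xFF never start a sequence */
  if (n < need + 1) return 0;          /* sequence is cut off by the end of the data */
  if (p[1] < lo || p[1] > hi) return 0;
  for (size_t i = 2; i <= need; ++i)
    if (p[i] < 0x80 || p[i] > 0xBF) return 0;
  return need + 1;
}

/* Removes in place every byte that is not part of a well-formed UTF-8
 * sequence, looking at no more than 'len' bytes and stopping at a NUL.
 * If anything was removed, a NUL is written after the kept bytes.
 * Returns the length of the result, not counting any NUL. */
size_t strip_invalid_utf8(char *s, size_t len) {
  unsigned char *in = (unsigned char *)s;
  unsigned char *out = in;
  while (len && *in) {
    size_t k = utf8_seq_len(in, len);
    if (k == 0) { ++in; --len; continue; }  /* drop one offending byte */
    memmove(out, in, k);                    /* no-op until something was dropped */
    out += k; in += k; len -= k;
  }
  if (out != in) *out = 0;  /* something was removed, so there is room for the NUL */
  return (size_t)(out - (unsigned char *)s);
}
```

    With getline, this could be applied to each line as it is read: if getline(&line, &cap, fp) returns n > 0, calling strip_invalid_utf8(line, (size_t)n) leaves a NUL-terminated line with the offending bytes removed, so the three bytes 'a', 0xBB, 'z' become "az".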
    

    Notes

    1. Aside from the possible line-end transformation done by the underlying I/O library, which replaces CR-LF with a single \n on systems like Windows, where the two-character CR-LF sequence is used as a line-end indication.
