How to replace/ignore invalid Unicode/UTF8 characters � from C stdio.h getline()?

前端未结

关注

 3  1002

旧时难觅i 2021-01-03 08:36

On Python, there is this option errors=\'ignore\' for the open Python function:

open( \'/filepath.txt\',


      
      
        
          3条回答        

        
                    
            
            
                         
                
              
              
                
                   温柔的废话
                                             
                
                
                (楼主)
            
              
              
                2021-01-03 09:00
              

            
            
                        
You are confusing what you see with what is really going on. The getline function does not do any replacement of characters. [Note 1]
You are seeing a replacement character (U+FFFD) because your console outputs that character when it is asked to render an invalid UTF-8 code. Most consoles will do that if they are in UTF-8 mode; that is, the current locale is UTF-8.
Also, saying that a file contains the "characters FÃ¸Ã¶»BÃ¥r" is at best imprecise. A file does not really contain characters. It contains byte sequences which may be interpreted as characters -- for example, by a console or other user presentation software which renders them into glyphs -- according to some encoding. Different encodings produce different results; in this particular case, you have a file which was created by software using the Windows-1252 encoding (or, roughly equivalently, ISO 8859-15), and you are rendering it on a console using UTF-8.
What that means is that the data read by getline contains an invalid UTF-8 sequence, but it (probably) does not contain the replacement character code. Based on the character string you present, it contains the hex character \xbb, which is a guillemot (») in Windows code page 1252.
Finding all the invalid UTF-8 sequences in a string read by getline (or any other C library function which reads files) requires scanning the string, but not for a particular code sequence. Rather, you need to decode UTF-8 sequences one at a time, looking for the ones which are not valid. That's not a simple task, but the mbtowc function can help (if you have enabled a UTF-8 locale). As you'll see in the linked manpage, mbtowc returns the number of bytes contained in a valid "multibyte sequence" (which is UTF-8 in a UTF-8 locale), or -1 to indicate an invalid or incomplete sequence. In the scan, you should pass through the bytes in a valid sequence, or remove/ignore the single byte starting an invalid sequence, and then continue the scan until you reach the end of the string.
Here's some lightly-tested example code (in C):
#include 
#include 

/* Removes in place any invalid UTF-8 sequences from at most 'len' characters of the
 * string pointed to by 's'. (If a NUL byte is encountered, conversion stops.)
 * If the length of the converted string is less than 'len', a NUL byte is
 * inserted.
 * Returns the length of the possibly modified string (with a maximum of 'len'),
 * not including the NUL terminator (if any).
 * Requires that a UTF-8 locale be active; since there is no way to test for
 * this condition, no attempt is made to do so. If the current locale is not UTF-8,
 * behaviour is undefined.
 */
size_t remove_bad_utf8(char* s, size_t len) {
  char* in = s;
  /* Skip over the initial correct sequence. Avoid relying on mbtowc returning
   * zero if n is 0, since Posix is not clear whether mbtowc returns 0 or -1.
   */
  int seqlen;
  while (len && (seqlen = mbtowc(NULL, in, len)) > 0) { len -= seqlen; in += seqlen; }
  char* out = in;

  if (len && seqlen < 0) {
    ++in;
    --len;
    /* If we find an invalid sequence, we need to start shifting correct sequences.  */
    for (; len; in += seqlen, len -= seqlen) {
      seqlen = mbtowc(NULL, in, len);
      if (seqlen > 0) {
        /* Shift the valid sequence (if one was found) */
        memmove(out, in, seqlen);
        out += seqlen;
      }
      else if (seqlen < 0) seqlen = 1;
      else /* (seqlen == 0) */ break;
    }
    *out++ = 0;
  }
  return out - s;
}


Notes

Aside from the possible line-end transformation of the underlying I/O library, which will replace CR-LF with a single \n on systems like Windows where the two character CR-LF sequence is used as a line-end indication.

    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它3个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复