How to replace/ignore invalid Unicode/UTF8 characters � from C stdio.h getline()?

旧时难觅i 2021-01-03 08:36

In Python, the built-in open function accepts the option errors='ignore':

open( '/filepath.txt',


      
      
        
3 answers

轮回少年 (original poster) 2021-01-03 08:57

As @rici well explains in his answer, there can be several invalid UTF-8 sequences in a byte sequence.

Possibly iconv(3) could be worth a look, e.g. see https://linux.die.net/man/3/iconv_open.


  When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.
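
The "//IGNORE" suffix is an extension offered by some iconv implementations rather than part of the POSIX interface, so it may not be available everywhere. A minimal sketch of an alternative, assuming a plain UTF-8 to UTF-8 converter that reports invalid input with EILSEQ (as the iconv(3) man page describes), is to skip one byte each time that error occurs. The helper name strip_invalid_utf8 is hypothetical and not part of the original answer.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated, NUL-terminated copy of
 * in[0..len) with invalid UTF-8 bytes dropped, or NULL on failure.
 * Assumes the platform's UTF-8 to UTF-8 conversion validates its input. */
char *strip_invalid_utf8(const char *in, size_t len) {
    iconv_t cd = iconv_open("UTF-8", "UTF-8");   /* no //IGNORE suffix */
    if (cd == (iconv_t)-1)
        return NULL;

    char *out = malloc(len + 1);   /* output is never longer than input */
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;        /* iconv takes char **, input is only read */
    size_t src_left = len;
    char *dst = out;
    size_t dst_left = len;

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            break;                 /* remaining input converted cleanly */
        if (errno == EILSEQ) {     /* src points at the invalid byte: skip it */
            src++;
            src_left--;
        } else {
            break;                 /* EINVAL (truncated tail) or E2BIG: stop */
        }
    }
    *dst = '\0';
    iconv_close(cd);
    return out;
}
```

Under these assumptions, passing the byte sequence "some invalid\xFE\xFE\xFF\xFF stuff" together with its length would be expected to give back "some invalid stuff". The caller is responsible for freeing the returned buffer.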


Example

This byte sequence, if interpreted as UTF-8, contains some invalid UTF-8:

"some invalid\xFE\xFE\xFF\xFF stuff"


If you display this, you would see something like:

some invalid���� stuff


When this string passes through the remove_invalid_utf8 function in the following C program, the invalid UTF-8 bytes are removed using the iconv function mentioned above.

So the result is then:

some invalid stuff


C Program

#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

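/* Returns a newly allocated copy of utf8 with invalid UTF-8 bytes removed,
 * or NULL if the converter or the output buffer could not be created. */
char *remove_invalid_utf8(char *utf8, size_t len) {
    size_t inbytes_len = len;
    char *inbuf = utf8;

    /* UTF-8 to UTF-8 never grows, so len bytes plus a terminator suffice. */
    size_t outbytes_len = len;
    char *result = calloc(outbytes_len + 1, sizeof(char));
    if (result == NULL) {
        perror("calloc");
        return NULL;
    }
    char *outbuf = result;

    iconv_t cd = iconv_open("UTF-8//IGNORE", "UTF-8");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        free(result);
        return NULL;
    }
    /* iconv signals failure with (size_t)-1. The partial output is still
     * returned, since some implementations may report an error here even
     * though //IGNORE has discarded the offending bytes. */
    if (iconv(cd, &inbuf, &inbytes_len, &outbuf, &outbytes_len) == (size_t)-1) {
        perror("iconv");
    }
    iconv_close(cd);
    return result;
}

int main(void) {
    char *utf8 = "some invalid\xFE\xFE\xFF\xFF stuff";
    char *converted = remove_invalid_utf8(utf8, strlen(utf8));
    if (converted == NULL) {
        return 1;
    }
    printf("converted: %s to %s\n", utf8, converted);
    free(converted);
    return 0;
}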
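
Since the question is about lines read with getline(), the following is a rough sketch of how such a filter could be applied line by line. As an assumption, it uses a small hand-written validity check instead of iconv, so it needs nothing beyond the C library. The names utf8_valid_len, drop_invalid_utf8 and filter_stream are hypothetical. Invalid bytes are dropped one at a time, which also covers overlong encodings, surrogates and truncated sequences.

```c
#define _POSIX_C_SOURCE 200809L   /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Length (1-4) of a well-formed UTF-8 sequence at p, or 0 if the n bytes
 * available at p do not start with one. */
static size_t utf8_valid_len(const unsigned char *p, size_t n) {
    size_t need;
    unsigned long cp, min;

    if (n == 0) return 0;
    if (p[0] < 0x80) return 1;
    else if ((p[0] & 0xE0) == 0xC0) { need = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 4; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;                       /* stray continuation or 0xF8..0xFF */

    if (n < need) return 0;              /* truncated sequence */
    for (size_t i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, values past U+10FFFF and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return need;
}

/* Removes invalid UTF-8 bytes from s[0..len) in place and NUL-terminates
 * the result. The buffer must have room for len + 1 bytes, which holds for
 * buffers filled by getline(). Returns the new length. */
size_t drop_invalid_utf8(char *s, size_t len) {
    unsigned char *buf = (unsigned char *)s;
    size_t r = 0, w = 0;

    while (r < len) {
        size_t k = utf8_valid_len(buf + r, len - r);
        if (k == 0) { r++; continue; }   /* skip one invalid byte */
        memmove(buf + w, buf + r, k);
        w += k;
        r += k;
    }
    buf[w] = '\0';
    return w;
}

/* Copies in to out line by line, dropping invalid UTF-8 bytes. */
void filter_stream(FILE *in, FILE *out) {
    char *line = NULL;
    size_t cap = 0;
    ssize_t n;

    while ((n = getline(&line, &cap, in)) != -1) {
        size_t kept = drop_invalid_utf8(line, (size_t)n);
        fwrite(line, 1, kept, out);
    }
    free(line);
}
```

For example, filter_stream(stdin, stdout) would copy standard input to standard output without the invalid bytes. Because the length returned by getline() is used rather than strlen(), lines containing embedded NUL bytes are still processed in full.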

    
             
                                                        
            
            
              
                