How to replace/ignore invalid Unicode/UTF8 characters � from C stdio.h getline()?

旧时难觅i 2021-01-03 08:36

In Python, the open function accepts an errors='ignore' option:

open( '/filepath.txt',


        
3 Answers
  •  轮回少年
    2021-01-03 08:57

    As @rici explains well in his answer, a single byte sequence can contain several invalid UTF-8 sequences.

    iconv(3) could be worth a look, e.g. see https://linux.die.net/man/3/iconv_open. That man page says:

    When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.

    Example

    This byte sequence, if interpreted as UTF-8, contains some invalid UTF-8:

    "some invalid\xFE\xFE\xFF\xFF stuff"
    

    If you display this you would see something like

    some invalid���� stuff
    

    When this string passes through the remove_invalid_utf8 function in the following C program, the invalid UTF-8 bytes are removed using the iconv function mentioned above.

    So the result is then:

    some invalid stuff
    

    C Program

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <iconv.h>
    
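    /* Returns a newly allocated copy of utf8 with the invalid bytes dropped,
       or NULL on failure. The caller must free the result. */
    char *remove_invalid_utf8(char *utf8, size_t len) {
        size_t inbytes_len = len;
        char *inbuf = utf8;
    
        /* Dropping bytes never grows the text, so len bytes plus a NUL suffice. */
        size_t outbytes_len = len;
        char *result = calloc(outbytes_len + 1, sizeof(char));
        if(result == NULL) {
            perror("calloc");
            return NULL;
        }
        char *outbuf = result;
    
        iconv_t cd = iconv_open("UTF-8//IGNORE", "UTF-8");
        if(cd == (iconv_t)-1) {
            perror("iconv_open");
            free(result);
            return NULL;
        }
        /* iconv signals failure with (size_t)-1; output written so far is kept. */
        if(iconv(cd, &inbuf, &inbytes_len, &outbuf, &outbytes_len) == (size_t)-1) {
            perror("iconv");
        }
        iconv_close(cd);
        return result;
    }
    
    int main(void) {
        char *utf8 = "some invalid\xFE\xFE\xFF\xFF stuff";
        char *converted = remove_invalid_utf8(utf8, strlen(utf8));
        if(converted == NULL) {
            return EXIT_FAILURE;
        }
        printf("converted: %s to %s\n", utf8, converted);
        free(converted);
        return 0;
    }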
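
The iconv approach above discards the bad bytes. For the "replace" half of the question, one alternative is a small hand-written validator, sketched below. It is not part of the answer above: the names utf8_valid_len and replace_invalid_utf8 are hypothetical, and the placeholder character is whatever the caller passes in. Each invalid byte is overwritten in place with a single-byte placeholder, so the buffer length never changes.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length (1 to 4) of the valid UTF-8
 * sequence starting at s, or 0 if no valid sequence starts there.
 * Rejects overlong forms, UTF-16 surrogates, and values above U+10FFFF. */
static size_t utf8_valid_len(const unsigned char *s, size_t remaining)
{
    unsigned char b0 = s[0];
    size_t need;
    unsigned long cp, min;

    if (b0 < 0x80) {
        return 1;                                   /* plain ASCII */
    } else if ((b0 & 0xE0) == 0xC0) {
        need = 2; cp = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        need = 3; cp = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        need = 4; cp = b0 & 0x07; min = 0x10000;
    } else {
        return 0;              /* stray continuation byte, or 0xF8..0xFF */
    }

    if (need > remaining) {
        return 0;                                   /* truncated sequence */
    }
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;                               /* not a continuation byte */
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min) return 0;                         /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;     /* surrogate range */
    if (cp > 0x10FFFF) return 0;                    /* beyond Unicode */
    return need;
}

/* Hypothetical helper: overwrites every byte of buf[0..len) that is not
 * part of a valid UTF-8 sequence with repl. Returns the number of bytes
 * replaced. */
size_t replace_invalid_utf8(char *buf, size_t len, char repl)
{
    unsigned char *p = (unsigned char *)buf;
    size_t i = 0, replaced = 0;

    while (i < len) {
        size_t n = utf8_valid_len(p + i, len - i);
        if (n == 0) {
            p[i] = (unsigned char)repl;             /* replace one bad byte */
            replaced++;
            i++;
        } else {
            i += n;                                 /* skip the valid sequence */
        }
    }
    return replaced;
}
```

Since it works in place on a buffer and a length, the same call could be applied to each line filled in by getline(), passing the returned line length as len and, for example, '?' as repl.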
    
