How do I sanitize invalid UTF-8 in Perl?

生来不讨喜 2020-12-05 10:26

My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I

2 answers
  • 2020-12-05 11:14

    You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.

    To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.

    The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.
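
    As a quick check, the two encodings can be told apart by the canonical name that Encode resolves for each alias. This is only a minimal sketch; the variable names are illustrative.

    ```perl
    use strict;
    use warnings;
    use Encode qw(find_encoding);

    # Encode resolves aliases case-insensitively; the hyphen is what
    # selects the strict encoding.
    my $strict_name = find_encoding('UTF-8')->name;   # 'utf-8-strict'
    my $lax_name    = find_encoding('utf8')->name;    # 'utf8'

    print "UTF-8 => $strict_name\n";
    print "utf8  => $lax_name\n";
    ```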

    "\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.

    Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
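
    A rough sketch of what that might look like, assuming the input has already been read as raw bytes. The sample string `$bytes` is made up for illustration, and `\xFF` stands in for the junk because that byte can never appear in well-formed UTF-8. The first call croaks on malformed data; the second substitutes U+FFFD instead.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Hypothetical input: ASCII text with one stray byte in the middle.
    my $bytes = "abc\xFFdef";

    # Strict decode that dies on malformed input. LEAVE_SRC stops decode()
    # from modifying $bytes in place while it works.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    my $croaked = defined $text ? 0 : 1;

    # Lenient alternative: FB_DEFAULT replaces each malformed sequence
    # with U+FFFD (REPLACEMENT CHARACTER).
    my $repaired = decode('UTF-8', $bytes, Encode::FB_DEFAULT());
    ```

    Which check value fits depends on whether the program should reject bad input outright or pass a repaired version downstream.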

    Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).
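
    A minimal sketch of the manual filter, applied to an already-decoded character string. The set of codepoints here (U+FFFE, U+FFFF and U+FDD0 through U+FDEF) is only an assumed starting point; the real list would come from whatever Sphinx actually rejects.

    ```perl
    use strict;
    use warnings;

    # Hypothetical decoded text containing two noncharacters.
    my $text = "a\x{FFFE}b\x{FDD0}c";

    # Copy the string, then delete every listed codepoint from the copy.
    (my $clean = $text) =~ tr/\x{FFFE}\x{FFFF}\x{FDD0}-\x{FDEF}//d;
    ```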

  • 2020-12-05 11:19

    You have a utf8 string containing some invalid utf8...

    This replaces each malformed sequence with the default substitution character, U+FFFD (REPLACEMENT CHARACTER).

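    use Encode qw(decode encode);

    # decode() returns a character string; malformed bytes become U+FFFD.
    my $chars     = decode('UTF-8', $malformed_utf8, Encode::FB_DEFAULT);

    # encode() turns those characters back into valid UTF-8 octets.
    my $good_utf8 = encode('UTF-8', $chars,          Encode::FB_CROAK);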
    