How do I sanitize invalid UTF-8 in Perl?

前端 未结 2 1180
生来不讨喜
生来不讨喜 2020-12-05 10:26

My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I

2条回答
  •  伪装坚强ぢ
    2020-12-05 11:14

    You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.

    To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.

    The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.

    "\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.

    Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.

    Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).

提交回复
热议问题