How do I sanitize invalid UTF-8 in Perl?

前端未结


My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I


2 Answers

伪装坚强ぢ

2020-12-05 11:14

You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.

To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.

The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.

"\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.

Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
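As a rough sketch of the two most common ways to handle malformed data with the strict encoding: `FB_CROAK` makes `decode` die, while `FB_DEFAULT` substitutes U+FFFD. The input string here is a made-up example, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: ASCII text with a stray 0xFF byte in the middle.
# 0xFF can never appear in well-formed UTF-8.
my $bytes = "abc\xFFdef";

# FB_CROAK: decode() dies on the first malformed sequence.
# Encode may modify its source argument when a CHECK value is set,
# so hand it a copy and trap the exception with eval.
my $copy  = $bytes;
my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
my $strict_failed = defined $chars ? 0 : 1;

# FB_DEFAULT: each malformed sequence becomes U+FFFD instead.
my $repaired = decode('UTF-8', $bytes, Encode::FB_DEFAULT);
```

Which mode to pick depends on whether you would rather reject a bad file outright or pass it along with visible replacement characters.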

Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).
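A minimal sketch of the manual filter, assuming the string has already been decoded to characters. The helper name `strip_rejected` is hypothetical, and the two codepoints listed (U+FFFE and U+FFFF) are only a guess at what the downstream consumer rejects; extend the `tr` list with whatever it actually complains about.

```perl
use strict;
use warnings;

# Hypothetical helper: delete a fixed list of unwanted codepoints
# from a character string (not a byte string).
sub strip_rejected {
    my ($chars) = @_;
    # The /d modifier deletes every listed character that has no replacement.
    $chars =~ tr/\x{FFFE}\x{FFFF}//d;
    return $chars;
}

my $clean = strip_rejected("ok\x{FFFE}ay");
```

Run this after decoding and before re-encoding, so that `tr` sees whole codepoints rather than individual bytes.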

一个人的身影

2020-12-05 11:19
You have a byte string that is supposed to be UTF-8 but contains some invalid sequences.

This replaces each invalid sequence with the default substitution character (U+FFFD).
```
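use Encode qw(decode encode);

# decode() turns octets into a character string; FB_DEFAULT replaces
# each malformed sequence with U+FFFD instead of dying.
my $chars     = decode('UTF-8', $malformed_utf8, Encode::FB_DEFAULT);

# encode() turns the characters back into octets, now well-formed UTF-8.
my $good_utf8 = encode('UTF-8', $chars,          Encode::FB_CROAK);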
```