My Perl program takes some text from a disk file as input, wraps it in some XML, then outputs it to STDOUT. The input is nominally UTF-8, but sometimes has junk inserted. I
You should read the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.
To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.
The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that are listed as legal for interchange by the Unicode standard.
The byte sequence "\xEF\xBF\xBE", when interpreted as UTF-8, decodes to the codepoint U+FFFE. But that's not legal for interchange according to Unicode, so programs that are strict about such things complain.
Instead of using decode_utf8 (which uses the lax utf8 encoding), use decode with the utf-8 encoding. And read the Handling Malformed Data section to see the different ways you can handle or complain about problems.
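As a rough sketch of that advice, the snippet below decodes with the strict encoding and treats any failure as "reject this input". The helper name decode_strict and the sample strings are made up for illustration; FB_CROAK and LEAVE_SRC are the check flags described in the Handling Malformed Data section.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): returns the decoded character
# string, or undef if the bytes are not acceptable to the strict decoder.
sub decode_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying the source string in place.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = eval { decode('UTF-8', $bytes, $check) };
    return $chars;    # undef if decode() died
}

my $ok  = decode_strict("caf\xC3\xA9 au lait");   # well-formed UTF-8
my $bad = decode_strict("caf\xE9 au lait");       # stray Latin-1 byte

print defined $ok  ? "accepted\n" : "rejected\n";
print defined $bad ? "accepted\n" : "rejected\n";
```

Swapping FB_CROAK for one of the other check values (FB_DEFAULT, FB_PERLQQ and so on) changes "reject" into "substitute", which may suit a pipeline that has to keep going.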
Update: It does appear that some versions of Perl don't complain about U+FFFE, even when using the utf-8-strict encoding. This appears to be a bug. You may just have to build a list of codepoints that Sphinx complains about and filter them out manually (e.g. with tr).
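A minimal sketch of the tr approach, applied to the string after it has been decoded to characters. The set of codepoints here (the BMP noncharacters U+FDD0..U+FDEF, U+FFFE and U+FFFF) is only an assumption; the real list would be whatever Sphinx turns out to reject. The helper name strip_nonchars is made up for illustration.

```perl
use strict;
use warnings;
no warnings 'utf8';    # some perls warn when noncharacters are mentioned

# Hypothetical helper: delete an assumed list of unwanted codepoints from
# an already-decoded character string.
sub strip_nonchars {
    my ($text) = @_;
    # The /d modifier deletes every character in the search list.
    $text =~ tr/\x{FDD0}-\x{FDEF}\x{FFFE}\x{FFFF}//d;
    return $text;
}

my $clean = strip_nonchars("abc" . chr(0xFFFE) . "def");
print "$clean\n";
```

Because tr works on characters rather than bytes, this has to run between decode and encode, not on the raw octets.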
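You have a byte string that is mostly UTF-8 but contains some invalid sequences.
This replaces each invalid sequence with the default substitution character (U+FFFD), then re-encodes the result as clean UTF-8.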
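use Encode qw(decode encode);
# decode() returns characters; with FB_DEFAULT, malformed bytes become U+FFFD.
my $chars = decode('UTF-8', $malformed_utf8, Encode::FB_DEFAULT);
# encode() turns the cleaned characters back into UTF-8 octets.
my $good_utf8 = encode('UTF-8', $chars, Encode::FB_CROAK);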