I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp1252.
Recently I came across files with a severe mix of UTF-8, CP1252, and text that had been encoded as UTF-8, then interpreted as CP1252, then encoded as UTF-8 again, then interpreted as CP1252 again, and so forth.
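To make that kind of damage concrete, here is a minimal sketch of a single round of it, using the core Encode module. The helper names `mangle` and `codepoints` are only illustrative and are not part of the code in this answer.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One round of damage: encode as UTF-8, then wrongly decode the bytes as CP1252.
# (mangle is a hypothetical helper name, used only for this illustration.)
sub mangle {
    my ($text) = @_;
    return decode('cp1252', encode('UTF-8', $text));
}

# Render a string as a list of code points so the output stays ASCII-only.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}

my $original = "\x{c0}";           # U+00C0, whose UTF-8 bytes are C3 80
my $once     = mangle($original);  # byte 80 comes back as U+20AC (euro sign)
my $twice    = mangle($once);      # a second round inflates the text further

print codepoints($once),  "\n";    # prints: U+00C3 U+20AC
print codepoints($twice), "\n";
```

Note that byte 0x80 does not survive as U+0080 but turns into U+20AC, which is why a repair has to recognise those Unicode characters as stand-ins for the original CP1252 bytes.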
I wrote the code below, which worked well for me. It looks for typical UTF-8 byte sequences, even when some of the bytes appear not as raw bytes but as the Unicode characters that the equivalent CP1252 bytes decode to (for example U+20AC in place of byte 0x80).
my %cp1252Encoding = (
# replacing the unicode code with the original CP1252 code
# see e.g. http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
"\x{20ac}" => "\x80",
"\x{201a}" => "\x82",
"\x{0192}" => "\x83",
"\x{201e}" => "\x84",
"\x{2026}" => "\x85",
"\x{2020}" => "\x86",
"\x{2021}" => "\x87",
"\x{02c6}" => "\x88",
"\x{2030}" => "\x89",
"\x{0160}" => "\x8a",
"\x{2039}" => "\x8b",
"\x{0152}" => "\x8c",
"\x{017d}" => "\x8e",
"\x{2018}" => "\x91",
"\x{2019}" => "\x92",
"\x{201c}" => "\x93",
"\x{201d}" => "\x94",
"\x{2022}" => "\x95",
"\x{2013}" => "\x96",
"\x{2014}" => "\x97",
"\x{02dc}" => "\x98",
"\x{2122}" => "\x99",
"\x{0161}" => "\x9a",
"\x{203a}" => "\x9b",
"\x{0153}" => "\x9c",
"\x{017e}" => "\x9e",
"\x{0178}" => "\x9f",
);
my $re = join "|", keys %cp1252Encoding;
$re = qr/$re/;
my %cp1252Decoding = reverse %cp1252Encoding;
my $cp1252Characters = join "|", keys %cp1252Decoding;
sub decodeUtf8
{
    my ($str) = @_;
    $str =~ s/$re/ $cp1252Encoding{$&} /eg;
    utf8::decode($str);
    return $str;
}
sub fixString
{
    my ($str) = @_;
    my $r = qr/[\x80-\xBF]|$re/;
    my $current;
    do {
        $current = $str;
        # If this matches, the string is likely double-encoded UTF-8. Try to decode
        $str =~ s/[\xF0-\xF7]$r$r$r|[\xE0-\xEF]$r$r|[\xC0-\xDF]$r/ decodeUtf8($&) /eg;
    } while ($str ne $current);
    # decodes any possible left-over cp1252 codes to Unicode
    $str =~ s/$cp1252Characters/ $cp1252Decoding{$&} /eg;
    return $str;
}
This has similar limitations to ikegami's answer, except that the same limitations also apply to UTF-8 encoded strings.
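As I understand it, the limitation is ambiguity: text that was genuinely meant to contain characters that look like mis-decoded UTF-8 cannot be told apart from real damage. The sketch below illustrates this with the Encode module alone. It does not call the code above, and the sample string is made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical input: the author really meant the two characters U+00C3 U+00A9.
my $intended = "\x{c3}\x{a9}";

# A repair step treats those characters as CP1252 bytes (C3 A9) and then
# decodes the bytes as UTF-8, which collapses them into a single character.
my $repaired = decode('UTF-8', encode('cp1252', $intended));

printf "%d character(s), first is U+%04X\n", length($repaired), ord $repaired;
# prints: 1 character(s), first is U+00E9
```

Nothing in the string itself says whether the original two characters or the single U+00E9 was intended, so any repair of this kind has to guess.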