Fixing a file consisting of both UTF-8 and Windows-1252

遇见更好的自我 2020-12-01 19:16

I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp125

3 answers

不知归路 (original poster)

2020-12-01 19:40
Yes!

Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.

A line can contain a mix of encodings

Encoding::FixLatin provides a function named fix_latin which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.
```
$ perl -e'
   use Encoding::FixLatin qw( fix_latin );
   # Stray cp1252 bytes D0 and 92, then the valid UTF-8 sequence D0 92.
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = fix_latin($bytes);
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
```
Heuristics are employed, but they are fairly reliable. Only the following cases will fail:
- One of
  [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
  encoded using iso-8859-1 or cp1252, followed by one of
  [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿]
  encoded using iso-8859-1 or cp1252.
- One of
  [àáâãäåæçèéêëìíîï]
  encoded using iso-8859-1 or cp1252, followed by two of
  [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿]
  encoded using iso-8859-1 or cp1252.
- One of
  [ðñòóôõö÷]
  encoded using iso-8859-1 or cp1252, followed by three of
  [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿]
  encoded using iso-8859-1 or cp1252.
The same result can be produced using the core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.
```
$ perl -e'
   use Encode qw( decode_utf8 encode_utf8 decode );
   $bytes = "\xD0 \x92 \xD0\x92\n";
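   # The callback receives each byte that is not part of valid UTF-8
   # and returns its cp1252 interpretation, encoded as UTF-8.
   $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });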
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
```
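To see why the cases listed above cannot be recovered, it may help to look at the bytes involved. The following sketch uses only the core Encode module, and the variable names are illustrative. It encodes the two-character string "Ð’" as cp1252 and then decodes the resulting bytes as strict UTF-8, which succeeds and yields "В" (U+0412). Both texts produce identical bytes, so no heuristic can tell them apart.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# "Ð" (U+00D0) followed by "’" (U+2019), encoded as cp1252.
my $cp1252_bytes = encode("cp1252", "\x{D0}\x{2019}");

# The same two bytes also form a valid UTF-8 sequence, so strict decoding succeeds.
my $as_utf8 = decode("UTF-8", $cp1252_bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());

printf("%v02X\n", $cp1252_bytes);   # D0.92
printf("U+%v04X\n", $as_utf8);      # U+0412
```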
Each line only uses one encoding

fix_latin works on a character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you could make the process even more reliable by checking whether the line is valid UTF-8.
```
$ perl -e'
   use Encode qw( decode );
   for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
      # Strict UTF-8 decoding dies on invalid input; fall back to cp1252 then.
      if (!eval {
         $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
         1  # No exception
      }) {
         $text = decode("cp1252", $bytes);
      }

      printf("U+%v04X\n", $text);
   }
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A
```
Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:
- The line is encoded using iso-8859-1 or cp1252,
- At least one of
  [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]
  is present in the line,
- All instances of
  [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
  are always followed by exactly one of
  [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of
  [àáâãäåæçèéêëìíîï]
  are always followed by exactly two of
  [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],
- All instances of
  [ðñòóôõö÷]
  are always followed by exactly three of
  [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],
- None of
  [øùúûüýþÿ]
  are present in the line, and
- None of
  [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿]
  are present in the line except where previously mentioned.
Notes:
- Encoding::FixLatin installs a command-line tool, fix_latin, to convert files. It would be trivial to write an equivalent tool using the second approach.
- fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
- The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.
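
A converter based on the second approach might look like the following. This is a minimal sketch, not a tested tool: the helper name decode_line, the script name in the usage note, and the optional fallback-encoding argument are assumptions made for illustration. It tries strict UTF-8 first and decodes the line with the fallback encoding (cp1252 by default) only if that fails.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Decode one line of raw bytes. Strict UTF-8 is tried first; if the line
# is not valid UTF-8, it is decoded with the fallback encoding instead.
sub decode_line {
    my ($bytes, $fallback) = @_;
    $fallback //= "cp1252";    # Default chosen to match the examples above.

    # FB_CROAK makes decode() die on invalid input; LEAVE_SRC keeps $bytes intact.
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined($text) ? $text : decode($fallback, $bytes);
}

# When a file name is given, convert that file to UTF-8 on STDOUT.
if (@ARGV) {
    my ($path, $fallback) = @ARGV;
    open(my $in, "<:raw", $path) or die "Cannot open $path: $!\n";
    binmode(STDOUT);
    while (my $line = <$in>) {
        print encode("UTF-8", decode_line($line, $fallback));
    }
    close($in);
}
```

Assuming the script is saved as fix_lines.pl, it could be run as `perl fix_lines.pl input.txt > output.txt`. A second argument such as iso-8859-15 would select a different single-byte fallback, subject to the reliability caveat in the last note.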