Fixing a file consisting of both UTF-8 and Windows-1252

遇见更好的自我 2020-12-01 19:16

I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp125

3 answers
  •  不知归路
    2020-12-01 19:40

    Yes!

    Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.

    A line can contain a mix of encodings

    Encoding::FixLatin provides a function named fix_latin which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.

    $ perl -e'
       use Encoding::FixLatin qw( fix_latin );
       $bytes = "\xD0 \x92 \xD0\x92\n";
       $text = fix_latin($bytes);
       printf("U+%v04X\n", $text);
    '
    U+00D0.0020.2019.0020.0412.000A
    

    Heuristics are employed, but they are fairly reliable. Only the following cases will fail:

    • One of
      [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
      encoded using iso-8859-1 or cp1252, followed by one of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿]
      encoded using iso-8859-1 or cp1252.

    • One of
      [àáâãäåæçèéêëìíîï]
      encoded using iso-8859-1 or cp1252, followed by two of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿]
      encoded using iso-8859-1 or cp1252.

    • One of
      [ðñòóôõö÷]
      encoded using iso-8859-1 or cp1252, followed by three of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿]
      encoded using iso-8859-1 or cp1252.
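
    To make the failure cases concrete, here is a minimal sketch of this kind of heuristic using only the core module Encode. It is not the actual implementation of Encoding::FixLatin; the names $utf8_char and decode_mixed are hypothetical, and the pattern accepts only strictly well-formed UTF-8, so edge cases may differ from fix_latin. Runs of valid UTF-8 are decoded as UTF-8, and any other byte is assumed to be a single cp1252 character.

    ```perl
    use strict;
    use warnings;
    use Encode qw( decode );

    # One well-formed UTF-8 sequence (byte ranges as given in the Unicode standard).
    my $utf8_char = qr{
          [\x00-\x7F]
        | [\xC2-\xDF] [\x80-\xBF]
        | \xE0 [\xA0-\xBF] [\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]{2}
        | \xED [\x80-\x9F] [\x80-\xBF]
        | \xF0 [\x90-\xBF] [\x80-\xBF]{2}
        | [\xF1-\xF3] [\x80-\xBF]{3}
        | \xF4 [\x80-\x8F] [\x80-\xBF]{2}
    }x;

    # Hypothetical helper: decode runs of valid UTF-8 as UTF-8 and treat
    # every other byte as one cp1252 character.
    sub decode_mixed {
        my ($bytes) = @_;
        my $text = '';
        while ($bytes =~ / \G (?: ( (?:$utf8_char)+ ) | (.) ) /xgs) {
            my ($utf8, $other) = ($1, $2);
            $text .= defined($utf8) ? decode('UTF-8', $utf8)
                                    : decode('cp1252', $other);
        }
        return $text;
    }

    # Same input as the fix_latin example: prints U+00D0.0020.2019.0020.0412.000A
    printf("U+%v04X\n", decode_mixed("\xD0 \x92 \xD0\x92\n"));

    # Failure case: cp1252 "Ã©" (C3 A9) is also valid UTF-8 for "é": prints U+00E9
    printf("U+%v04X\n", decode_mixed("\xC3\xA9"));
    ```

    The second call shows why the listed cases cannot be detected: the bytes are identical in both interpretations, so the sketch (like any such heuristic) has to pick the UTF-8 reading.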

    The same result can be produced using the core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.

    $ perl -e'
       use Encode qw( decode_utf8 encode_utf8 decode );
       $bytes = "\xD0 \x92 \xD0\x92\n";
       $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
       printf("U+%v04X\n", $text);
    '
    U+00D0.0020.2019.0020.0412.000A
    

    Each line only uses one encoding

    fix_latin works at the character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you can make the process even more reliable by checking whether the whole line is valid UTF-8 and falling back to cp1252 only when it is not.

    $ perl -e'
       use Encode qw( decode );
       for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
          if (!eval {
             $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
             1  # No exception
          }) {
             $text = decode("cp1252", $bytes);
          }
    
          printf("U+%v04X\n", $text);
       }
    '
    U+00D0.0020.2019.0020.00D0.2019.000A
    U+0412.000A
    

    Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:

    • The line is encoded using iso-8859-1 or cp1252,

    • At least one of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]
      is present in the line,

    • All instances of
      [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
      are always followed by exactly one of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],

    • All instances of
      [àáâãäåæçèéêëìíîï]
      are always followed by exactly two of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],

    • All instances of
      [ðñòóôõö÷]
      are always followed by exactly three of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿],

    • None of
      [øùúûüýþÿ]
      are present in the line, and

    • None of
      [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿]
      are present in the line except where previously mentioned.
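
    As an illustration of these conditions, the sketch below wraps the per-line check in a hypothetical helper, decode_line, and feeds it two cp1252 lines. The first satisfies every condition in the list and is therefore misread as UTF-8; the second contains "ü", which breaks one of the conditions, so it is decoded correctly.

    ```perl
    use strict;
    use warnings;
    use Encode qw( decode );

    # Hypothetical helper: try strict UTF-8 first, fall back to cp1252.
    sub decode_line {
        my ($bytes) = @_;
        my $text = eval {
            decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        return defined($text) ? $text : decode('cp1252', $bytes);
    }

    # cp1252 "Ã©" is the byte pair C3 A9, which is also valid UTF-8 for "é".
    # Prints U+00E9.000A (misread as UTF-8).
    printf("U+%v04X\n", decode_line("\xC3\xA9\n"));

    # Adding "ü" (FC) makes the line invalid UTF-8, so cp1252 is used.
    # Prints U+00C3.00A9.0020.00FC.000A
    printf("U+%v04X\n", decode_line("\xC3\xA9 \xFC\n"));
    ```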


    Notes:

    • Encoding::FixLatin installs a command-line tool, fix_latin, to convert files; it would be trivial to write an equivalent tool using the second approach.
    • fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
    • The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.
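
    A command-line filter based on the second approach might look like the sketch below. The name convert_lines and its $fallback parameter are hypothetical; the fallback defaults to cp1252, and passing another single-byte encoding known to Encode is how the last note above would be applied.

    ```perl
    use strict;
    use warnings;
    use Encode qw( decode encode );

    # Hypothetical filter: read raw lines from $in and write them to $out as
    # UTF-8. Lines that are not valid UTF-8 are assumed to be in $fallback.
    sub convert_lines {
        my ($in, $out, $fallback) = @_;
        $fallback = 'cp1252' if !defined($fallback);
        binmode($in);    # read bytes, not characters
        binmode($out);   # write the bytes produced by encode()
        while (defined(my $bytes = <$in>)) {
            my $text = eval {
                decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
            };
            $text = decode($fallback, $bytes) if !defined($text);
            print {$out} encode('UTF-8', $text);
        }
        return;
    }

    convert_lines(\*STDIN, \*STDOUT);
    ```

    Saved under a name of your choosing (fix_lines.pl here is only an example), it would be run as: perl fix_lines.pl <mixed.txt >fixed.txt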
