Why is my Perl program failing with Tie::File and Unicode/UTF-8 encoding?

僤鯓⒐⒋嵵緔 submitted on 2019-12-01 17:46:36

The suggestion I would make depends very much on the actual problem you're trying to solve. Looking at this question in isolation, I would drop most of the encoding / decoding 'magic' and simply work with the raw bytes, since the script doesn't need to know anything about the characters themselves for this task. The script below produces the expected result given the input and output you described.

use v5.014;
use warnings;
use autodie;

use Carp::Always;
use Tie::File;

my $file_in = 'test_in.txt';
my $file_out = 'test_tie.txt';

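# Recent autodie versions treat a failed unlink as fatal, so only delete an existing file.
unlink $file_out if -e $file_out;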

tie my @tied, 'Tie::File', $file_out, recsep => "\x0D\x0A" or die 'tie failed';

open my $fh, '<', $file_in;
while (my $line = <$fh>) {
    chomp $line;
    push @tied, $line;
}
close $fh;

my $i = 0;
say $i++ . ' ' . $_ foreach @tied;

untie @tied;

However, you probably do want to do some processing on that text in the middle, in which case you want decoded characters. As I see it, there are two options:

  1. Encode manually before handing off to the tied array
  2. Figure out what the issue is with Tie::File

Number 2 is probably non-trivial. From a quick scan of the Tie::File source, it looks like it assumes it will always be given bytes. The only part that you can seemingly affect is the binmode at https://metacpan.org/source/TODDR/Tie-File-0.98/lib/Tie/File.pm#L111 - which you are already doing.

Tie::File makes a lot of seek calls, and perldoc has this to say on seek ( http://perldoc.perl.org/functions/seek.html ):

Note the in bytes: even if the filehandle has been set to operate on characters (for example by using the :encoding(utf8) open layer), tell() will return byte offsets, not character offsets (because implementing that would render seek() and tell() rather slow).

So it appears that Tie::File is using character lengths to determine its byte offsets for records. As soon as a record contains a multi-byte character the two counts differ, so a seek can end up in the middle of a UTF-8 byte sequence. This seems a likely cause of your errors.
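To make the mismatch concrete, here is a minimal sketch (not taken from Tie::File itself) comparing the length of a decoded string with the length of its UTF-8 encoding. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative sample text: "é" (U+00E9) needs 2 bytes in UTF-8,
# the smiley (U+263A) needs 3.
my $chars = "caf\x{E9} \x{263A}";       # decoded string
my $bytes = encode('UTF-8', $chars);    # the octets that end up on disk

my $char_len = length $chars;           # counts characters
my $byte_len = length $bytes;           # counts bytes

printf "characters: %d, bytes: %d\n", $char_len, $byte_len;
# prints "characters: 6, bytes: 9"
```

Any offset computed from the character count would fall three bytes short of the real end of this record, which is the kind of drift that could leave a seek inside a multi-byte sequence.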

In general, I stay away from binmode when relying on an external module to read from or write to a file handle. In this case I would write a simple sub that calls Encode::encode('UTF-8', ...) on the data before pushing it onto @tied.

The exception is when the module's documentation clearly states how it behaves with decoded data, or when the source is simple enough for me to verify the behaviour myself.
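The manual-encoding approach (option 1) could look roughly like the sketch below. The helper names push_text and fetch_text and the file name are hypothetical, chosen only for this example. The point is that the tied file gets no encoding layer, so Tie::File only ever sees bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Tie::File;

# Hypothetical scratch file, for illustration only.
my $file_out = 'test_tie_encoded.txt';
unlink $file_out if -e $file_out;

# No :encoding layer or binmode here: the tied array stores raw bytes.
tie my @tied, 'Tie::File', $file_out, recsep => "\x0D\x0A"
    or die "tie failed: $!";

# Hypothetical helper: encode decoded text just before it reaches Tie::File.
sub push_text {
    push @tied, map { encode('UTF-8', $_) } @_;
    return;
}

# Hypothetical helper: decode a record after fetching it.
sub fetch_text {
    my ($index) = @_;
    my $octets = $tied[$index];    # copy, so decode never touches the tied element
    return decode('UTF-8', $octets);
}

push_text("caf\x{E9}", "\x{263A} smile");

my $first  = fetch_text(0);
my $second = fetch_text(1);
printf "record 0: %d characters, record 1: %d characters\n",
    length $first, length $second;

untie @tied;
```

All processing happens on decoded strings on your side of the two helpers, while the record offsets Tie::File tracks stay consistent with what is actually in the file.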
