Why am I getting the last octet repeated when my Perl program outputs a UTF-8 encoded string in cmd.exe?

广开言路 2020-12-05 10:43

Update

As @ikegami suggested, I reported this as a bug.

Bug #121783 for perl5: Windows: UTF-8 encoded output in cmd.exe with code page 65001 causes unexpec

1 answer
  • 2020-12-05 11:22

    The following program produces the correct output:

    use utf8;
    use strict;
    use warnings;
    use warnings qw(FATAL utf8);
    
    binmode(STDOUT, ":unix:encoding(utf8):crlf");
    
    print 'αβγxyz', "\n";
    

    Output:

    C:\…> chcp 65001
    Active code page: 65001
    C:\…> perl pttt.pl
    αβγxyz

    This seems to indicate that there is some funkiness with the :crlf layer. I do not understand the internals well enough to comment intelligently on this at this point.
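Since the symptom in the question is a repeated last octet, it can help to know exactly which octets the string should produce before any console rendering is involved. The following is a minimal sketch using the core Encode module; the helper name utf8_octets_hex is hypothetical and not part of the original program.

```perl
use utf8;
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 octets of a character string
# as space-separated upper-case hex pairs.
sub utf8_octets_hex {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The test string from the answer, including the trailing newline.
print utf8_octets_hex("αβγxyz\n"), "\n";
# prints: CE B1 CE B2 CE B3 78 79 7A 0A
```

With a :crlf layer on top, the final 0A would be written as 0D 0A. Anything beyond that in the console output is presumably coming from the layer stack or the console rather than from the encoding step itself.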

    After many experiments, I have come to the conclusion that, if the console is already set to code page 65001, binmode(STDOUT, ":unix:encoding(utf8):crlf"); will "work". However, note the following:

    # Dump is assumed to come from a YAML module (e.g. "use YAML;") loaded earlier.
    binmode(STDOUT, ":unix:encoding(utf8):crlf");
    print Dump [
        map {
            my $x = defined($_) ? $_ : '';
            $x =~ s/\A([0-9]+)\z/sprintf '0x%08x', $1/eg;
            $x;
        } PerlIO::get_layers(STDOUT, details => 1)
    ];
    print "αβγxyz\n";
    

    gives me:

    ---
    - unix
    - ''
    - 0x01205200
    - crlf
    - ''
    - 0x00c85200
    - unix
    - ''
    - 0x01201200
    - encoding
    - utf8
    - 0x00c89200
    - crlf
    - ''
    - 0x00c8d200
    αβγxyz

    As before, I do not know enough to judge the full consequences of this. I do intend to build a debug perl at some point to diagnose this further.

    I examined this a little further. Here are some observations from that post:

    The flags for the first unix layer are 0x01205200 = CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG. Why is CRLF set for the unix layer on Windows? I do not know enough about the internals to understand this.

    However, the flags for the second unix layer, the one pushed by my explicit binmode, are 0x01201200 = 0x01205200 & ~CRLF. This is what would have made sense to me to begin with.

    The flags for the first crlf layer are 0x00c85200 = CANWRITE | TRUNCATE | CRLF | LINEBUF | FASTGETS | TTY. The flags for the second crlf layer, which I push after the :encoding(utf8) layer, are 0x00c8d200 = 0x00c85200 | UTF8.
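The flag breakdowns above can be reproduced mechanically. Here is a rough sketch of a decoder; the bit values in the table are my reading of the PERLIO_F_* constants in perliol.h and should be treated as an assumption to verify against your own perl source. The name decode_perlio_flags is hypothetical.

```perl
use strict;
use warnings;

# Assumed PERLIO_F_* bit values (from perliol.h as I understand it);
# check them against the perl source you are actually running.
my @PERLIO_FLAGS = (
    [ EOF      => 0x00000100 ],
    [ CANWRITE => 0x00000200 ],
    [ CANREAD  => 0x00000400 ],
    [ ERROR    => 0x00000800 ],
    [ TRUNCATE => 0x00001000 ],
    [ APPEND   => 0x00002000 ],
    [ CRLF     => 0x00004000 ],
    [ UTF8     => 0x00008000 ],
    [ UNBUF    => 0x00010000 ],
    [ WRBUF    => 0x00020000 ],
    [ RDBUF    => 0x00040000 ],
    [ LINEBUF  => 0x00080000 ],
    [ TEMP     => 0x00100000 ],
    [ OPEN     => 0x00200000 ],
    [ FASTGETS => 0x00400000 ],
    [ TTY      => 0x00800000 ],
    [ NOTREG   => 0x01000000 ],
);

# Hypothetical helper: turn a flags integer into "NAME | NAME | ...",
# listing set bits from lowest to highest.
sub decode_perlio_flags {
    my ($flags) = @_;
    return join ' | ', map { $_->[0] } grep { $flags & $_->[1] } @PERLIO_FLAGS;
}

# The four STDOUT values discussed above.
printf "0x%08x = %s\n", $_, decode_perlio_flags($_)
    for 0x01205200, 0x01201200, 0x00c85200, 0x00c8d200;
```

Under those assumed values, the first line comes out as CANWRITE | TRUNCATE | CRLF | OPEN | NOTREG, matching the breakdown given for the first unix layer.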

    Now, if I open a file using open my $fh, '>:encoding(utf8)', 'ttt', and dump the same information, I get:

    ---
    - unix
    - ''
    - 0x00201200
    - crlf
    - ''
    - 0x00405200
    - encoding
    - utf8
    - 0x00409200

    As expected, the unix layer does not set the CRLF flag.
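To repeat this comparison on another machine without a YAML module, something along these lines should work. It is only a sketch: layer_report is a hypothetical helper, the temporary file stands in for 'ttt', and the layer names and flag values you see will depend on your platform and perl build.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: one formatted line per layer on a handle.
# With details => 1, PerlIO::get_layers returns (name, args, flags) triplets.
sub layer_report {
    my ($fh) = @_;
    my @details = PerlIO::get_layers($fh, details => 1);
    my @lines;
    while (my ($name, $args, $flags) = splice @details, 0, 3) {
        push @lines, sprintf '%-10s %-6s 0x%08x',
            $name, defined $args ? $args : '', $flags;
    }
    return @lines;
}

# Create a scratch file, then reopen it the same way as in the answer.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

open my $fh, '>:encoding(utf8)', $path or die "Cannot open $path: $!";
print "$_\n" for layer_report($fh);
close $fh;
```

Calling layer_report(\*STDOUT) before and after the binmode call gives the corresponding picture for the console handle.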
