Stream-process UTF-16 file with BOM and Unix line endings in Windows perl

Posted by 你。 on 2019-12-25 01:16:17

Question


I need to stream-process, in Perl, a 1 GB text file encoded in UTF-16 little-endian with Unix-style line endings (i.e. only 0x000A, with no 0x000D in the stream) and an LE BOM at the beginning. The file is processed on Windows (Unix solutions are welcome too). By stream-process I mean reading and writing line by line with while (<>). It would be nice to have a command-line one-liner like:
perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt

Hex dump of the input for testing (two lines, the first containing the letter "a" and the second the letter "b"): FF FE 61 00 0A 00 62 00 0A 00

Processing it with s/b/c/g should give this output ("b" replaced with "c"): FF FE 61 00 0A 00 63 00 0A 00

PS. In all my attempts so far, one of two things goes wrong. Either the output gets CRLF line endings (the bytes 0D 0A are written, which produces an invalid Unicode character, whereas I need only 0A 00 with no 0D 00 so the Unix style is preserved), or every new line switches between LE and BE, i.e. the same "a" comes out as 61 00 on odd lines and 00 61 on even lines.


Answer 1:


The best I've come up with is this:

perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/b/c/g;" <infile.txt >outfile.txt

But note that I had to use <infile.txt instead of infile.txt so that the file would be on STDIN. Theoretically, the open pragma should control the encoding used by the magic ARGV filehandle, but I can't get it to work correctly in this case.

The difference between <infile.txt and infile.txt is in how and when the files are opened. With <infile.txt, the file is connected to standard input, and opened before Perl begins running. When you binmode STDIN in a BEGIN block, the file is already open, and you can change the encoding.

When you use infile.txt, the filename is passed as a command line argument and placed in the @ARGV array. When the BEGIN block executes, the file is not open yet, so you can't set its encoding. Theoretically, you ought to be able to say:

use open qw(:std IO :raw:encoding(UTF-16LE));

and have the magic <ARGV> processing apply the right encoding. But I haven't been able to get that to work right in this case.
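If a one-liner is not a hard requirement, one way to sidestep the ARGV problem is to open both files explicitly with the same layers. The following is only a minimal sketch, not something from the original answer: the sub name process_utf16le and the two-argument command line are assumptions made for illustration. It relies on :raw removing the default :crlf layer on Windows, and on :encoding(UTF-16LE) treating the BOM as an ordinary U+FEFF character, so the BOM is read as part of the first line and written back unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): copy $in_path to
# $out_path line by line, replacing every "b" with "c".
sub process_utf16le {
    my ($in_path, $out_path) = @_;

    # :raw drops the default :crlf layer on Windows, so "\n" stays 0A 00.
    # :encoding(UTF-16LE) does not strip or add a BOM; the leading FF FE is
    # decoded as U+FEFF on input and encoded back to FF FE on output.
    open my $in, '<:raw:encoding(UTF-16LE)', $in_path
        or die "Cannot open $in_path: $!";
    open my $out, '>:raw:encoding(UTF-16LE)', $out_path
        or die "Cannot open $out_path: $!";

    # Lines are split on the decoded "\n", so only one line is held in
    # memory at a time, even for a very large file.
    while (my $line = <$in>) {
        $line =~ s/b/c/g;
        print {$out} $line;
    }

    close $out or die "Cannot close $out_path: $!";
    close $in;
    return;
}

# Assumed invocation: perl script.pl infile.txt outfile.txt
process_utf16le(@ARGV) if @ARGV == 2;
```

Because the file names are passed to open directly, the same script should behave the same way on Windows and on Unix. For the one-liner above, note that Unix shells expand $_ inside double quotes, so there the program text would need single quotes instead.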



Source: https://stackoverflow.com/questions/9447653/stream-process-utf-16-file-with-bom-and-unix-line-endings-in-windows-perl
