Properly detect line-endings of a file in Perl?

Posted by 跟風遠走 on 2019-12-23 09:56:25

Question


Problem: I have data (mostly in CSV format) produced on both Windows and *nix, and processed mostly on *nix. Windows uses CRLF for line endings and Unix uses LF. For any particular file, I don't know whether it has Windows or *nix line endings. Up until now, I've been writing something like this to handle the difference:

while (<$fh>){
    tr/\r\n//d;
    my @fields = split /,/, $_;
    # ...
}

On *nix, the \n part is equivalent to chomping, and the \r part additionally gets rid of the CR if it's a Windows-produced file.

But now I want to use Text::CSV_XS because I'm starting to get weirder data files with quoted data, potentially with embedded line breaks, etc. To read such files, Text::CSV_XS::getline() requires that you specify the end-of-line characters. (I can't read each line as above, apply tr/\r\n//d, and then parse it with Text::CSV, because that wouldn't handle embedded line breaks properly.) How do I properly detect whether an arbitrary file uses Windows or *nix style line endings, so I can tell Text::CSV_XS::eol() how to chomp()?

I couldn't find a module on CPAN that simply detects line endings. I don't want to first convert all my data files via dos2unix, because the files are huge (hundreds of gigabytes), and spending 10+ minutes on each file to deal with something so simple seems silly. I thought about writing a function that reads the first several hundred bytes of a file and counts LFs vs. CRLFs, but I refuse to believe this doesn't have a better solution.

Any help?

Note: each file has either entirely Windows line endings or entirely *nix line endings, i.e., the two are never mixed in a single file.


Answer 1:


You could just open the file using the :crlf PerlIO layer and then tell Text::CSV_XS to use \n as the line ending character. This will silently map any CR/LF pairs to single line feeds, but that's presumably what you want.

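use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );

# The :crlf layer maps CRLF pairs to LF as the file is read
open( my $fh, '<:crlf', 'data.csv' ) or die "Cannot open data.csv: $!";

while ( my $row = $csv->getline( $fh ) ) {
     # do something with $row (an array reference of fields)
}
close $fh;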



Answer 2:


Since Perl 5.10, you can use \R, which matches a generic line ending (LF, CRLF, or a lone CR, among others), to strip line endings:

s/\R//g;

It should work in all cases, both *nix and Windows.
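As a small illustration of \R, here is a sketch that strips only the trailing line ending of one line, rather than every line break as the /g form does. The helper name strip_eol is hypothetical and not part of any module. Note that this still works line by line, so it does not by itself handle quoted CSV fields with embedded line breaks.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing generic line ending
# (LF, CRLF, or CR) from a string. Requires Perl 5.10+ for \R.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\R\z//;    # \z anchors at the very end of the string
    return $line;
}

# The same call handles both conventions
my @fields_unix = split /,/, strip_eol("a,b,c\n");
my @fields_dos  = split /,/, strip_eol("a,b,c\r\n");
```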




Answer 3:


Read in the first line of each file and look at its last-but-one character. If it is \r, the file comes from Windows; if not, it is *nix. Then seek back to the beginning and start processing.
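A minimal sketch of this first-line check follows. The function name detect_eol is made up for illustration, and the code assumes the handle was opened in raw mode on a seekable file. If the first record contains a quoted field with an embedded line break, the "first line" ends at that embedded break, so the guess is only as good as the first physical line.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the line ending from the first physical line
# of an already-open filehandle, then rewind to the start of the file.
sub detect_eol {
    my ($fh) = @_;
    local $/ = "\n";                 # read up to and including the first LF
    my $first = <$fh>;
    seek( $fh, 0, 0 ) or die "Cannot seek: $!";
    return ( defined $first && $first =~ /\r\n\z/ ) ? "\r\n" : "\n";
}
```

The handle would be opened with something like open( my $fh, '<:raw', $path ), and the returned value could then be passed as the eol attribute when constructing the Text::CSV_XS object.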

If it is possible for a file to have mixed line endings (e.g. a different type for embedded newlines), you can only guess.




Answer 4:


In theory, line endings cannot be determined reliably: is this file a single line with DOS line endings and embedded \ns, or is it a bunch of lines with a few stray \r characters at the end of some lines?

foo\n
ba\r\n

versus

foo\nba\r\n

If statistical analysis is not an option because it is too inaccurate and expensive (it takes time to scan such huge files), you have to actually know what the encoding is.

It would be best to specify the exact file format if you have control over the producing applications or to use some kind of metadata to keep track of the platform the data was produced on.

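In Perl, what \n represents is platform-dependent (not locale-dependent): it is LF (\012) on *nix machines and CR (\015) on old Macs (classic Mac OS). On DOS descendants such as Windows it is LF in memory, but the I/O layer translates it to and from the sequence \015\012 when a file is handled in text mode. So to do reliable processing, you should use the octal values.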
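The sampling idea mentioned in the question can be combined with the octal escapes recommended here. The following is only a sketch: the function name guess_eol_from_sample and the 4096-byte default sample size are assumptions chosen for illustration, and the result is a guess based on the start of the file, not a guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the line ending by counting CRLF pairs and
# bare LFs in the first $sample_size bytes, read without any translation.
sub guess_eol_from_sample {
    my ( $path, $sample_size ) = @_;
    $sample_size ||= 4096;    # assumed default, tune as needed

    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    my $buf = '';
    defined( read( $fh, $buf, $sample_size ) )
        or die "Cannot read $path: $!";
    close $fh;

    my $crlf = () = $buf =~ /\015\012/g;         # CR LF pairs
    my $lf   = () = $buf =~ /(?<!\015)\012/g;    # LF not preceded by CR

    return $crlf > $lf ? "\015\012" : "\012";
}
```

Because only a fixed-size prefix is read, the cost is the same for a small file and for one that is hundreds of gigabytes.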




Answer 5:


You can use the PERLIO variable. This has the advantage of not having to modify the source code of your scripts depending on the platform.

If you're dealing with DOS text files, set the environment variable PERLIO to :unix:crlf:

$ PERLIO=:unix:crlf my-script.pl dos-text-file.txt

If you're mainly dealing with DOS text files (e.g. on Cygwin), you could put this in your .bashrc:

export PERLIO=:unix:crlf

(I think that value should be the default for PERLIO on Cygwin, but apparently it's not.)



Source: https://stackoverflow.com/questions/12168282/properly-detect-line-endings-of-a-file-in-perl
