Question
Greetings,
I've taken over from a prior team and am writing ETL jobs that process CSV files, using a combination of shell scripts and Perl on Ubuntu. The CSV files are huge and arrive as zipped archives. Unzipped, many are more than 30 GB - yes, that's a G.
The legacy process is a batch job running on cron that unzips each file entirely, reads its first line and copies it into a config file, then re-zips the entire file. Some days this takes many hours of processing time, for no benefit.
Can you suggest a method to only extract the first line (or first few lines) from each file inside a zipped archive, without fully unpacking the archives?
Answer 1:
The unzip command-line utility has a -p option which dumps a file to standard output. Pipe that into head and the whole file never has to be extracted to disk.
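For example (the archive and member names here are placeholders):

unzip -p data.zip data.csv | head -n 1

head exits after printing the first line; unzip -p writes only to the pipe, never to disk, and the broken pipe then stops the decompression early.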
Alternatively, from perldoc Archive::Zip:
my ($status, $bufferRef);
my $member = $zip->memberNamed( 'xyz.txt' );
$member->desiredCompressionMethod( COMPRESSION_STORED );
$status = $member->rewindData();
die "error $status" unless $status == AZ_OK;
while ( ! $member->readIsDone() )
{
    ( $bufferRef, $status ) = $member->readChunk();
    die "error $status" if $status != AZ_OK && $status != AZ_STREAM_END;
    # do something with $bufferRef:
    print $$bufferRef;
}
$member->endRead();
Modify to suit, e.g. by iterating over the file list returned by $zip->memberNames() and reading only the first few lines of each member, as in the sketch below.
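A self-contained sketch of that modification (the archive name data.zip is a placeholder; the first readChunk() call returns a reference to the first chunk of decompressed data, from which only the first line is kept):

use Archive::Zip qw( :ERROR_CODES :CONSTANTS );

my $zip = Archive::Zip->new( 'data.zip' )                       # placeholder archive name
    or die "cannot read data.zip";
for my $name ( $zip->memberNames() ) {
    my $member = $zip->memberNamed( $name );
    next if $member->isDirectory();                             # skip directory entries
    $member->desiredCompressionMethod( COMPRESSION_STORED );    # deliver decompressed bytes
    die "rewind failed for $name" unless $member->rewindData() == AZ_OK;
    my ( $bufferRef, $status ) = $member->readChunk();          # first chunk only (32 KiB by default)
    die "read failed for $name" if $status != AZ_OK && $status != AZ_STREAM_END;
    my ( $first_line ) = split /\n/, $$bufferRef;               # keep just the first line
    print "$name: $first_line\n";
    $member->endRead();
}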
Answer 2:
Python's zipfile.ZipFile lets you access archived files as streams via ZipFile.open(), so you can read just the lines you need and process them as necessary.
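A minimal sketch along those lines (the archive name data.zip is a placeholder; only one line is read per member, so nothing is decompressed in full):

import zipfile

with zipfile.ZipFile('data.zip') as archive:           # placeholder archive name
    for name in archive.namelist():
        if name.endswith('/'):                         # skip directory entries
            continue
        with archive.open(name) as member:             # file-like stream, no extraction to disk
            first_line = member.readline()
            print(name, first_line.decode('utf-8', 'replace').rstrip())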
Source: https://stackoverflow.com/questions/3807761/run-head-on-a-text-file-inside-a-zipped-archive-without-unpacking-the-archive