Run `head` on a text file inside a zipped archive without unpacking the archive

Submitted by 谁说我不能喝 on 2019-12-23 15:22:03

Question


Greetings,

I've taken over from a prior team and am writing ETL jobs that process CSV files. I use a combination of shell scripts and Perl on Ubuntu. The CSV files are huge; they arrive as zipped archives. Unzipped, many are more than 30 GB (yes, that's a G).

The legacy process is a batch job run from cron that unzips each file entirely, reads the first line and copies it into a config file, then re-zips the entire file. Some days this takes many hours of processing time, for no benefit.

Can you suggest a method to only extract the first line (or first few lines) from each file inside a zipped archive, without fully unpacking the archives?


Answer 1:


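The unzip command-line utility has a -p option that writes a member's contents to standard output instead of to disk. Pipe that into head, e.g. unzip -p archive.zip data.csv | head -n 1 (the archive and member names here are placeholders), and nothing is extracted to disk; once head has printed its lines and exits, the pipe closes and unzip normally stops rather than decompressing the rest of the file.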

Alternatively, from perldoc Archive::Zip:

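use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES :CONSTANTS );

# 'archive.zip' and 'xyz.txt' are placeholder names
my $zip = Archive::Zip->new();
$zip->read( 'archive.zip' ) == AZ_OK or die "cannot read archive";

my ($status, $bufferRef);
my $member = $zip->memberNamed( 'xyz.txt' );
# ask for stored (uncompressed) data, so chunks are decompressed as they are read
$member->desiredCompressionMethod( COMPRESSION_STORED );
$status = $member->rewindData();
die "error $status" unless $status == AZ_OK;
while ( ! $member->readIsDone() )
{
   ( $bufferRef, $status ) = $member->readChunk();
   die "error $status" if $status != AZ_OK && $status != AZ_STREAM_END;
   # do something with $bufferRef:
   print $$bufferRef;
}
$member->endRead();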

Modify to suit, e.g. by iterating over the member list returned by $zip->memberNames() and leaving the read loop as soon as you have the first few lines.




Answer 2:


Python's zipfile.ZipFile allows you to access archived files as streams via ZipFile.open(). From there you can process them as necessary.
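As a minimal sketch of that approach: the function below opens each member with ZipFile.open() and reads only its first n lines. The name head_of_members and its parameters are hypothetical, chosen for illustration, and UTF-8 text is assumed. Because the stream decompresses lazily, roughly only the data needed to produce those lines is read, and nothing is written to disk.

```python
import io
import zipfile
from itertools import islice


def head_of_members(zip_path, n=1, encoding="utf-8"):
    """Return {member_name: [first n lines]} without extracting anything to disk.

    head_of_members is a hypothetical helper; the encoding is an assumption.
    """
    heads = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # ZipFile.open() returns a binary stream that decompresses on demand
            with archive.open(info) as raw:
                text = io.TextIOWrapper(raw, encoding=encoding)
                # islice stops after n lines, so the rest of the member is never read
                heads[info.filename] = [
                    line.rstrip("\n") for line in islice(text, n)
                ]
    return heads
```

Calling head_of_members("incoming.zip", n=1) (the path is a placeholder) would give a dictionary mapping each CSV inside the archive to its header line, which can then be written to the config file.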



Source: https://stackoverflow.com/questions/3807761/run-head-on-a-text-file-inside-a-zipped-archive-without-unpacking-the-archive
