Split STDIN into multiple files (and compress them if possible)

Posted by 假如想象 on 2019-12-07 01:48:55

Question


I have a program (gawk) that writes a stream of data to its STDOUT. The data amounts to literally tens of GB. I don't want to persist it in a single file, but rather split it into chunks and potentially apply some extra processing (such as compression) to each chunk before saving it.

My data is a sequence of records, and I don't want the splitting to cut a record in half. Each record matches the following regexp:

^\{index.+?\}\}\n\{.+?\}$

or, for simplicity, you can assume that two lines (an odd-numbered line followed by an even-numbered one, counting from the beginning of the stream) always make a record.

Can I:

  • use some standard Linux command to split STDIN by specifying a preferred chunk size? It doesn't need to be exact, since the variable record size means it can't be guaranteed. Alternatively, just a number of records per chunk, if defining the chunk by size is impossible
  • compress each chunk and store it in a file (with some numbering in its name, like 001, 002, etc.)

I've become aware of commands like GNU parallel and csplit, but I don't know how to put them together. It would be nice if the functionality described above could be achieved without writing a custom Perl script. That could be a last-resort solution, but again, I'm not sure how best to implement it.


Answer 1:


GNU Parallel can split stdin into chunks of records. This will split stdin into 50 MB chunks with each record being 2 lines. Each chunk will be passed to gzip and compressed to the name [chunk number].gz:

cat big | parallel -l2 --pipe --block 50m gzip ">"{#}.gz

If you know your second line will never start with '{index' you can use '{index' as the record start:

cat big | parallel --recstart '{index' --pipe --block 50m gzip ">"{#}.gz
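Where GNU Parallel is not installed, a similar result can be sketched with split from GNU coreutils. This is a different technique from the one above: it chunks by a fixed number of lines rather than by size, and it relies on the --filter option, which is specific to GNU split. The produce function, the chunk_ prefix, the output directory and the tiny chunk size are all illustrative assumptions, not part of the original answer.

```shell
#!/usr/bin/env bash
# Sketch: split a stream into gzip-compressed chunks by record count.
# Assumes GNU coreutils split (for --filter) and gzip; all names are illustrative.
set -euo pipefail

outdir=$(mktemp -d)                          # hypothetical output directory
records_per_chunk=2                          # tiny value for the demo only
lines_per_chunk=$((records_per_chunk * 2))   # each record is two lines

# Stand-in for the real producer (gawk in the question): 5 two-line records.
produce() {
  for i in 1 2 3 4 5; do
    printf '{index:{"_id":%d}}\n{"value":%d}\n' "$i" "$i"
  done
}

# -l        lines per chunk (an even number, so no record is cut in half)
# -d -a 3   numeric suffixes of width 3: 000, 001, ...
# --filter  pipe each chunk through gzip; split exports $FILE as the chunk name
produce | split -l "$lines_per_chunk" -d -a 3 \
  --filter='gzip > "$FILE.gz"' - "$outdir/chunk_"

ls "$outdir"
```

Because the line count per chunk is even, every chunk starts on a record boundary. In real use the producer would replace produce, and records_per_chunk would be set much higher.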

You can then easily test if the splitting went correctly by:

parallel zcat {} \| wc -l ::: *.gz

Unless your records are all the same length, the chunks will probably contain different numbers of lines, but every count should be even.
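The same check can be scripted so that only problem chunks are reported. The following is a minimal sketch in plain shell, assuming gzip-compressed chunks and two-line records; check_chunks is a hypothetical helper name, and the two demo files exist only to show both outcomes.

```shell
#!/usr/bin/env bash
# Sketch: flag any compressed chunk whose line count is odd (a record was cut).
set -euo pipefail

# check_chunks is a hypothetical helper; it takes .gz files as arguments
# and returns non-zero if any of them holds an odd number of lines.
check_chunks() {
  local bad=0 f n
  for f in "$@"; do
    n=$(gzip -dc "$f" | wc -l)
    if [ $((n % 2)) -ne 0 ]; then
      echo "odd line count ($((n))): $f"
      bad=1
    fi
  done
  return "$bad"
}

# Demo data: one intact chunk (2 lines) and one broken chunk (3 lines).
demo=$(mktemp -d)
printf 'a\nb\n'    | gzip > "$demo/1.gz"
printf 'a\nb\nc\n' | gzip > "$demo/2.gz"

check_chunks "$demo/1.gz"   && echo "all records intact"
check_chunks "$demo"/*.gz   || echo "found a split record"
```

On real output the call would simply be check_chunks *.gz in the directory holding the chunks.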

Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.


Source: https://stackoverflow.com/questions/22628610/split-stdin-to-multiple-files-and-compress-them-if-possible
