quickest way to select/copy lines containing string from huge txt.gz file

喜你入骨 submitted on 2021-02-19 04:21:49

Question


So I have the following sed one liner:

sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' -e 's/H|/,H|/g' -e 's/Q|/,,Q|/g' -e '1 i\,,,' sample_1.txt > sample_2.txt

I have many lines that start with either:

  • S|
  • T|
  • #D=
  • ##
  • H|
  • Q|

The idea is to skip the lines starting with one of the first four prefixes, and to replace H| (at the beginning of a line) with ,H| and Q| (at the beginning of a line) with ,,Q|
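To make the intended result concrete, here is a minimal sketch that runs the same filter on a few made-up lines. The helper name `filter_lines` and the sample data are hypothetical. Unlike the one-liner above, the substitutions are anchored with `^` so only a leading H| or Q| is rewritten, and the `,,,` header line is left out to keep the example short.

```shell
# Hypothetical helper: an anchored variant of the one-liner, reading stdin.
# In sed basic regular expressions a bare | is a literal pipe character.
filter_lines() {
    sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' \
        -e 's/^H|/,H|/' -e 's/^Q|/,,Q|/'
}

# Made-up sample input, one line per argument.
printf '%s\n' '##meta' 'S|skip' 'H|head|1' 'Q|query|2' 'other' | filter_lines
# prints:
# ,H|head|1
# ,,Q|query|2
# other
```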

But now I would need to:

  • use the fastest way possible (internet suggests (m)awk is faster than sed)
  • read from a .txt.gz file and save the result in a .txt.gz file, avoiding, if possible, the intermediate un-zip/re-zip

There are in fact several hundred .txt.gz files, each about 1 GB, to process in this way (all in the same folder). Is there a CLI way to run the code in parallel on all of them (so that each core is assigned a subset of the files in the directory)?

I use Linux (Ubuntu).


Answer 1:


Untested, but likely pretty close to this with GNU Parallel.

First make output directory so as not to overwrite any valuable data:

mkdir -p output

Now declare a function that does one file and export it to subprocesses so jobs started by GNU Parallel can find it:

doit(){
    echo "Processing $1"
    # zcat ships with gzip on Ubuntu (gzcat is the macOS name)
    zcat "$1" | awk '
        /^[ST]\|/ || /^#D=/ || /^##/ {next}    # skip lines starting S|, T|, #D= or ##
        /^H\|/ {print "," $0; next}            # prefix "H|" lines with ","
        /^Q\|/ {print ",," $0; next}           # prefix "Q|" lines with ",,"
        1                                      # print all other lines unchanged
    ' | gzip > output/"$1"
}
export -f doit

Now process all txt.gz files in parallel and show progress bar too:

parallel --bar doit ::: *txt.gz
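With several hundred files it may be worth spot-checking a result before trusting the whole batch. The following is only a sketch: `check_counts` is a hypothetical helper, and it assumes the filter writes exactly one output line for every input line that does not start with S|, T|, #D= or ##.

```shell
# Hypothetical helper: check_counts INPUT.gz OUTPUT.gz
# Succeeds when OUTPUT.gz is a valid gzip file and holds exactly as many
# lines as INPUT.gz has lines that do not start with a dropped prefix.
check_counts() {
    in_file=$1
    out_file=$2
    gzip -t -- "$out_file" || return 1     # verify compressed-file integrity
    kept=$(gzip -dc -- "$in_file" | LC_ALL=C grep -c -v -E '^([ST][|]|#D=|##)')
    written=$(gzip -dc -- "$out_file" | wc -l | tr -d ' ')
    [ "$kept" -eq "$written" ]
}
```

For example, `check_counts sample_1.txt.gz output/sample_1.txt.gz && echo OK`.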



Answer 2:


Was something like this what you had in mind?

#!/bin/bash

export LC_ALL=C

zcat sample_1.txt.gz | gawk '
$1 !~ /^([ST]\||#D=|##)/ {
    switch ($0) {
        case /^H\|/:
            print "," $0
            break
        case /^Q\|/:
            print ",," $0
            break
        default:
            print $0
    }
}' | gzip > sample_2.txt.gz

The export LC_ALL=C selects the plain byte-oriented C locale, so the tools do not have to handle multibyte characters, which can speed up execution considerably. zcat decompresses a gz file and writes it to stdout. That is piped into gawk, which checks that each line does not start with one of the first four prefixes listed in your question. Lines that pass that test are written to stdout, with the "," or ",," prefix added as requested. As gawk runs, its stdout is piped into gzip and written to a .txt.gz file, so no uncompressed intermediate file is created.

It might be possible to use xargs with the -P and -n switches to parallelize your processing, but I think GNU parallel might be easier to work with.
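For comparison, a minimal sketch of the xargs route, assuming GNU findutils and coreutils as shipped on Ubuntu. The names `FILTER` and `process_all` are hypothetical, and the output directory should differ from the input directory so that no file is overwritten while it is being read.

```shell
# awk program shared with the worker shells through the environment.
FILTER='
/^[ST]\|/ || /^#D=/ || /^##/ { next }   # skip unwanted lines
/^H\|/ { print "," $0; next }           # prefix H| lines with ","
/^Q\|/ { print ",," $0; next }          # prefix Q| lines with ",,"
{ print }                               # pass everything else through
'
export FILTER

# Hypothetical helper: process_all INPUT_DIR OUTPUT_DIR
process_all() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    # -n 1: one file per worker invocation; -P: one worker per core;
    # -0 / -print0: safe with unusual file names; -r: do nothing if no files.
    find "$in_dir" -maxdepth 1 -name '*.txt.gz' -print0 |
        xargs -0 -r -n 1 -P "$(nproc)" sh -c '
            gzip -dc -- "$2" | LC_ALL=C awk "$FILTER" |
                gzip > "$1/$(basename -- "$2")"
        ' sh "$out_dir"
}
```

Calling `process_all . output` would then process every .txt.gz file in the current directory and write the results to output/.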



Source: https://stackoverflow.com/questions/50915850/quickest-way-to-select-copy-lines-containing-string-from-huge-txt-gz-file
