quickest way to select/copy lines containing string from huge txt.gz file

Question

So I have the following sed one-liner:

sed -e '/^S|/d' -e '/^T|/d' -e '/^#D=/d' -e '/^##/d' -e 's/H|/,H|/g' -e 's/Q|/,,Q|/g' -e '1 i\,,,' sample_1.txt > sample_2.txt

I have many lines that start with either:

S|
T|
#D=
##
H|
Q|

The idea is to skip the lines starting with one of the first four prefixes, and to replace H| (at the beginning of lines) with ,H| and Q| (at the beginning of lines) with ,,Q|.

But now I would need to:

use the fastest way possible (internet suggests (m)awk is faster than sed)
read from a .txt.gz file and save the result in a .txt.gz file, avoiding, if possible, the intermediate un-zip/re-zip

There are in fact several hundred .txt.gz files, each about 1 GB, to process in this way (all in the same folder). Is there a CLI way to run the code in parallel on all of them (so that each core is assigned a subset of the files in the directory)?

I use Linux (Ubuntu).

Answer 1:

Untested, but likely pretty close to this with GNU Parallel.

First make an output directory so as not to overwrite any valuable data:

mkdir -p output

Now declare a function that does one file and export it to subprocesses so jobs started by GNU Parallel can find it:

doit(){
    echo "Processing $1"
    # gzip -dc is used instead of gzcat, which is not installed by default on Ubuntu
    gzip -dc "$1" | awk '
        /^[ST]\|/ || /^#D=/ || /^##/ {next}    # ignore lines starting S|, T| etc
        /^H\|/ {print "," $0; next}            # prefix "H|" with ","
        /^Q\|/ {print ",," $0; next}           # prefix "Q|" with ",,"
        1                                      # print all other lines
    ' | gzip > output/"$1"
}
export -f doit

Now process all txt.gz files in parallel and show progress bar too:

parallel --bar doit ::: *txt.gz
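Before launching hundreds of 1 GB jobs, it may be worth checking the filter on a handful of lines. The following is only a sketch: filter_lines is a hypothetical helper name, and the BEGIN rule is an assumption about what the `1 i\,,,` in the question's sed command is meant to achieve. Unlike sed, it prints the `,,,` line even when the first input line is dropped.

```shell
# Hypothetical helper: applies the rules from the question to stdin,
# writing the result to stdout.
filter_lines() {
    LC_ALL=C awk '
        BEGIN { print ",,," }                       # assumed header line, as in the sed "1 i\,,,"
        /^[ST][|]/ || /^#D=/ || /^##/ { next }      # drop S|, T|, #D= and ## lines
        /^H[|]/ { print "," $0; next }              # H|... becomes ,H|...
        /^Q[|]/ { print ",," $0; next }             # Q|... becomes ,,Q|...
        { print }                                   # everything else unchanged
    '
}

# Small smoke test on inline sample data
printf '%s\n' 'S|skip' 'H|head' 'Q|query' '##comment' 'other' | filter_lines
```

For this sample the output is the four lines `,,,`, `,H|head`, `,,Q|query` and `other`. The bracket form `[|]` is used instead of `\|` only because it reads the same way in every awk implementation.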

Answer 2:

Was something like this what you had in mind?

#!/bin/bash

export LC_ALL=C

zcat sample_1.txt.gz | gawk '
$1 !~ /^([ST]\||#D=|##)/ {
    switch ($0) {
        case /^H\|/:
            print "," $0
            break
        case /^Q\|/:
            print ",," $0
            break
        default:
            print $0
    }
}' | gzip > sample_2.txt.gz

The export LC_ALL=C selects the C locale, so the tools treat the input as plain bytes rather than multibyte characters, which can speed up execution considerably. zcat decompresses a .gz file to stdout. That is piped into gawk, which skips any line whose first field starts with one of the first four prefixes in your question. The remaining lines are written to stdout, with the H| and Q| lines prefixed as requested. gawk's output is piped into gzip and written to a .txt.gz file, so no uncompressed intermediate file is created. Note that switch is a gawk extension, so this script will not run under mawk.
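Since nothing uncompressed ever lands on disk, it can be useful to verify a result file in the same streaming fashion. A minimal sketch, assuming the four unwanted prefixes from the question; check_output is a hypothetical helper name, not part of either answer:

```shell
# Hypothetical checker: succeeds only if the file is a valid gzip archive
# and contains no line starting with S|, T|, #D= or ##.
check_output() {
    gzip -t -- "$1" || return 1
    # grep -q exits 0 on the first match, so negate the pipeline's status
    ! gzip -dc -- "$1" | LC_ALL=C grep -q -E '^([ST][|]|#D=|##)'
}
```

It could be called as `check_output sample_2.txt.gz && echo OK` once the conversion has finished.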

It might be possible to use xargs with the -P and -n switches to parallelize your processing, but I think GNU parallel might be easier to work with.
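For completeness, the xargs route might look roughly like the sketch below. It assumes GNU find and xargs (as shipped with Ubuntu) and uses a portable awk program rather than the gawk-only switch. The names process_all, AWK_PROG and DST, and the default of 4 jobs, are illustrative choices, not something taken from the question.

```shell
# awk program kept in a variable so it can be handed to the child shells
AWK_PROG='
    /^([ST][|]|#D=|##)/ { next }          # drop unwanted lines
    /^H[|]/ { print "," $0; next }        # H|... becomes ,H|...
    /^Q[|]/ { print ",," $0; next }       # Q|... becomes ,,Q|...
    { print }
'

# Hypothetical helper: process_all SRC_DIR DST_DIR [JOBS]
# Converts every *.txt.gz directly inside SRC_DIR, writing results of the
# same name to DST_DIR, with up to JOBS files processed at once.
process_all() {
    local src=$1 dst=$2 jobs=${3:-4}
    mkdir -p "$dst" || return 1
    find "$src" -maxdepth 1 -type f -name '*.txt.gz' -print0 |
        AWK_PROG="$AWK_PROG" DST="$dst" LC_ALL=C \
        xargs -0 -r -n 1 -P "$jobs" sh -c '
            gzip -dc -- "$1" | awk "$AWK_PROG" | gzip > "$DST/${1##*/}"
        ' sh
}
```

A call such as `process_all . output "$(nproc)"` would then start one job per core. Here -n 1 passes a single file to each child shell and -P caps the number of children running at once; DST_DIR should differ from SRC_DIR so the inputs are not overwritten.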

Source: https://stackoverflow.com/questions/50915850/quickest-way-to-select-copy-lines-containing-string-from-huge-txt-gz-file

Tags

Linux

Ubuntu

awk

sed

grep