Question
I have the code below, which works successfully (kudos to @EdMorton) and is used to parse and clean very large log files, splitting the output into smaller files. The output filename is the first 2 characters of each line; if either of those 2 characters is a special character, it is replaced with a '_' so that the filename contains no illegal characters.
Next, it checks whether any of the output files is larger than a certain size; if so, that file is sub-split by the 3rd character.
This takes about 10 minutes to process 1 GB worth of logs (on my laptop). Can it be made faster? Any help will be appreciated.
Sample log file
"email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com datahere2
email5@foo.com;dtat'ah'ere2
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
email3@foo.com:datahere2
Expected Output
# cat em
email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com:datahere2
email5@foo.com:dtat'ah'ere2
email3@foo.com:datahere2
# cat _leftover
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
Code:
#!/usr/bin/env bash

Func_Clean(){
    pushd $1 > /dev/null
    awk '
        {
            gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
            sub(/[,|;: \t]+/, ":")
            if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/) {
                print
            }
            else {
                print >> "_leftover"
            }
        }
    ' * |
    sort -t':' -k1,1 |
    awk '
        { curr = tolower(substr($0,1,2)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        {
            print >> Fpath
            # print | "gzip -9 -f >> " Fpath  # Throws an error
        }
    ' && rm *.txt
    find * -type f -prune -size +1000000c \( ! -iname "_leftover" \) | while read FILE; do
        awk '
            { curr = tolower(substr($0,1,3)) }
            curr != prev {
                close(Fpath)
                Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
                prev = curr
            }
            {
                print >> Fpath
                # print | "gzip -9 -f >> " Fpath  # Throws an error
            }
        ' "$FILE" && rm "$FILE"
    done
    #gzip -9 -f -r .  # This would work, but is it efficient?
    popd > /dev/null
}

### MAIN - Starting Point ###
BASE_FOLDER="_test2"
for dir in $(find $BASE_FOLDER -type d); do
    if [ $dir != $BASE_FOLDER ]; then
        echo $dir
        time Func_Clean "$dir"
    fi
done
Answer 1:
Wrt the subject "Make awk efficient (again)": awk is already extremely efficient. What you're really looking for are ways to make your particular awk scripts more efficient and to make the shell script that calls awk more efficient.
The only obvious performance improvements I see are:
- Change:
find * -type f -prune -size +1000000c \( ! -iname "_leftover" \) |
while read FILE; do
    awk 'script' "$FILE" && rm "$FILE"
done
to something like (untested):
readarray -d '' files < <(find . -type f -prune -size +1000000c \( ! -iname "_leftover" \) -print0) &&
awk 'script' "${files[@]}" &&
rm -f "${files[@]}"
so you call awk once total instead of once per file.
- Call Func_Clean() once total for all files in all directories instead of once per directory.
- Use GNU parallel or similar to run Func_Clean() on all directories in parallel (see the sketch just below this list).
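Here is a minimal sketch of the GNU parallel idea (untested; it assumes bash and GNU parallel are installed, and the export -f detail is mine, not spelled out above):

# make the function visible to the bash instances that GNU parallel starts
export -f Func_Clean

# run Func_Clean on every subdirectory of $BASE_FOLDER, one job per CPU core by default
find "$BASE_FOLDER" -mindepth 1 -type d -print0 |
parallel -0 Func_Clean {}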
I see you're considering piping the output to gzip to save space. That's fine, but be aware that it will cost you something (idk how much) in terms of execution time. Also, if you do that, then you need to close() the whole output pipeline, since that is what you're writing to from awk, not just the file at the end of it, so your code would be something like (untested):
{ curr = tolower(substr($0,1,3)) }
curr != prev {
    close(Fpath)
    Fpath = "gzip -9 -f >> " gensub(/[^[:alnum:]]/,"_","g",curr)
    prev = curr
}
{ print | Fpath }
This isn't intended to speed things up other than the find suggestion above, it's just a cleanup of the code in your question to reduce redundancy and common bugs: UUOC, missing quotes, the wrong way to read the output of find, incorrect use of >> vs > (see the short illustration at the end of this answer), and so on. Start with something like this (untested and assuming you do need to separate the output files for each directory):
#!/usr/bin/env bash

clean_in() {
    awk '
        {
            gsub(/^[ \t"'\'']+|[ \t"'\'']+$/, "")
            sub(/[,|;: \t]+/, ":")
            if (/^[[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alnum:]]+:/ && /^[\x00-\x7F]*$/) {
                print
            }
            else {
                print > "_leftover"
            }
        }
    ' "${@:--}"
}

split_out() {
    local n="$1"
    shift
    awk -v n="$n" '
        { curr = tolower(substr($0,1,n)) }
        curr != prev {
            close(Fpath)
            Fpath = gensub(/[^[:alnum:]]/,"_","g",curr)
            prev = curr
        }
        { print > Fpath }
    ' "${@:--}"
}

Func_Clean() {
    local dir="$1"
    printf '%s\n' "$dir" >&2
    pushd "$dir" > /dev/null
    clean_in *.txt |
        sort -t':' -k1,1 |
        split_out 2 &&
    rm -f *.txt &&
    readarray -d '' big_files < <(find . -type f -prune -size +1000000c \( ! -iname "_leftover" \) -print0) &&
    split_out 3 "${big_files[@]}" &&
    rm -f "${big_files[@]}"
    popd > /dev/null
}

### MAIN - Starting Point ###
base_folder="_test2"
while IFS= read -r dir; do
    Func_Clean "$dir"
done < <(find "$base_folder" -mindepth 1 -type d)
If I were you I'd start with that (after any necessary testing/debugging) and THEN look for ways to improve the performance.
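One more note on the >> vs > item above: inside a single awk run, > truncates an output file only the first time that file is opened and then appends on every later print, while >> never truncates, so data left over from a previous run survives. A small illustration (mine, not part of the original answer):

printf 'a\nb\n' | awk '{ print > "out" }'     # out now holds a and b: ">" truncated on first open, then appended
printf 'c\n'    | awk '{ print > "out" }'     # out now holds only c: a fresh awk run truncates again
printf 'd\n'    | awk '{ print >> "out" }'    # out now holds c and d: ">>" appends without ever truncating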
Answer 2:
You are making things harder on yourself than they need to be. To separate your log into the file em with the sanitized addresses and put the rest in _leftover, you simply need to identify the lines matching /email[0-9]+@/ and then apply whatever sanitizations you need (e.g. remove anything before "email[0-9]+@", convert any included ';' to ':', add more as needed). You then simply redirect the sanitized lines to em and skip to the next record.
/email[0-9]+@/ {
    $0 = substr($0,match($0,/email[0-9]+@/))
    gsub(/;/,":")
    # add any additional sanitizations here
    print > "em"
    next
}
The next rule simply collects the remainder of the lines in an array.
{a[++n] = $0}
The final rule (the END rule) just loops over the array, redirecting its contents to _leftover.
END {
    for (i=1; i<=n; i++)
        print a[i] > "_leftover"
}
Simply combine your rules into the final script. For example:
awk '
/email[0-9]+@/ {
    $0 = substr($0,match($0,/email[0-9]+@/))
    gsub(/;/,":")
    # add any additional sanitizations here
    print > "em"
    next
}
{a[++n] = $0}
END {
    for (i=1; i<=n; i++)
        print a[i] > "_leftover"
}
' file
When working with awk, it reads each line (record) and then applies each rule you have written, in order, to that record. So you simply write, and order, the rules you need to manipulate the text in each line.
You can use next to skip to the next record, which helps control the logic between rules (along with the usual conditionals, e.g. if, else, ...). The GNU awk manual is a good reference to keep handy as you learn awk.
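As a minimal illustration (mine, not part of the original answer) of how rule order and next interact:

printf 'keep this\ndrop that\n' |
awk '
    /keep/ { print "matched:", $0; next }    # first rule: handles matching lines and skips the remaining rules
           { print "fell through:", $0 }     # second rule: only sees lines the first rule did not consume
'
# prints:
#   matched: keep this
#   fell through: drop that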
Example Use/Output
With your input in file you would receive the following in em and _leftover:
$ cat em
email1@foo.com:datahere2
email2@foo.com:datahere2
email3@foo.com datahere2
email5@foo.com:dtat'ah'ere2
email3@foo.com:datahere2
$ cat _leftover
wrongemailfoo.com
nonascii@row.com;data.is.junk-Œœ
As noted, this script simply trims anything before email...@ and replaces all ';' with ':'. You will need to add any additional clean-ups you need where indicated.
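For example (my addition, not something from the original answer): the third sample line separates the address from the data with a space rather than ':', so one extra substitution at the indicated spot would make it match the expected em output. This is a rough sketch: it rewrites only the first run of spaces, which may be too aggressive if the data itself contains spaces.

/email[0-9]+@/ {
    $0 = substr($0,match($0,/email[0-9]+@/))
    gsub(/;/,":")
    sub(/ +/,":")    # assumed extra clean-up: turn the first run of spaces into ":"
    print > "em"
    next
}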
Source: https://stackoverflow.com/questions/62480910/make-awk-efficient-again