grep from tar.gz without extracting [faster one]

Happy的楠姐 2020-12-13 04:02

I am trying to grep a pattern from a dozen .tar.gz files, but it's very slow.

I am using:

tar -ztf file.tar.gz | while read FILENAME
do
        if tar -zxf file         


        
8 answers
  •  隐瞒了意图╮
    2020-12-13 04:02

    All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory to find a pattern that is specified as an argument in a reusable script to output:

    • The name of both the archive file and the extracted file
    • The line number where the pattern was found
    • The contents of the matching line

    It's what I was really hoping that zgrep could do for me and it just can't.

    Here's my solution:

    pattern=$1
    for f in *.tar.gz; do
         echo "$f:"
         tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true";
    done
    

    You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:

    tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""
    

    Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename are obvious.

    tar -xzf: x extract, z filter through gzip, f based on the following archive file...

    "$f": The archive file provided by the for loop (such as what you'd get by doing an ls) in double-quotes to allow the variable to expand and ensure that the script is not broken by any file names with spaces, etc.

    --to-command: Instead of writing each extracted file to the filesystem, pipe its contents to the standard input of another command, which tar runs once per file. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.
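
To see what --to-command hands to the command, here is a minimal sketch, assuming GNU tar. The archive, its file names, and the to_command_demo helper are made up for illustration. It prints each member's path (from $TAR_FILENAME) and the number of lines that arrived on standard input.

```shell
# Build a throwaway archive (hypothetical file names) in a temp directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/logs"
printf 'alpha\nbeta\n' > "$demo_dir/logs/a.txt"
printf 'gamma\n' > "$demo_dir/logs/b.txt"
tar -czf "$demo_dir/demo.tar.gz" -C "$demo_dir" logs

# to_command_demo is a hypothetical helper. tar runs the quoted command once
# per archived file: the file contents arrive on stdin (read here by wc -l)
# and the path stored in the archive is available as $TAR_FILENAME.
to_command_demo() {
    tar -xzf "$1" --to-command \
        'printf "%s lines=%s\n" "$TAR_FILENAME" "$(wc -l | tr -d " ")"'
}

to_command_demo "$demo_dir/demo.tar.gz"
```

Each archived file produces one line of output, and the order follows the order of members in the archive.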

    Let's break that part down by itself, since it's the "secret sauce" here.

    'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
    

    First, we use a single-quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.

    grep: The command to be run on the (not actually) extracted files

    --label=: The label to prepend to the results. Its value is enclosed in double-quotes because we do want the $TAR_FILENAME environment variable, which tar passes to the command, to be resolved when the command runs.

    basename $TAR_FILENAME: Runs as a command (surrounded by backticks) and removes directory path and outputs only the name of the file

    -Hin: H Display filename (provided by the label), i Case insensitive search, n Display line number of match

    Then we "end" the first part of the command string with a single quote and start up the next part with a double quote so that the $pattern, passed in as the first argument, can be resolved.

    Realizing which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and I've forgotten about the script I made for it already!)
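
If the mixed quoting is hard to keep straight, one possible alternative is to hand the pattern to grep through the environment, so the whole command string can stay inside single quotes. This is a sketch rather than the script above: it assumes GNU tar and GNU grep, and targrep_one and TARGREP_PATTERN are names invented here.

```shell
# targrep_one ARCHIVE PATTERN  (hypothetical helper, not the original script)
targrep_one() {
    # TARGREP_PATTERN is an assumed variable name. It is exported only for
    # this tar invocation and inherited by the command tar starts, so nothing
    # in the single-quoted string is expanded by the calling shell.
    # -e keeps a pattern that starts with "-" from being read as an option;
    # "|| true" hides grep's non-zero exit status when a file has no match.
    TARGREP_PATTERN=$2 tar -xzf "$1" --to-command \
        'grep --label="$(basename "$TAR_FILENAME")" -Hin -e "$TARGREP_PATTERN" || true'
}
```

Called as targrep_one some.tar.gz "ford f", it prints matches as name:line:text, and a pattern containing spaces or quotes needs no extra escaping.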


    It's been a couple of weeks since I wrote the above, and it's still super useful, but it wasn't quite good enough: files have piled up and searching has gotten messier. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.

    if [ -z "$1" ]; then
        echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
        echo "Usage: targrep <pattern> [start date]"
        exit 1
    fi
    pattern=$1
    startdatein=$2
    startdate=$(date -d "$startdatein" +%s)
    for f in *.tar.gz; do
        filedate=$(date -r "$f" +%s)
        if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
            echo "$f:"
            tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
        fi
    done
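
The date filter boils down to comparing two epoch timestamps: the start date parsed by date -d, and the archive's modification time read by date -r. A small sketch of that test in isolation, assuming GNU date (is_recent is a hypothetical helper name):

```shell
# is_recent FILE START
# Succeeds when FILE was modified on or after START (any date string that
# GNU date -d accepts), or when START is empty (no filter requested).
is_recent() {
    [ -z "$2" ] && return 0
    start_epoch=$(date -d "$2" +%s) || return 2   # unparseable date
    file_epoch=$(date -r "$1" +%s) || return 2    # missing file
    [ "$file_epoch" -ge "$start_epoch" ]
}
```

In the loop this could be used as: if is_recent "$f" "$startdatein"; then ... fi.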
    

    And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too.

    Usage:

    targrep.sh [-d <start date>] [-f <file name>] <pattern>

    Example:

    targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford

    while getopts "d:f:" opt; do
        case $opt in
                d) startdatein=$OPTARG;;
                f) targetfile=$OPTARG;;
        esac
    done
    shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
    pattern=$1
    
    echo "Searching for: $pattern"
    if [[ -n $targetfile ]]; then
        echo "in filenames:  $targetfile"
    fi
    
    startdate=$(date -d "$startdatein" +%s)
    for f in *.tar.gz; do
        filedate=$(date -r "$f" +%s)
        if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
                echo "$f:"
                if [[ -z "$targetfile" ]]; then
                        tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
                else
                        tar -xzf "$f" --wildcards --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
                fi
        fi
    done
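
Before running a search, it can help to check which members a -f filter would select. A minimal sketch, assuming GNU tar, where member-name globs in list and extract mode are switched on with --wildcards (list_matching_members is a hypothetical helper name):

```shell
# list_matching_members ARCHIVE GLOB
# --wildcards treats GLOB as a shell-style pattern; --no-anchored lets it
# match after any "/" instead of only at the start of the stored path.
# Both options apply to the member names that follow them on the command line.
list_matching_members() {
    tar -tzf "$1" --wildcards --no-anchored "$2"
}
```

For example, list_matching_members some.tar.gz "*vehicle_models.csv" prints only the stored paths whose file name ends in vehicle_models.csv.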
    
