grep -f maximum number of patterns?

悲&欢浪女 2020-12-18 04:53

I'd like to use grep on a text file with -f to match a long list (10,000) of patterns. Turns out that grep doesn't like this (who knew?). After a day, it didn't produce any output.
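For reference, the invocation has this shape (the file names here are made up):

grep -f patterns.txt big.txt

where patterns.txt holds one pattern per line.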

5 Answers
  •  攒了一身酷
    2020-12-18 05:19

    Here is a bash script you can run on your files (or if you would like, a subset of your files). It will split the key file into increasingly large blocks, and for each block attempt the grep operation. The operations are timed - right now I'm timing each grep operation, as well as the total time to process all the sub-expressions. Output is in seconds - with some effort you can get ms, but with the problem you are having it's unlikely you need that granularity. Run the script in a terminal window with a command of the form

    ./timeScript keyFile textFile 100 > outputFile

    This will run the script, using keyFile as the file where the search keys are stored, and textFile as the file where you are looking for keys, and 100 as the initial block size. On each loop the block size will be doubled.
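    As a quick way to try it out, you could build a key file from any line-oriented source, for example (the dictionary path is an assumption; substitute any file with one pattern per line):

    head -n 10000 /usr/share/dict/words > keyFile
    ./timeScript keyFile textFile 100 > outputFile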

    In a second terminal, run the command

    tail -f outputFile

    which will let you follow the output that the other process is writing to outputFile

    I recommend that you open a third terminal window, and that you run top in that window. You will be able to see how much memory and CPU your process is taking - again, if you see vast amounts of memory consumed it will give you a hint that things are not going well.

    This should allow you to find out when things start to slow down - which is the answer to your question. I don't think there's a "magic number" - it probably depends on your machine, and in particular on the file size and the amount of memory you have.

    You could take the output of the script and put it through a grep:

    grep entire outputFile

    You will end up with just the summaries - block size, and time taken, e.g.

    Time for processing entire file with blocksize 800: 4 seconds

    If you plot these numbers against each other (or simply inspect the numbers), you will see when the algorithm is optimal, and when it slows down.
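    For example, assuming the summary lines keep the exact wording shown above, a small awk filter turns them into blocksize/seconds pairs you can feed to a plotting tool (times.dat is a made-up name):

    grep entire outputFile | awk '{gsub(":", "", $8); print $8, $9}' > times.dat

    Each line of times.dat is then a block size followed by the total time for that pass.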

    Here is the code: I did not do extensive error checking but it seemed to work for me. Obviously in your ultimate solution you need to do something with the outputs of grep (instead of piping it to wc -l, which I did just to see how many lines were matched); a sketch of that follows the script...

    #!/bin/bash
    # script to look at difference in timing
    # when grepping a file with a large number of expressions
    # assume first argument = name of file with list of expressions
    # second argument = name of file to check
    # optional third argument = initial block size (default 100)
    #
    # split f1 into chunks of 1, 2, 4, 8... expressions at a time
    # and print out how long it took to process all the lines in f2
    
    if (($# < 2 )); then
      echo "Warning: need at least two parameters."
      echo "Usage: timeScript keyFile searchFile [initial blocksize]"
      exit 1
    fi
    
    f1_linecount=$(wc -l < "$1")
    echo "linecount of file1 is $f1_linecount"
    
    f2_linecount=$(wc -l < "$2")
    echo "linecount of file2 is $f2_linecount"
    echo
    
    if (($# < 3 )); then
      blockLength=100
    else
      blockLength=$3
    fi
    
    while (($blockLength < f1_linecount))
    do
      echo Using blocks of $blockLength
      # split is a standard utility (not a shell builtin) that splits the file
      # -l tells it to break after $blockLength lines
      # and block$blockLength is the prefix for the output files
      split -l "$blockLength" "$1" "block$blockLength"
      Tstart="$(date +%s)"
      Tbefore=$Tstart
    
      for fn in block*
        do
          echo "grep -f $fn $2 | wc -l"
          echo number of lines matched: `grep -f $fn $2 | wc -l`
          Tnow="$(($(date +%s)))"
          echo Time taken: $(($Tnow - $Tbefore)) s
          Tbefore=$Tnow
        done
      echo "Time for processing entire file with blocksize $blockLength: $((Tnow - Tstart)) seconds"
      blockLength=$((2 * blockLength))
      # remove the split files - no longer needed
      rm block*
      echo "block length is now $blockLength and f1 linecount is $f1_linecount"
    done
    
    exit 0
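
    As a sketch of that last point (the file name matches.txt is made up, not part of the script): inside the for loop you could accumulate the matching lines instead of counting them, e.g.

    grep -f "$fn" "$2" >> matches.txt

    and deduplicate afterwards with sort -u matches.txt, since a line of the search file can be hit by patterns in more than one block.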
    
