grep -f maximum number of patterns?

悲&欢浪女 2020-12-18 04:53

I'd like to use grep on a text file with -f to match a long list (10,000) of patterns. Turns out that grep doesn't like this (who knew?). After a day, it didn't produce any output.
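For reference, the invocation has this shape (the file names here are made up):

grep -f patterns.txt big.txt

where patterns.txt holds one pattern per line.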

5 Answers
  •  攒了一身酷
    2020-12-18 05:19

    Here is a bash script you can run on your files (or if you would like, a subset of your files). It will split the key file into increasingly large blocks, and for each block attempt the grep operation. The operations are timed - right now I'm timing each grep operation, as well as the total time to process all the sub-expressions. Output is in seconds - with some effort you can get ms, but with the problem you are having it's unlikely you need that granularity. Run the script in a terminal window with a command of the form

    ./timeScript keyFile textFile 100 > outputFile

    This will run the script, using keyFile as the file where the search keys are stored, and textFile as the file where you are looking for keys, and 100 as the initial block size. On each loop the block size will be doubled.
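    As a quick way to try it out, you could build a key file from any line-oriented source, for example (the dictionary path is an assumption; substitute any file with one pattern per line):

    head -n 10000 /usr/share/dict/words > keyFile
    ./timeScript keyFile textFile 100 > outputFile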

    In a second terminal, run the command

    tail -f outputFile

    which will let you follow the output that the other process is writing to outputFile

    I recommend that you open a third terminal window, and that you run top in that window. You will be able to see how much memory and CPU your process is taking - again, if you see vast amounts of memory consumed it will give you a hint that things are not going well.

    This should allow you to find out when things start to slow down - which is the answer to your question. I don't think there's a "magic number" - it probably depends on your machine, and in particular on the file size and the amount of memory you have.

    You could take the output of the script and put it through a grep:

    grep entire outputFile

    You will end up with just the summaries - block size, and time taken, e.g.

    Time for processing entire file with blocksize 800: 4 seconds

    If you plot these numbers against each other (or simply inspect the numbers), you will see when the algorithm is optimal, and when it slows down.
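    For example, assuming the summary lines keep the exact wording shown above, a small awk filter turns them into blocksize/seconds pairs you can feed to a plotting tool (times.dat is a made-up name):

    grep entire outputFile | awk '{gsub(":", "", $8); print $8, $9}' > times.dat

    Each line of times.dat is then a block size followed by the total time for that pass.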

    Here is the code: I did not do extensive error checking but it seemed to work for me. Obviously in your ultimate solution you need to do something with the outputs of grep (instead of piping it to wc -l, which I did just to see how many lines were matched); a sketch of that follows the script...

    #!/bin/bash
    # script to look at difference in timing
    # when grepping a file with a large number of expressions
    # assume first argument = name of file with list of expressions
    # second argument = name of file to check
    # optional third argument = initial block size (default 100)
    #
    # split f1 into chunks of 1, 2, 4, 8... expressions at a time
    # and print out how long it took to process all the lines in f2
    
    if (($# < 2 )); then
      echo "Warning: need at least two parameters."
      echo "Usage: timeScript keyFile searchFile [initial blocksize]"
      exit 1
    fi
    
    f1_linecount=$(wc -l < "$1")
    echo "linecount of file1 is $f1_linecount"
    
    f2_linecount=$(wc -l < "$2")
    echo "linecount of file2 is $f2_linecount"
    echo
    
    if (($# < 3 )); then
      blockLength=100
    else
      blockLength=$3
    fi
    
    while (($blockLength < f1_linecount))
    do
      echo Using blocks of $blockLength
      # split is a standard utility (not a shell builtin) that splits the file
      # -l tells it to break after $blockLength lines
      # and block$blockLength is the prefix for the output files
      split -l "$blockLength" "$1" "block$blockLength"
      Tstart="$(date +%s)"
      Tbefore=$Tstart
    
      for fn in block*
        do
          echo "grep -f $fn $2 | wc -l"
          echo number of lines matched: `grep -f $fn $2 | wc -l`
          Tnow="$(($(date +%s)))"
          echo Time taken: $(($Tnow - $Tbefore)) s
          Tbefore=$Tnow
        done
      echo "Time for processing entire file with blocksize $blockLength: $((Tnow - Tstart)) seconds"
      blockLength=$((2 * blockLength))
      # remove the split files - no longer needed
      rm block*
      echo "block length is now $blockLength and f1 linecount is $f1_linecount"
    done
    
    exit 0
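
    As a sketch of that last point (the file name matches.txt is made up, not part of the script): inside the for loop you could accumulate the matching lines instead of counting them, e.g.

    grep -f "$fn" "$2" >> matches.txt

    and deduplicate afterwards with sort -u matches.txt, since a line of the search file can be hit by patterns in more than one block.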
    
