Question
I have over half a million files to hash across multiple folders. MD5/CRC hashing is taking too long; some files are 1 GB to 11 GB in size. I'm thinking of hashing just part of each file using head.
The command below works for finding and hashing everything:
find . -type f -exec sha1sum {} \;
I'm just not sure how to take this a step further and hash only, say, the first 256 kB of each file, e.g.
find . -type f -exec head -c 256kB | sha1sum
I'm not sure whether head is okay to use in this instance or whether dd would be better. The above command doesn't work, so I'm looking for ideas on how to do this.
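One shape I imagine this could take (untested; it assumes GNU head, which accepts size suffixes like 256kB) is wrapping the pipeline in sh -c so that -exec can run it:

find . -type f -exec sh -c 'head -c 256kB "$1" | sha1sum' _ {} \;

but that would still print "-" instead of the file name, which brings me to the output format.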
I would like the output to match what native md5sum produces, i.e. in the format below (going to a text file):
<Hash> <file name>
I'm not sure whether this is possible in a single line or whether a for/do loop is needed. Performance is key; I'm using bash on RHEL 6.
Answer 1:
It is unclear where your limitation is. Do you have a slow disk or a slow CPU?
If your disk is not the limitation, you are probably limited by using a single core. GNU Parallel can help with that:
find . -type f | parallel -X sha256sum
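Collecting that into a text file in the requested <Hash> <file name> format is then just a redirect; for example (hashes.txt is an arbitrary name, not from the answer):

find . -type f | parallel -X sha256sum > hashes.txt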
If the limitation is disk I/O, then your idea of head makes perfect sense (the example below reads the last 1 MB with tail, but the principle is the same):
sha() {
    # Hash only the last 1 MB of the file; perl replaces sha256sum's "-" with the file name
    tail -c 1M "$1" | sha256sum | perl -pe 'BEGIN{$a=shift} s/-/$a/' "$1"
}
export -f sha
# --tag prefixes each output line with the file name that parallel passed to sha
find . -type f -print0 | parallel -0 -j10 --tag sha
The optimal value for -j depends on your disk system, so try adjusting it until you find the best setting (which can be as low as -j1).
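If you would rather hash the first 256 kB, as in the question, and collect the results in a file, here is a minimal sketch along the same lines. Swapping tail for head and sha256sum for sha1sum, dropping --tag so each line is plain <Hash> <file name>, and writing to an arbitrary file hashes.txt are assumptions for illustration, not part of the original answer:

sha() {
    # Hash only the first 256 kB; perl replaces sha1sum's "-" with the file name
    head -c 256kB "$1" | sha1sum | perl -pe 'BEGIN{$a=shift} s/-/$a/' "$1"
}
export -f sha
find . -type f -print0 | parallel -0 -j10 sha > hashes.txt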
Source: https://stackoverflow.com/questions/28817057/md5-sha1-hashing-large-files