Is there a faster way (than this) to calculate the hash of a file (using hashlib) in Python?

Posted by 删除回忆录丶 on 2019-12-07 13:38:40

Question


My current approach is this:

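import hashlib

# PATH is assumed to be defined elsewhere.
def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        # Read chunks of 1024 * block_size bytes until f.read() returns b''.
        for block in iter(lambda: f.read(1024*func.block_size), b''):
            func.update(block)
    return func.hexdigest()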

It takes about 3.5 seconds to calculate the MD5 sum of an 842 MB ISO file on an i5 @ 1.7 GHz. I have tried different methods of reading the file, but all of them are slower. Is there, perhaps, a faster solution?

EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hashing functions supported by hashlib is 64 (except for 'sha384' and 'sha512', for which the default block_size is 128). Therefore, for the functions with a block_size of 64, the read size is still the same (65536 bytes).
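The block sizes mentioned in this edit can be checked directly. The following is a minimal sketch, assuming a standard CPython hashlib build in which all five algorithms are available; the helper name `chunk_size_for` is hypothetical and only mirrors the `1024*func.block_size` expression from the question.

```python
import hashlib

def chunk_size_for(hash_type, multiplier=1024):
    """Return the read size in bytes: multiplier * the algorithm's block_size.

    Hypothetical helper, not part of hashlib.
    """
    return multiplier * getattr(hashlib, hash_type)().block_size

# Print the internal block size and the resulting read size per algorithm.
for name in ('md5', 'sha1', 'sha256', 'sha384', 'sha512'):
    print(name, getattr(hashlib, name)().block_size, chunk_size_for(name))
```

With a multiplier of 1024 this gives 65536 bytes for the algorithms whose block_size is 64, and 131072 bytes for 'sha384' and 'sha512'.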

EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(

EDIT(3): Apparently Windows was using the disk at +80% when I ran the function again. It really takes 3.5 seconds. Phew.

Another solution (about 0.5 seconds faster) is to use os.open():

import os
import hashlib

def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # Read-only access suffices for hashing; os.O_BINARY exists only on Windows.
    f = os.open(path, (os.O_RDONLY | os.O_BINARY))
    try:
        for block in iter(lambda: os.read(f, 2048*func.block_size), b''):
            func.update(block)
    finally:
        os.close(f)
    return func.hexdigest()

Note that these results are not final.


Answer 1:


Using an 874 MiB file of random data, which took 2 seconds to hash with the openssl md5 tool, I was able to improve speed as follows.

  • Using your first method required 21 seconds.
  • Reading the entire file into a buffer took 21 seconds, and then updating the hash took 2 seconds.
  • Using the following function with a buffer size of 8096 required 17 seconds.
  • Using the following function with a buffer size of 32767 required 11 seconds.
  • Using the following function with a buffer size of 65536 required 8 seconds.
  • Using the following function with a buffer size of 131072 required 8 seconds.
  • Using the following function with a buffer size of 1048576 required 12 seconds.

import hashlib
import time

def md5_speedcheck(path, size):
    pts = time.process_time()  # processor time at start
    ats = time.time()          # wall-clock time at start
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    print("{0:.3f} s".format(time.process_time() - pts))
    print("{0:.3f} s".format(time.time() - ats))

The times I noted above are wall-clock times. Processor time is about the same for all of them; the difference is time spent blocked on I/O.

The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
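To find the best buffer size on another machine, the comparison can be repeated in a loop. This is only a sketch under stated assumptions: the function names `time_md5` and `sweep` are hypothetical, the default sizes are simply the ones listed above, and `time.perf_counter()` is swapped in for `time.time()` as the wall-clock timer.

```python
import hashlib
import time

def time_md5(path, size):
    """Hash `path` with MD5 using `size`-byte reads.

    Returns (hexdigest, cpu_seconds, wall_seconds).
    """
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    m = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel stops when f.read() returns b'' at EOF.
        for block in iter(lambda: f.read(size), b''):
            m.update(block)
    return (m.hexdigest(),
            time.process_time() - cpu_start,
            time.perf_counter() - wall_start)

def sweep(path, sizes=(8096, 32767, 65536, 131072, 1048576)):
    """Time each buffer size once; return {size: (cpu_seconds, wall_seconds)}."""
    results = {}
    for size in sizes:
        _, cpu, wall = time_md5(path, size)
        results[size] = (cpu, wall)
    return results
```

Calling `sweep('some_large_file.iso')` (a placeholder path) returns one timing pair per buffer size. The operating system may cache the file after the first pass, so later sizes can look faster than they would on a cold read; treat a single sweep as indicative only.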



Source: https://stackoverflow.com/questions/22733826/is-there-a-faster-way-than-this-to-calculate-the-hash-of-a-file-using-hashlib
