Max limit of bytes in the update method of Python's hashlib module

Question


I am trying to compute the MD5 hash of a file with the function hashlib.md5() from the hashlib module.

So I wrote this piece of code:

import hashlib

Buffer = 128
f = open("c:\\file.tct", "rb")
m = hashlib.md5()

while True:
    p = f.read(Buffer)   # read the file in Buffer-sized pieces
    if len(p) != 0:
        m.update(p)      # feed each piece to the running hash
    else:
        break
print(m.hexdigest())
f.close()

I noticed that update() gets faster as I increase the Buffer value to 64, 128, 256 and so on. Is there an upper limit I cannot exceed? I suppose it might just be a RAM limitation, but I don't know.


Answer 1:


Big (≈2**40) chunk sizes lead to MemoryError, i.e., there is no limit other than available RAM. On the other hand, bufsize is limited to 2**31-1 on my machine:

import hashlib
from functools import partial

def md5(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        # iter() with a sentinel calls f.read(chunksize) until it returns b''
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m
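
For example, to print a file's hex digest with this helper (reusing the path from the question):

print(md5("c:\\file.tct").hexdigest())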

A big chunksize can be as slow as a very small one. Measure it.

For ≈10 MB files, I find that a 2**15 chunksize is the fastest among the files I've tested.
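
One way to measure it, assuming the md5() helper above and a hypothetical test file large_file.bin (timings will vary with your machine and the OS cache):

import timeit

for exp in range(10, 21):  # try chunk sizes 2**10 .. 2**20
    chunksize = 2 ** exp
    t = timeit.timeit(lambda: md5('large_file.bin', chunksize), number=3)
    print('chunksize=2**%d: %.3f s' % (exp, t))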




Answer 2:


To be able to handle arbitrarily large files, you need to read them in blocks. The size of such blocks should preferably be a power of 2, and in the case of MD5 the minimum possible block is 64 bytes (512 bits), since 512-bit blocks are the units on which the algorithm operates.

But if we go beyond that and try to establish an exact criterion for whether, say, a 2048-byte block is better than a 4096-byte block... we will likely fail. This would need to be carefully tested and measured, and in practice the value is almost always chosen from experience.
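
You can read these constants straight off a hash object via the standard hashlib attributes:

import hashlib

h = hashlib.md5()
print(h.block_size)   # 64 bytes = 512 bits, the algorithm's internal block size
print(h.digest_size)  # 16 bytes = 128 bits, the size of the final digest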




Answer 3:


The buffer value is the number of bytes that are read and stored in memory at once, so yes, the only limit is your available memory.

However, bigger values are not automatically faster. At some point, you might run into memory paging issues or other slowdowns with memory allocation if the buffer is too large. You should experiment with larger and larger values until you hit diminishing returns in speed.
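
As a rough sketch of that trade-off (large_file.bin is a placeholder name): hashing a whole file in one call holds its entire contents in RAM, while chunked updates keep memory usage near the chunk size and yield the same digest:

import hashlib

# One-shot read: peak memory usage is roughly the file size.
with open('large_file.bin', 'rb') as f:
    whole = hashlib.md5(f.read()).hexdigest()

# Chunked updates: peak memory usage is roughly the chunk size.
m = hashlib.md5()
with open('large_file.bin', 'rb') as f:
    for chunk in iter(lambda: f.read(2 ** 20), b''):
        m.update(chunk)

assert m.hexdigest() == whole  # identical digest either way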



Source: https://stackoverflow.com/questions/4949162/max-limit-of-bytes-in-method-update-of-hashlib-python-module
