Calculating a hash code for a large file in parallel

Block-based hashing (your method 2) is a well-known technique that is used in practice.

Just like what you're doing, such schemes take the list of block hashes and hash it again, down to a single short hash. Since this is a well-established practice, I would assume it is as secure as sequential hashing.
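Since no code was posted, here is a minimal sketch of that idea in Python. MD5, the 64 MiB block size, and the use of a process pool are illustrative assumptions, not part of the original answer:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MiB per block (illustrative choice)

def hash_block(path, offset):
    """Hash one fixed-size block of the file, starting at byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.md5(f.read(BLOCK_SIZE)).digest()

def parallel_hash(path):
    """Hash every block in parallel, then hash the list of block hashes."""
    size = os.path.getsize(path)
    offsets = list(range(0, size, BLOCK_SIZE))
    with ProcessPoolExecutor() as pool:
        block_hashes = pool.map(hash_block, [path] * len(offsets), offsets)
        top = hashlib.md5()
        for digest in block_hashes:  # results arrive in block order
            top.update(digest)
    return top.hexdigest()

if __name__ == "__main__":
    print(parallel_hash("bigfile.bin"))  # hypothetical file name
```

Each worker opens the file and reads only its own block, so no large buffers have to be shipped between processes.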

Some modern hash designs allow the hash to be computed in parallel. See "An Efficient Parallel Algorithm for Skein Hash Functions". If you are willing to use a new (and hence less thoroughly tested) hash algorithm, this may give you the speed increase you want on a multi-processor machine.

Skein reached the final stage of the NIST SHA-3 competition, so it is not completely untested.
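Skein itself isn't in the Python standard library, but BLAKE2b (derived from BLAKE, another SHA-3 finalist) is, and it exposes explicit tree-hashing parameters. The sketch below of a two-leaf tree hash is adapted from the hashlib documentation; all parameter values are chosen purely for illustration:

```python
from hashlib import blake2b

FANOUT, DEPTH = 2, 2   # two leaves feeding one root node
LEAF_SIZE = 4096       # bytes hashed by each leaf
INNER_SIZE = 64        # size of the intermediate (leaf) digests

buf = bytearray(6000)  # stand-in for the file contents

# Left leaf
h00 = blake2b(buf[0:LEAF_SIZE], fanout=FANOUT, depth=DEPTH,
              leaf_size=LEAF_SIZE, inner_size=INNER_SIZE,
              node_offset=0, node_depth=0, last_node=False)
# Right leaf
h01 = blake2b(buf[LEAF_SIZE:], fanout=FANOUT, depth=DEPTH,
              leaf_size=LEAF_SIZE, inner_size=INNER_SIZE,
              node_offset=1, node_depth=0, last_node=True)
# Root node combines the two leaf digests into the final value
h10 = blake2b(digest_size=32, fanout=FANOUT, depth=DEPTH,
              leaf_size=LEAF_SIZE, inner_size=INNER_SIZE,
              node_offset=0, node_depth=1, last_node=True)
h10.update(h00.digest())
h10.update(h01.digest())
print(h10.hexdigest())
```

The two leaf hashes are independent of each other, so they could be computed on separate cores before the root node combines them.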

I think block-based hashing would make it significantly easier for an attacker to find a collision, because the time required to generate a hash is a function of the size of the data being hashed. One of the great things about cryptographically secure hashes is that an attacker can't take your 100 GB file, find a spot they want to mutate, hash everything before and after that block, and then use those pre-computed values to quickly get the hash of the whole file after each small, quick permutation of the part they're interested in. This is because the hash state is chained across the entire input, so every block's output depends on everything hashed before it.

In short, if you edit the middle of the file, you still need to hash the whole file to get the final checksum. Thus it takes far longer to find a collision in a 100 GB file than in a 100-byte file. The exception is when the edit is garbage appended right at the end of the file, which is why that pattern is seen so frequently in 'in-the-wild' collision examples for large files.

However, if you break your original file up into blocks, the speed of an attack becomes a function of the smallest block (or of the size of the block the attacker wants to mutate). Since hashing time increases linearly with input size, a 100 GB file takes roughly 2000 seconds per permutation/MD5 attempt, while a 1 MB block would let an attacker try about 50 per second.
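For reference, here is the back-of-the-envelope arithmetic behind those figures, assuming the roughly 50 MB/s of MD5 throughput implied by "100 GB in about 2000 seconds":

```python
THROUGHPUT = 50e6                       # assumed MD5 rate in bytes/second
per_attempt_full = 100e9 / THROUGHPUT   # ~2000 s to rehash the 100 GB file
per_attempt_block = 1e6 / THROUGHPUT    # ~0.02 s to rehash a 1 MB block

print(f"whole file: {per_attempt_full:.0f} s per attempt")
print(f"1 MB block: {1 / per_attempt_block:.0f} attempts per second")
```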

One solution would be to break your file up into overlapping chunks and then MD5 those chunks individually. The resulting hash would be a concatenation of the chunk hashes in both start-to-end and end-to-start order. Now finding a collision requires the whole file to be hashed, albeit in a parallelizable way. A rough sketch follows.
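In this sketch the chunk size, the amount of overlap, and the final reduction of the concatenated digests to a single MD5 are my own assumptions:

```python
import hashlib

CHUNK = 64 * 1024 * 1024  # 64 MiB chunks (assumed)
OVERLAP = 1024 * 1024     # each chunk shares 1 MiB with its neighbour (assumed)

def overlapping_chunk_hash(path):
    """MD5 each overlapping chunk, then combine the digests forwards and backwards."""
    digests = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            f.seek(offset)
            data = f.read(CHUNK)
            if not data:
                break
            digests.append(hashlib.md5(data).digest())
            if len(data) < CHUNK:      # reached the end of the file
                break
            offset += CHUNK - OVERLAP  # step forward, keeping an overlap
    # Concatenate the chunk hashes start-to-end and end-to-start, then reduce
    # that concatenation to one short value (this final MD5 is my assumption).
    combined = b"".join(digests) + b"".join(reversed(digests))
    return hashlib.md5(combined).hexdigest()
```

The loop reads and hashes chunks sequentially for brevity; since each chunk is hashed independently, the per-chunk MD5s could be farmed out to worker processes exactly as in the first sketch.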
