What is the fastest way to create a hash function which will be used to check if two files are equal?
Security is not very important.
Edit: I am sending a fi
What we are optimizing here is time spent on a task. Unfortunately we do not know enough about the task at hand to know what the optimal solution should be.
Is it for one-time comparison of 2 arbitrary files? Then compare size, and after that simply compare the files, byte by byte (or mb by mb) if that's better for your IO.
If it is for 2 large sets of files, or many sets of files, and it is not a one-time exercise. but something that will happen frequently, then one should store hashes for each file. A hash is never unique, but a hash with a number of say 9 digits (32 bits) would be good for about 4 billion combination, and a 64 bit number would be good enough to distinguish between some 16 * 10^18 Quintillion different files.
A decent compromise would be to generate 2 32-bit hashes for each file, one for first 8k, another for 1MB+8k, slap them together as a single 64 bit number. Cataloging all existing files into a DB should be fairly quick, and looking up a candidate file against this DB should also be very quick. Once there is a match, the only way to determine if they are the same is to compare the whole files.
I am a believer in giving people what they need, which is not always never what they think they need, or what the want.