I\'m developing a back-end application for a search system. The search system copies files to a temporary directory and gives them random names. Then it passes the temporary
Equal hash means equal file, unless someone malicious is messing around with your files and injecting collisions. (this could be the case if they are downloading stuff from the internet) If that is the case go for a SHA2 based function.
There are no accidental MD5 collisions, 1,47x10-29 is a really really really small number.
To overcome the issue of rehashing big files I would have a 3 phased identity scheme.
So if you see a file with a new size you know for certain you do not have a duplicate. And so on.