algorithm to find duplicates

Submitted by 丶灬走出姿态 on 2019-12-05 18:02:30

If you're dealing with files, one idea is to first check each file's length, and then generate a hash only for the files that share a size.

Then just compare those files' hashes. If they're the same, you've got a duplicate file.

There's a tradeoff between speed and certainty: however unlikely, different files can end up with the same hash (a collision). So you can refine the approach: generate a simple, fast hash to find candidate duplicates. When the fast hashes differ, the files are different. When they're equal, generate a second, stronger hash. If the second hashes differ, you just had a false positive. If they're equal again, you almost certainly have a real duplicate.

In other words (a Python sketch of the whole pipeline follows the list):

1. Generate the file sizes.
2. For each file, check whether any other file has the same size.
3. If so, generate a fast hash for those files.
4. Compare the fast hashes. If they differ, ignore the pair.
5. If they're equal, generate a second hash and compare.
6. If the second hashes differ, ignore the pair; it was a false positive.
7. If they're equal again, you very likely have two identical files.
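Here is a minimal sketch of that pipeline in Python. The function name `find_duplicates`, and the choice of CRC32 over the first 4 KB as the fast hash and SHA-256 as the confirming hash, are my own assumptions, not anything prescribed above:

```python
import hashlib
import os
import zlib
from collections import defaultdict

def fast_hash(path, chunk_size=4096):
    # Cheap first-pass hash: CRC32 of just the first chunk.
    with open(path, "rb") as f:
        return zlib.crc32(f.read(chunk_size))

def full_hash(path, chunk_size=65536):
    # Confirming hash: SHA-256 over the whole file, read in chunks.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    # Step 1: group files by size; a file with a unique size has no duplicate.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: within a same-size group, group again by the fast hash.
        by_fast = defaultdict(list)
        for path in paths:
            by_fast[fast_hash(path)].append(path)
        # Step 3: confirm the surviving candidates with the full hash.
        for candidates in by_fast.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for path in candidates:
                by_full[full_hash(path)].append(path)
            duplicates.extend(g for g in by_full.values() if len(g) > 1)
    return duplicates

print(find_duplicates("."))  # groups of paths believed to be identical files
```

Each stage only does the more expensive work on files that survived the cheaper check, which is the whole point of the tradeoff described above.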

Hashing every file up front would take too much time, and would be wasted work if most of your files are different anyway.

Perhaps you want to hash each object and store the hashes in some sort of table? To test for duplicates, you just do a quick lookup in the table.

Mystery data structure???
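A minimal sketch of that idea in Python, assuming the objects are strings; the "mystery" table below is just a dict keyed by each object's hash:

```python
import hashlib

def find_dup_objects(items):
    seen = {}   # hash -> first object seen with that hash
    dups = []
    for item in items:
        key = hashlib.sha256(item.encode("utf-8")).hexdigest()
        if key in seen:
            dups.append((seen[key], item))  # candidate duplicate pair
        else:
            seen[key] = item
    return dups

print(find_dup_objects(["a", "b", "a"]))  # [('a', 'a')]
```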

As for a "famous algorithm" to accomplish this task, take a look at MD5.
