algorithm to find duplicates

Submitted by 假如想象 on 2019-12-07 09:45:55

Question


Are there any famous algorithms to efficiently find duplicates?

For example, suppose I have thousands of photos, each with a unique name, but duplicates could still exist in different sub-folders. Is using std::map or some other hash map a good idea?


Answer 1:


If you're dealing with files, one idea is to first check each file's length, and then generate a hash only for the files that have the same size.

Then just compare the files' hashes. If they're the same, you've got a duplicate file.

There's a tradeoff between speed and certainty: different files can, however unlikely, end up with the same hash (a collision). So you can improve the solution: generate a simple, fast hash to find candidate duplicates. When the hashes differ, the files are different. When they're equal, generate a second hash. If the second hash differs, you just had a false positive. If it's equal again, you most likely have a real duplicate.

In other words:

1. Get every file's size.
2. For each file, check whether another file has the same size.
3. If any do, generate a fast hash for those files.
4. Compare the hashes. If they differ, ignore the pair.
5. If they're equal, generate a second hash and compare again.
6. If the second hashes differ, ignore the pair; if they're equal, you have two identical files.

Hashing every file up front would take too much time, and it would be wasted work if most of your files have different sizes anyway.
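
A minimal C++17 sketch of that pipeline, assuming std::filesystem is available: it groups files by size, then hashes only the same-size groups, with a simple FNV-1a hash standing in for whatever fast hash you prefer. The second, confirming hash from the steps above is omitted for brevity; a byte-by-byte comparison of the survivors would serve the same purpose.

    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    namespace fs = std::filesystem;

    // FNV-1a over a file's bytes -- a stand-in for any fast, non-cryptographic hash.
    std::uint64_t hashFile(const fs::path& p) {
        std::ifstream in(p, std::ios::binary);
        std::uint64_t h = 14695981039346656037ull;
        char buf[4096];
        while (in) {
            in.read(buf, sizeof buf);
            for (std::streamsize i = 0; i < in.gcount(); ++i) {
                h ^= static_cast<unsigned char>(buf[i]);
                h *= 1099511628211ull;
            }
        }
        return h;
    }

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: dupfind <dir>\n"; return 1; }

        // Step 1: group files by size; only same-size files can be duplicates.
        std::unordered_map<std::uintmax_t, std::vector<fs::path>> bySize;
        for (const auto& entry : fs::recursive_directory_iterator(argv[1]))
            if (entry.is_regular_file())
                bySize[entry.file_size()].push_back(entry.path());

        // Step 2: hash only the files that share a size, then group by hash.
        for (const auto& [size, paths] : bySize) {
            if (paths.size() < 2) continue;         // unique size -> no duplicate
            std::unordered_map<std::uint64_t, std::vector<fs::path>> byHash;
            for (const auto& p : paths)
                byHash[hashFile(p)].push_back(p);
            for (const auto& [h, group] : byHash)
                if (group.size() > 1) {             // same size and same fast hash
                    std::cout << "possible duplicates:\n";
                    for (const auto& p : group)
                        std::cout << "  " << p << '\n';
                }
        }
    }

Note that this only ever hashes files whose size matches another file's, which is exactly the point of the size pre-filter.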




Answer 2:


Perhaps you want to hash each object and store the hashes in some sort of table? To test for duplicates, you just do a quick lookup in the table.

Mystery data structure???

As for a "famous algorithm" to accomplish this task, take a look at MD5.
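
A minimal sketch of that table idea, with invented toy data: std::hash here is only a placeholder for a real content hash such as MD5, and in practice you would hash the file's bytes rather than an in-memory string.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
        // Toy "photos": path -> content. Real code would hash the file bytes.
        std::vector<std::pair<std::string, std::string>> photos = {
            {"a/cat.jpg", "same bytes"},
            {"b/cat.jpg", "same bytes"},      // duplicate content, other folder
            {"a/dog.jpg", "different bytes"},
        };

        // Table from content hash to the first path seen with that hash.
        std::unordered_map<std::size_t, std::string> seen;
        for (const auto& [path, content] : photos) {
            std::size_t h = std::hash<std::string>{}(content);
            auto [it, inserted] = seen.emplace(h, path);
            if (!inserted)                    // hash already in the table
                std::cout << path << " looks like a duplicate of "
                          << it->second << '\n';
        }
    }

Each lookup and insert is an average O(1) operation, so checking thousands of photos stays fast; a matching hash flags a likely duplicate that you can then confirm byte-by-byte.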



Source: https://stackoverflow.com/questions/6507272/algorithm-to-find-duplicates
