Question:
Are there any famous algorithms to efficiently find duplicates?
For example, suppose I have thousands of photos, each with a unique file name, but duplicate content could exist across different sub-folders. Is using std::map or some other hash map a good idea?
Answer 1:
If you're dealing with files, one idea is to first check each file's length, and then generate a hash only for the files that have the same size.
Then just compare those files' hashes. If they're the same, you've got a duplicate file.
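A minimal sketch of that size-grouping pass, assuming C++17's `<filesystem>` and a hypothetical `photos` directory as the root; only buckets holding more than one file need any hashing at all:

```cpp
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// First pass: bucket every regular file under `root` by its size in bytes.
// Only buckets holding more than one path can possibly contain duplicates.
std::unordered_map<std::uintmax_t, std::vector<fs::path>>
group_by_size(const fs::path& root) {
    std::unordered_map<std::uintmax_t, std::vector<fs::path>> buckets;
    for (const auto& entry : fs::recursive_directory_iterator(root))
        if (entry.is_regular_file())
            buckets[entry.file_size()].push_back(entry.path());
    return buckets;
}

int main() {
    for (const auto& [size, paths] : group_by_size("photos"))  // hypothetical root
        if (paths.size() > 1) {
            std::cout << paths.size() << " files of " << size << " bytes:\n";
            for (const auto& p : paths) std::cout << "  " << p << '\n';
        }
}
```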
There's a tradeoff between speed and certainty: different files can, however unlikely, end up with the same hash. So you can refine the approach: generate a simple, fast hash to find candidate duplicates. When the hashes differ, the files are different. When they're equal, generate a second, stronger hash. If the second hashes differ, you just had a false positive. If they're equal again, you very probably have a real duplicate (see the sketch after the list below).
In other words:
1. Generate the file sizes.
2. For each file, check whether any other file has the same size.
3. If some do, generate a fast hash for just those files.
4. Compare the hashes:
   - If they differ, ignore the pair.
   - If they're equal, generate a second hash and compare again:
     - If the second hashes differ, ignore; it was a false positive.
     - If they're equal too, you have two identical files.
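A hedged sketch of steps 3 and 4, using std::hash over the first 4 KiB as the fast hash and std::hash over the whole file as the second; in practice you might substitute a cryptographic digest for the second pass, and a byte-for-byte comparison remains the only absolutely certain test:

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Fast hash (step 3): only the first 4 KiB, so large files are rejected cheaply.
std::size_t quick_hash(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    std::string head(4096, '\0');
    in.read(head.data(), static_cast<std::streamsize>(head.size()));
    head.resize(static_cast<std::size_t>(in.gcount()));
    return std::hash<std::string>{}(head);
}

// Second hash (step 4): the whole file, computed only for surviving candidates.
std::size_t full_hash(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    std::string all((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());
    return std::hash<std::string>{}(all);
}

// Two same-sized files are reported identical only if both hashes agree.
bool probably_identical(const fs::path& a, const fs::path& b) {
    if (quick_hash(a) != quick_hash(b)) return false;  // cheap rejection
    return full_hash(a) == full_hash(b);               // confirmation
}
```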
Hashing every file up front would take too long and be wasted effort if most of your files are different.
Answer 2:
Perhaps you want to hash each object and store the hashes in some sort of table? To test for duplicates, you just do a quick lookup in the table.
Mystery data structure???
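A minimal sketch of such a table, assuming the objects are loaded as std::string: the hash is the key, and an equality check on the stored objects rules out hash collisions:

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Table from hash -> objects already seen with that hash. A matching hash is
// only a candidate; the equality test confirms a true duplicate.
bool seen_before(std::unordered_multimap<std::size_t, std::string>& table,
                 const std::string& obj) {
    const std::size_t h = std::hash<std::string>{}(obj);
    auto [first, last] = table.equal_range(h);
    for (auto it = first; it != last; ++it)
        if (it->second == obj) return true;  // same hash and same contents
    table.emplace(h, obj);                   // record the first occurrence
    return false;
}

int main() {
    std::unordered_multimap<std::size_t, std::string> table;
    for (const std::string s : {"cat.jpg", "dog.jpg", "cat.jpg"})
        std::cout << s << (seen_before(table, s) ? ": duplicate\n" : ": new\n");
}
```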
As for a "famous algorithm" to accomplish this task, take a look at MD5.
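MD5 isn't in the C++ standard library; one common implementation is OpenSSL's EVP interface (link with -lcrypto). A sketch, with the caveat that MD5 is adequate for spotting accidental duplicates but is no longer considered collision-resistant against deliberate attack:

```cpp
#include <openssl/evp.h>

#include <cstddef>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

// Hex-encoded MD5 digest of a file, streamed in 4 KiB chunks.
std::string md5_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_md5(), nullptr);

    char buf[4096];
    while (in.read(buf, sizeof buf) || in.gcount() > 0)
        EVP_DigestUpdate(ctx, buf, static_cast<std::size_t>(in.gcount()));

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    EVP_DigestFinal_ex(ctx, digest, &len);
    EVP_MD_CTX_free(ctx);

    std::ostringstream hex;
    for (unsigned int i = 0; i < len; ++i)
        hex << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<int>(digest[i]);
    return hex.str();
}
```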
Source: https://stackoverflow.com/questions/6507272/algorithm-to-find-duplicates