algorithm to find duplicates

Submitted by 假如想象 on 2019-12-07 09:45:55

Question


Are there any famous algorithms to efficiently find duplicates?

For example, suppose I have thousands of photos, each with a unique name, but duplicates could still exist in different sub-folders. Is using std::map or some other hash map a good idea?


Answer 1:


If you're dealing with files, one idea is to first check each file's length, and then generate a hash only for the files that have the same size.

Then just compare the files' hashes. If they're the same, you've got a duplicate file.

There's a tradeoff between speed and certainty: different files can, however unlikely, end up with the same hash (a collision). So you can improve the solution: generate a simple, fast hash to find candidate duplicates. When the hashes differ, the files are different. When they're equal, generate a second hash. If the second hash differs, you just had a false positive. If it's equal again, you most likely have a real duplicate.

In other words:

1. Get every file's size.
2. For each file, check whether another file has the same size.
3. If any do, generate a fast hash for those files.
4. Compare the hashes. If they differ, ignore the pair.
5. If they're equal, generate a second hash and compare again.
6. If the second hashes differ, ignore the pair; if they're equal, you have two identical files.

Hashing every file up front would take too much time, and it would be wasted work if most of your files have different sizes anyway.
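
A minimal C++17 sketch of that pipeline, assuming std::filesystem is available: it groups files by size, then hashes only the same-size groups, with a simple FNV-1a hash standing in for whatever fast hash you prefer. The second, confirming hash from the steps above is omitted for brevity; a byte-by-byte comparison of the survivors would serve the same purpose.

    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    namespace fs = std::filesystem;

    // FNV-1a over a file's bytes -- a stand-in for any fast, non-cryptographic hash.
    std::uint64_t hashFile(const fs::path& p) {
        std::ifstream in(p, std::ios::binary);
        std::uint64_t h = 14695981039346656037ull;
        char buf[4096];
        while (in) {
            in.read(buf, sizeof buf);
            for (std::streamsize i = 0; i < in.gcount(); ++i) {
                h ^= static_cast<unsigned char>(buf[i]);
                h *= 1099511628211ull;
            }
        }
        return h;
    }

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: dupfind <dir>\n"; return 1; }

        // Step 1: group files by size; only same-size files can be duplicates.
        std::unordered_map<std::uintmax_t, std::vector<fs::path>> bySize;
        for (const auto& entry : fs::recursive_directory_iterator(argv[1]))
            if (entry.is_regular_file())
                bySize[entry.file_size()].push_back(entry.path());

        // Step 2: hash only the files that share a size, then group by hash.
        for (const auto& [size, paths] : bySize) {
            if (paths.size() < 2) continue;         // unique size -> no duplicate
            std::unordered_map<std::uint64_t, std::vector<fs::path>> byHash;
            for (const auto& p : paths)
                byHash[hashFile(p)].push_back(p);
            for (const auto& [h, group] : byHash)
                if (group.size() > 1) {             // same size and same fast hash
                    std::cout << "possible duplicates:\n";
                    for (const auto& p : group)
                        std::cout << "  " << p << '\n';
                }
        }
    }

Note that this only ever hashes files whose size matches another file's, which is exactly the point of the size pre-filter.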




Answer 2:


Perhaps you want to hash each object and store the hashes in some sort of table? To test for duplicates, you just do a quick lookup in the table.

Mystery data structure???

As for a "famous algorithm" to accomplish this task, take a look at MD5.
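
A minimal sketch of that table idea, with invented toy data: std::hash here is only a placeholder for a real content hash such as MD5, and in practice you would hash the file's bytes rather than an in-memory string.

    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    int main() {
        // Toy "photos": path -> content. Real code would hash the file bytes.
        std::vector<std::pair<std::string, std::string>> photos = {
            {"a/cat.jpg", "same bytes"},
            {"b/cat.jpg", "same bytes"},      // duplicate content, other folder
            {"a/dog.jpg", "different bytes"},
        };

        // Table from content hash to the first path seen with that hash.
        std::unordered_map<std::size_t, std::string> seen;
        for (const auto& [path, content] : photos) {
            std::size_t h = std::hash<std::string>{}(content);
            auto [it, inserted] = seen.emplace(h, path);
            if (!inserted)                    // hash already in the table
                std::cout << path << " looks like a duplicate of "
                          << it->second << '\n';
        }
    }

Each lookup and insert is an average O(1) operation, so checking thousands of photos stays fast; a matching hash flags a likely duplicate that you can then confirm byte-by-byte.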



Source: https://stackoverflow.com/questions/6507272/algorithm-to-find-duplicates
