algorithm to find duplicates

Submitted by 丶灬走出姿态 on 2019-12-05 18:02:30

If you're dealing with files, one idea is to first check each file's length, and then generate a hash only for the files that share a size.

Then just compare those files' hashes. If they're the same, you've got a duplicate file.

There's a tradeoff between speed and certainty: however unlikely, different files can end up with the same hash (a collision). So you can refine the approach: generate a simple, fast hash to find candidate duplicates. When the fast hashes differ, the files are different. When they're equal, generate a second, stronger hash. If the second hashes differ, you just had a false positive. If they're equal again, you almost certainly have a real duplicate.

In other words (a Python sketch of the whole pipeline follows the list):

1. Generate the file sizes.
2. For each file, check whether any other file has the same size.
3. If so, generate a fast hash for those files.
4. Compare the fast hashes. If they differ, ignore the pair.
5. If they're equal, generate a second hash and compare.
6. If the second hashes differ, ignore the pair; it was a false positive.
7. If they're equal again, you very likely have two identical files.
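Here is a minimal sketch of that pipeline in Python. The function name `find_duplicates`, and the choice of CRC32 over the first 4 KB as the fast hash and SHA-256 as the confirming hash, are my own assumptions, not anything prescribed above:

```python
import hashlib
import os
import zlib
from collections import defaultdict

def fast_hash(path, chunk_size=4096):
    # Cheap first-pass hash: CRC32 of just the first chunk.
    with open(path, "rb") as f:
        return zlib.crc32(f.read(chunk_size))

def full_hash(path, chunk_size=65536):
    # Confirming hash: SHA-256 over the whole file, read in chunks.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    # Step 1: group files by size; a file with a unique size has no duplicate.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: within a same-size group, group again by the fast hash.
        by_fast = defaultdict(list)
        for path in paths:
            by_fast[fast_hash(path)].append(path)
        # Step 3: confirm the surviving candidates with the full hash.
        for candidates in by_fast.values():
            if len(candidates) < 2:
                continue
            by_full = defaultdict(list)
            for path in candidates:
                by_full[full_hash(path)].append(path)
            duplicates.extend(g for g in by_full.values() if len(g) > 1)
    return duplicates

print(find_duplicates("."))  # groups of paths believed to be identical files
```

Each stage only does the more expensive work on files that survived the cheaper check, which is the whole point of the tradeoff described above.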

Hashing every file up front would take too much time, and would be wasted work if most of your files are different anyway.

Perhaps you want to hash each object and store the hashes in some sort of table? To test for duplicates, you just do a quick lookup in the table.

Mystery data structure???
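A minimal sketch of that idea in Python, assuming the objects are strings; the "mystery" table below is just a dict keyed by each object's hash:

```python
import hashlib

def find_dup_objects(items):
    seen = {}   # hash -> first object seen with that hash
    dups = []
    for item in items:
        key = hashlib.sha256(item.encode("utf-8")).hexdigest()
        if key in seen:
            dups.append((seen[key], item))  # candidate duplicate pair
        else:
            seen[key] = item
    return dups

print(find_dup_objects(["a", "b", "a"]))  # [('a', 'a')]
```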

As for a "famous algorithm" to accomplish this task, take a look at MD5.
