Scan for duplicate documents with MD5

Submitted by 孤者浪人 on 2019-12-04 18:49:02
guillaume girod-vitouchkina

WHOLE PROCESS:

Your goal is to detect (and perhaps store information about) duplicate files.

1 First, iterate through the directories and files.

See:

list all files from directories and subdirectories in Java
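
As a rough sketch of step 1, assuming Java 8 or later and the java.nio.file API (the class and method names here are made up for illustration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FileLister {
    // Returns every regular file under root, descending into subdirectories.
    static List<Path> listFiles(Path root) throws IOException {
        // Files.walk keeps directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }
}
```

Directories themselves are filtered out, so only files reach the hashing step.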

2 Then, for each file, load it into a byte array.

See:

Reading a binary input stream into a single byte array in Java
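
For files that fit comfortably in memory, step 2 can be as small as the sketch below (FileBytes and load are hypothetical names; Files.readAllBytes is the standard JDK call):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class FileBytes {
    // Loads the whole file into memory; only suitable for files that fit in the heap.
    static byte[] load(Path file) throws IOException {
        return Files.readAllBytes(file);
    }
}
```

For very large files it is probably better to feed the hash function in chunks from an InputStream instead of holding everything in one array.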

3 Then compute the file's MD5 (the core of your project).

4 Finally, store this information.
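
If you are allowed to use the JDK's built-in digest instead of your own implementation, or want a reference to check your own results against, a minimal sketch looks like this (Md5Hex is a hypothetical name):

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Hex {
    // Returns the MD5 of data as a 32-character lowercase hex string.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(data); // always 16 bytes for MD5
        // Treat the digest as an unsigned number and left-pad with zeros to 32 hex digits.
        return String.format("%032x", new BigInteger(1, digest));
    }
}
```

The test suite in RFC 1321 gives d41d8cd98f00b204e9800998ecf8427e for the empty input and 900150983cd24fb0d6963f7d28e17f72 for "abc", which is handy for validating a hand-written implementation.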

You can use a Set to detect duplicates (a Set holds unique elements).

Set<String> files_hash = new HashSet<>(); // each String is the hex representation of an MD5
if (files_hash.contains(my_md5)) { /* you already have this file */ }

or a

Map<String,String> file_and_hash = new HashMap<>(); // each entry is file => hash
// to know whether you already have a hash you must iterate over the values, or also keep a Set
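
With the Set approach, the check and the insert can be folded into one call, since Set.add returns false when the element is already present. A small sketch (SeenHashes and isDuplicate are names invented for this example):

```java
import java.util.HashSet;
import java.util.Set;

final class SeenHashes {
    private final Set<String> hashes = new HashSet<>();

    // Records the hash and returns true if it had already been recorded,
    // i.e. the file it belongs to is a likely duplicate.
    boolean isDuplicate(String md5) {
        // Set.add returns false when the element was already in the set.
        return !hashes.add(md5);
    }
}
```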

ANSWER for MD5:

Read the algorithm: https://en.wikipedia.org/wiki/MD5

RFC: https://www.ietf.org/rfc/rfc1321.txt

Some googling turns up more:

A step-by-step presentation: http://infohost.nmt.edu/~sfs/Students/HarleyKozushko/Presentations/MD5.pdf

Or try porting an existing C (or Java) implementation.

OVERALL STRATEGY

To save time and make the process faster, also think about how your function will be used:

  • If you use it once, for a single file, reduce the work by first selecting only the other files of the same size.

  • If you use it regularly (and want it to be fast), scan new files regularly in the background to keep a hash database up to date. Detecting new files is straightforward.

  • If you want to find all duplicated files, it is better to scan everything, and to use the Set strategy as well.
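
The size-based preselection can be sketched as follows: files with a unique size cannot have a duplicate, so only files that share a size need to be hashed. SizePrefilter and candidates are hypothetical names; Files.size is the standard JDK call.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SizePrefilter {
    // Returns only the files that share their size with at least one other file.
    static List<Path> candidates(List<Path> files) throws IOException {
        Map<Long, List<Path>> bySize = new HashMap<>();
        for (Path f : files) {
            bySize.computeIfAbsent(Files.size(f), k -> new ArrayList<>()).add(f);
        }
        List<Path> result = new ArrayList<>();
        for (List<Path> group : bySize.values()) {
            if (group.size() > 1) {
                result.addAll(group); // same size: worth hashing
            }
        }
        return result;
    }
}
```

Reading a file's size is much cheaper than reading its content, so this step usually pays off when most files have distinct sizes.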

Hope this helps

JimmyB

You'll want to recursively scan for files, then, for each file found, calculate its MD5 or whatever and store that hash value, either in a Set<...> if you only want to know if a file is a dupe, or in a Map<..., File> if you want to be able to tell which file the current file is a duplicate of.

For each file's hash, you look into the collection of already known hashes to check if that particular hash value is in it; if it is, you (most likely) have a duplicate file; if it is not, you add the new hash value to the collection and proceed with the next file.
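
A sketch of the Map variant, which also tells you which earlier file each duplicate matches. For brevity it takes file names and precomputed hashes as Strings rather than File objects; DuplicateIndex and findDuplicates are names made up for this example.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class DuplicateIndex {
    // Input: file name -> hash, in scan order.
    // Output: duplicate file name -> first file seen with the same hash.
    static Map<String, String> findDuplicates(Map<String, String> fileToHash) {
        Map<String, String> firstByHash = new HashMap<>();
        Map<String, String> duplicates = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fileToHash.entrySet()) {
            // putIfAbsent returns the existing file for this hash, or null if the hash is new.
            String original = firstByHash.putIfAbsent(e.getValue(), e.getKey());
            if (original != null) {
                duplicates.put(e.getKey(), original);
            }
        }
        return duplicates;
    }
}
```

Because equal hashes only make a duplicate most likely, a byte-by-byte comparison of the two files can be added where certainty matters.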
