Scan duplicate document with md5

坚强是说给别人听的谎言 submitted on 2019-12-14 00:28:48

Question


For certain reasons I can't use MessageDigest.getInstance("MD5"), so I must write the algorithm code manually. My project scans for duplicate documents (*.doc, *.txt, *.pdf) on an Android device. My question is: what must I write before the algorithm so that it scans for duplicate documents in MY ROOT directory on the Android device? Without selecting a directory: when I press the scan button, the process begins and the ListView shows the results. Can anyone help me? My project deadline is coming. Thank you so much.

public class MD5 {

    // What must I write here so that I can scan the Android root
    // directory for duplicate documents using an MD5 hash?

    // MD5 MANUAL ALGORITHM CODE
}

Answer 1:


WHOLE PROCESS:

Your goal is to detect (and perhaps store information about) duplicate files.

1. First, iterate through directories and files;

see this:

list all files from directories and subdirectories in Java
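For step 1, a minimal recursive listing with plain java.io.File could look like this (the class name FileScanner is just for illustration; on Android, the starting directory you pass in, e.g. the external-storage root, is up to you):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FileScanner {
    // Collect every regular file under 'dir', descending into subdirectories.
    public static List<File> listFiles(File dir) {
        List<File> result = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if dir is missing or unreadable
        if (entries == null) return result;
        for (File entry : entries) {
            if (entry.isDirectory()) {
                result.addAll(listFiles(entry)); // recurse into subdirectory
            } else {
                result.add(entry);
            }
        }
        return result;
    }
}
```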

2. For each file, load its contents as a byte array;

see this:

Reading a binary input stream into a single byte array in Java
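Step 2 might be sketched as follows, with a plain FileInputStream read loop that also works on older Android API levels (the class name FileBytes is illustrative):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class FileBytes {
    // Read the entire file into a byte array (assumes the file fits in memory).
    public static byte[] readAll(File f) throws IOException {
        byte[] data = new byte[(int) f.length()];
        try (FileInputStream in = new FileInputStream(f)) {
            int off = 0;
            while (off < data.length) {
                int n = in.read(data, off, data.length - off);
                if (n < 0) throw new IOException("unexpected end of file");
                off += n; // read() may return fewer bytes than requested
            }
        }
        return data;
    }
}
```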

3. Then compute its MD5 hash (your part of the project);

4. And store this information.

You can use a Set to detect duplicates (a Set holds only unique elements):

Set<String> files_hash = new HashSet<>(); // each String is a hex representation of an MD5 digest
if (files_hash.contains(my_md5)) { /* you already have this file */ }

or a

Map<String, String> file_and_hash = new HashMap<>(); // maps file path => hash
// containsValue(my_md5) scans every entry, so also keep a Set for fast lookup

ANSWER for MD5:

read algorithm: https://en.wikipedia.org/wiki/MD5

RFC: https://www.ietf.org/rfc/rfc1321.txt

or, from some googling, this step-by-step presentation: http://infohost.nmt.edu/~sfs/Students/HarleyKozushko/Presentations/MD5.pdf

or port an existing C (or Java) implementation.
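As one possible sketch of a manual implementation following the RFC 1321 pseudocode (unoptimized, and it buffers the whole message in memory; for large files you would feed the data in incrementally), something like this could work:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MD5 {
    // Per-round left-rotation amounts (4 per round, cycled within each round).
    private static final int[] SHIFT = {7, 12, 17, 22, 5, 9, 14, 20,
                                        4, 11, 16, 23, 6, 10, 15, 21};
    // K[i] = floor(2^32 * abs(sin(i + 1))), as specified in RFC 1321.
    private static final int[] K = new int[64];
    static {
        for (int i = 0; i < 64; i++)
            K[i] = (int) (long) ((1L << 32) * Math.abs(Math.sin(i + 1)));
    }

    public static byte[] digest(byte[] msg) {
        // Pad: one 0x80 byte, zeros up to 56 mod 64, then the bit length (little-endian).
        int padLen = ((56 - (msg.length + 1) % 64) + 64) % 64;
        byte[] data = new byte[msg.length + 1 + padLen + 8];
        System.arraycopy(msg, 0, data, 0, msg.length);
        data[msg.length] = (byte) 0x80;
        long bitLen = (long) msg.length * 8;
        for (int i = 0; i < 8; i++)
            data[data.length - 8 + i] = (byte) (bitLen >>> (8 * i));

        int a0 = 0x67452301, b0 = 0xefcdab89, c0 = 0x98badcfe, d0 = 0x10325476;

        for (int chunk = 0; chunk < data.length; chunk += 64) {
            int[] M = new int[16];
            for (int j = 0; j < 16; j++)
                M[j] = ByteBuffer.wrap(data, chunk + j * 4, 4)
                                 .order(ByteOrder.LITTLE_ENDIAN).getInt();
            int A = a0, B = b0, C = c0, D = d0;
            for (int i = 0; i < 64; i++) {
                int F, g;
                if (i < 16)      { F = (B & C) | (~B & D); g = i; }
                else if (i < 32) { F = (D & B) | (~D & C); g = (5 * i + 1) % 16; }
                else if (i < 48) { F = B ^ C ^ D;          g = (3 * i + 5) % 16; }
                else             { F = C ^ (B | ~D);       g = (7 * i) % 16; }
                F += A + K[i] + M[g];
                A = D; D = C; C = B;
                B += Integer.rotateLeft(F, SHIFT[(i / 16) * 4 + i % 4]);
            }
            a0 += A; b0 += B; c0 += C; d0 += D;
        }
        // The digest is the four state words, serialized little-endian.
        ByteBuffer out = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        out.putInt(a0).putInt(b0).putInt(c0).putInt(d0);
        return out.array();
    }

    public static String hex(byte[] d) {
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }
}
```

Check it against the RFC 1321 test vectors: MD5.hex(MD5.digest("abc".getBytes())) should be 900150983cd24fb0d6963f7d28e17f72.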

OVERALL STRATEGY

To save time and make the process faster, also think about how your function will be used:

  • if you run it once, for a single file, reduce the work by first filtering the other files on their size (only same-size files can be duplicates).

  • if you run it regularly (and want it fast), scan new files in the background to keep a hash base up to date. Detecting new files is straightforward.

  • if you want to find all duplicated files, scan everything, and use the Set strategy as well.
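The size pre-filter mentioned above could be sketched like this: group files by byte length first, since only groups sharing a size can contain duplicates and need hashing (SizePrefilter is an illustrative name):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SizePrefilter {
    // Return only the groups of files that share a byte length;
    // files with a unique size cannot have a duplicate, so skip hashing them.
    public static Collection<List<File>> candidateGroups(List<File> files) {
        Map<Long, List<File>> bySize = new HashMap<>();
        for (File f : files)
            bySize.computeIfAbsent(f.length(), k -> new ArrayList<>()).add(f);
        List<List<File>> groups = new ArrayList<>();
        for (List<File> g : bySize.values())
            if (g.size() > 1) groups.add(g); // only ambiguous sizes need MD5
        return groups;
    }
}
```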

Hope this helps




Answer 2:


You'll want to recursively scan for files, then, for each file found, calculate its MD5 or whatever and store that hash value, either in a Set<...> if you only want to know if a file is a dupe, or in a Map<..., File> if you want to be able to tell which file the current file is a duplicate of.

For each file's hash, you look into the collection of already known hashes to check if that particular hash value is in it; if it is, you (most likely) have a duplicate file; if it is not, you add the new hash value to the collection and proceed with the next file.
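The check described here could look like the following generic sketch (findDuplicates and the stand-in hash function are illustrative; in the real scanner you would pass file contents and your manual MD5):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DuplicateFinder {
    // Map each duplicate item to the first item seen with the same digest.
    // 'items' stands in for files; hashFn stands in for the MD5-of-contents step.
    public static <T> Map<T, T> findDuplicates(List<T> items, Function<T, String> hashFn) {
        Map<String, T> firstSeen = new HashMap<>();  // digest -> first item
        Map<T, T> duplicates = new LinkedHashMap<>();
        for (T item : items) {
            String digest = hashFn.apply(item);
            T original = firstSeen.get(digest);
            if (original != null) {
                duplicates.put(item, original);      // item duplicates 'original'
            } else {
                firstSeen.put(digest, item);         // new hash, remember it
            }
        }
        return duplicates;
    }
}
```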



Source: https://stackoverflow.com/questions/34333653/scan-duplicate-document-with-md5
