Scan duplicate document with md5

坚强是说给别人听的谎言 submitted on 2019-12-14 00:28:48

Question


For certain reasons I can't use MessageDigest.getInstance("MD5"), so I must write the algorithm code manually. My project scans for duplicate documents (*.doc, *.txt, *.pdf) on an Android device. My question is: what must I write before the algorithm so that it scans for duplicate documents in MY ROOT directory on the Android device? Without selecting a directory: when I press the scan button, the process begins and the ListView shows the results. Can anyone help me? My project deadline is coming. Thank you so much.

public class MD5 {

    // What must I write here so that I can scan the Android root
    // directory for duplicate documents using an MD5 hash?

    // MD5 MANUAL ALGORITHM CODE
}

Answer 1:


WHOLE PROCESS:

Your goal is to detect (and perhaps store information about) duplicate files.

1. First, iterate through directories and files;

see this:

list all files from directories and subdirectories in Java
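For step 1, a minimal recursive listing with plain java.io.File could look like this (the class name FileScanner is just for illustration; on Android, the starting directory you pass in, e.g. the external-storage root, is up to you):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class FileScanner {
    // Collect every regular file under 'dir', descending into subdirectories.
    public static List<File> listFiles(File dir) {
        List<File> result = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if dir is missing or unreadable
        if (entries == null) return result;
        for (File entry : entries) {
            if (entry.isDirectory()) {
                result.addAll(listFiles(entry)); // recurse into subdirectory
            } else {
                result.add(entry);
            }
        }
        return result;
    }
}
```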

2. For each file, load its contents as a byte array;

see this:

Reading a binary input stream into a single byte array in Java
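Step 2 might be sketched as follows, with a plain FileInputStream read loop that also works on older Android API levels (the class name FileBytes is illustrative):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class FileBytes {
    // Read the entire file into a byte array (assumes the file fits in memory).
    public static byte[] readAll(File f) throws IOException {
        byte[] data = new byte[(int) f.length()];
        try (FileInputStream in = new FileInputStream(f)) {
            int off = 0;
            while (off < data.length) {
                int n = in.read(data, off, data.length - off);
                if (n < 0) throw new IOException("unexpected end of file");
                off += n; // read() may return fewer bytes than requested
            }
        }
        return data;
    }
}
```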

3. Then compute its MD5 hash (your part of the project);

4. And store this information.

You can use a Set to detect duplicates (a Set holds only unique elements):

Set<String> files_hash = new HashSet<>(); // each String is a hex representation of an MD5 digest
if (files_hash.contains(my_md5)) { /* you already have this file */ }

or a

Map<String, String> file_and_hash = new HashMap<>(); // maps file path => hash
// containsValue(my_md5) scans every entry, so also keep a Set for fast lookup

ANSWER for MD5:

read algorithm: https://en.wikipedia.org/wiki/MD5

RFC: https://www.ietf.org/rfc/rfc1321.txt

or, from some googling, this step-by-step presentation: http://infohost.nmt.edu/~sfs/Students/HarleyKozushko/Presentations/MD5.pdf

or port an existing C (or Java) implementation.
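As one possible sketch of a manual implementation following the RFC 1321 pseudocode (unoptimized, and it buffers the whole message in memory; for large files you would feed the data in incrementally), something like this could work:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MD5 {
    // Per-round left-rotation amounts (4 per round, cycled within each round).
    private static final int[] SHIFT = {7, 12, 17, 22, 5, 9, 14, 20,
                                        4, 11, 16, 23, 6, 10, 15, 21};
    // K[i] = floor(2^32 * abs(sin(i + 1))), as specified in RFC 1321.
    private static final int[] K = new int[64];
    static {
        for (int i = 0; i < 64; i++)
            K[i] = (int) (long) ((1L << 32) * Math.abs(Math.sin(i + 1)));
    }

    public static byte[] digest(byte[] msg) {
        // Pad: one 0x80 byte, zeros up to 56 mod 64, then the bit length (little-endian).
        int padLen = ((56 - (msg.length + 1) % 64) + 64) % 64;
        byte[] data = new byte[msg.length + 1 + padLen + 8];
        System.arraycopy(msg, 0, data, 0, msg.length);
        data[msg.length] = (byte) 0x80;
        long bitLen = (long) msg.length * 8;
        for (int i = 0; i < 8; i++)
            data[data.length - 8 + i] = (byte) (bitLen >>> (8 * i));

        int a0 = 0x67452301, b0 = 0xefcdab89, c0 = 0x98badcfe, d0 = 0x10325476;

        for (int chunk = 0; chunk < data.length; chunk += 64) {
            int[] M = new int[16];
            for (int j = 0; j < 16; j++)
                M[j] = ByteBuffer.wrap(data, chunk + j * 4, 4)
                                 .order(ByteOrder.LITTLE_ENDIAN).getInt();
            int A = a0, B = b0, C = c0, D = d0;
            for (int i = 0; i < 64; i++) {
                int F, g;
                if (i < 16)      { F = (B & C) | (~B & D); g = i; }
                else if (i < 32) { F = (D & B) | (~D & C); g = (5 * i + 1) % 16; }
                else if (i < 48) { F = B ^ C ^ D;          g = (3 * i + 5) % 16; }
                else             { F = C ^ (B | ~D);       g = (7 * i) % 16; }
                F += A + K[i] + M[g];
                A = D; D = C; C = B;
                B += Integer.rotateLeft(F, SHIFT[(i / 16) * 4 + i % 4]);
            }
            a0 += A; b0 += B; c0 += C; d0 += D;
        }
        // The digest is the four state words, serialized little-endian.
        ByteBuffer out = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        out.putInt(a0).putInt(b0).putInt(c0).putInt(d0);
        return out.array();
    }

    public static String hex(byte[] d) {
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }
}
```

Check it against the RFC 1321 test vectors: MD5.hex(MD5.digest("abc".getBytes())) should be 900150983cd24fb0d6963f7d28e17f72.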

OVERALL STRATEGY

To save time and make the process faster, also think about how your function will be used:

  • if you run it once, for a single file, reduce the work by first filtering the other files on their size (only same-size files can be duplicates).

  • if you run it regularly (and want it fast), scan new files in the background to keep a hash base up to date. Detecting new files is straightforward.

  • if you want to find all duplicated files, scan everything, and use the Set strategy as well.
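The size pre-filter mentioned above could be sketched like this: group files by byte length first, since only groups sharing a size can contain duplicates and need hashing (SizePrefilter is an illustrative name):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SizePrefilter {
    // Return only the groups of files that share a byte length;
    // files with a unique size cannot have a duplicate, so skip hashing them.
    public static Collection<List<File>> candidateGroups(List<File> files) {
        Map<Long, List<File>> bySize = new HashMap<>();
        for (File f : files)
            bySize.computeIfAbsent(f.length(), k -> new ArrayList<>()).add(f);
        List<List<File>> groups = new ArrayList<>();
        for (List<File> g : bySize.values())
            if (g.size() > 1) groups.add(g); // only ambiguous sizes need MD5
        return groups;
    }
}
```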

Hope this helps




Answer 2:


You'll want to recursively scan for files, then, for each file found, calculate its MD5 or whatever and store that hash value, either in a Set<...> if you only want to know if a file is a dupe, or in a Map<..., File> if you want to be able to tell which file the current file is a duplicate of.

For each file's hash, you look into the collection of already known hashes to check if that particular hash value is in it; if it is, you (most likely) have a duplicate file; if it is not, you add the new hash value to the collection and proceed with the next file.
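The check described here could look like the following generic sketch (findDuplicates and the stand-in hash function are illustrative; in the real scanner you would pass file contents and your manual MD5):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DuplicateFinder {
    // Map each duplicate item to the first item seen with the same digest.
    // 'items' stands in for files; hashFn stands in for the MD5-of-contents step.
    public static <T> Map<T, T> findDuplicates(List<T> items, Function<T, String> hashFn) {
        Map<String, T> firstSeen = new HashMap<>();  // digest -> first item
        Map<T, T> duplicates = new LinkedHashMap<>();
        for (T item : items) {
            String digest = hashFn.apply(item);
            T original = firstSeen.get(digest);
            if (original != null) {
                duplicates.put(item, original);      // item duplicates 'original'
            } else {
                firstSeen.put(digest, item);         // new hash, remember it
            }
        }
        return duplicates;
    }
}
```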



Source: https://stackoverflow.com/questions/34333653/scan-duplicate-document-with-md5
