Performance issues when looping an MD5 calculator on many files


How can I make my code as efficient as possible?

In two words: Profile It!

Get your code working, and then profile it while running on a typical set of input files. Use that to tell you where the performance hotspots are.

If I were doing this, I'd start with a single-threaded version and tune for that case. Then I'd gradually increase the number of threads to see how performance scales. Once you hit a "sweet spot", redo the profiling and see where the bottlenecks are now.


It is actually hard to predict where the performance bottlenecks will be. They will depend on things like average file size, the number of cores you have, the speed of your disks, the amount of memory available to the OS for read-ahead buffering, and which operating system you are using.

My gut feeling is that the number of threads is going to matter a lot. Too few, and the CPU sits idle waiting for the I/O system to fetch data from disk. Too many, and you use extra resources (like memory for thread stacks) with no real benefit. An application like this is likely to be I/O bound, and a large number of threads will not address that.


You commented thus:

The performance issues are purely memory. I'm pretty sure there's a problem with the way I'm creating the MD5 hash so that it wastes memory.

I can see nothing in the code you have provided that would use lots of memory, and nothing grossly wrong with the way you are generating the hashes. AFAICT, the only ways your code could lead to memory usage issues are if:

  • you have many, many threads all executing that code, or
  • you are keeping many, many hashes (and other things) in memory. (You don't show us what add is doing.)

But my advice is similar: use a memory profiler and diagnose this as if it were a storage leak, which, in a sense, it is!

Three things from taking a quick glance at your code:

  • You don't need to create a new MessageDigest every time you call the toMD5 method. One per thread should be sufficient.
  • You don't need to create a new byte[] buffer every time you call the toMD5 method. One per thread should be sufficient.
  • You might want to use javax.xml.bind.DatatypeConverter.printHexBinary(byte[]) for your hex conversion. It's faster.

You can address the first two bullets by using a ThreadLocal for each, as in the sketch below.
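A minimal sketch of what that might look like, assuming a toMD5 that reads from an InputStream (the original method body isn't shown here). One caveat: javax.xml.bind.DatatypeConverter ships with Java 6 through 10 but was removed from the JDK in Java 11.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5PerThread {

    // One MessageDigest per thread, created lazily and reused across calls.
    private static final ThreadLocal<MessageDigest> DIGEST =
            ThreadLocal.withInitial(() -> {
                try {
                    return MessageDigest.getInstance("MD5");
                } catch (NoSuchAlgorithmException e) {
                    throw new AssertionError("Every JVM must support MD5", e);
                }
            });

    // One 8 KB read buffer per thread, likewise reused.
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[8192]);

    public static String toMD5(InputStream in) throws IOException {
        MessageDigest md = DIGEST.get();
        md.reset(); // defensive: clears partial state if a previous call aborted mid-stream
        byte[] buf = BUFFER.get();
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        // digest() also resets the MessageDigest ready for the next call
        return javax.xml.bind.DatatypeConverter.printHexBinary(md.digest());
    }
}
```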

Any further optimization will probably have to come from concurrency. Have one thread read file contents, and dispatch those byte[] to different threads to actually compute the MD5 checksum.
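Here is a hedged sketch of that shape using an ExecutorService; hashAll and the result-handling comment are illustrative, not from the original code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReaderAndHashers {

    public static void hashAll(List<Path> files) throws IOException, InterruptedException {
        // CPU-bound hashing pool; the for-loop below acts as the single reader thread.
        ExecutorService hashPool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        for (Path file : files) {
            byte[] contents = Files.readAllBytes(file); // sequential disk I/O
            hashPool.submit(() -> {
                try {
                    byte[] hash = MessageDigest.getInstance("MD5").digest(contents);
                    // ... hand `hash` to whatever add(...) does in the original code
                } catch (NoSuchAlgorithmException e) {
                    throw new AssertionError(e);
                }
            });
        }
        hashPool.shutdown();
        hashPool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Note that an unbounded executor queue lets the reader outrun the hashers, so those byte[] arrays can pile up in memory, which is exactly the pressure described in the accepted fix below; a bounded work queue (e.g. a ThreadPoolExecutor with a CallerRunsPolicy) would throttle the reader.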

Use a much bigger buffer, at least 8192, or interpose a BufferedInputStream.
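For instance, a sketch of the read loop with an 8192-byte buffer (a BufferedInputStream of the same size would have a similar effect):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BiggerBuffer {

    public static byte[] md5Of(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Alternatively: new BufferedInputStream(new FileInputStream(path), 8192)
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[8192]; // far fewer read() calls than a tiny buffer
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }
}
```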

Thanks for the help, everyone. The problem was that the volume of data flowing through was so large that the GC couldn't keep up. The proof-of-concept fix was to add a Thread.sleep(1000) after every 200 photos. A full solution would be to tune the GC more aggressively and to compute the MD5 hashes in batches.
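For reference, the proof-of-concept loop might look like this; toMD5 and add stand in for the OP's (unshown) methods, and 200/1000 are the values mentioned above:

```java
// Assumes the OP's toMD5(Path) and add(Path, String) methods exist.
void hashWithPauses(List<Path> photos) throws Exception {
    int processed = 0;
    for (Path photo : photos) {
        add(photo, toMD5(photo));  // the OP's hashing and bookkeeping
        if (++processed % 200 == 0) {
            Thread.sleep(1000);    // crude pause so the GC can reclaim the last batch
        }
    }
}
```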
