Performance issues when looping an MD5 calculator on many files


How can I make my code as efficient as possible?

In two words: Profile It!

Get your code working, and then profile it while running on a typical set of input files. Use that to tell you where the performance hotspots are.

If I were doing this, I'd start with a single-threaded version and tune for that case. Then I'd gradually increase the number of threads to see how performance scales. Once you hit a "sweet spot", redo the profiling and see where the bottlenecks are now.


It is actually hard to predict where the performance bottlenecks will be. They will depend on things like average file size, the number of cores you have, the speed of your disks, the amount of memory available to the OS for read-ahead buffering, and which operating system you are using.

My gut feeling is that the number of threads is going to matter a lot. Too few, and the CPU sits idle waiting for the I/O system to fetch data from disk. Too many, and you use extra resources (like memory for thread stacks) with no real benefit. An application like this is likely to be I/O bound, and a large number of threads will not address that.


You commented thus:

The performance issues are purely memory. I'm pretty sure there's a problem with the way I'm creating the MD5 hash so that it wastes memory.

I can see nothing in the code you have provided that would use lots of memory, and nothing grossly wrong with the way you are generating the hashes. AFAICT, the only ways your code could lead to memory usage issues are if:

  • you have many, many threads all executing that code, or
  • you are keeping many, many hashes (and other things) in memory. (You don't show us what add is doing.)

But my advice is similar: use a memory profiler and diagnose this as if it were a storage leak, which, in a sense, it is!

Three things from taking a quick glance at your code:

  • You don't need to create a new MessageDigest every time you call the toMD5 method. One per thread should be sufficient.
  • You don't need to create a new byte[] buffer every time you call the toMD5 method. One per thread should be sufficient.
  • You might want to use javax.xml.bind.DatatypeConverter.printHexBinary(byte[]) for your hex conversion. It's faster.

You can address the first two bullets by using a ThreadLocal for each, as in the sketch below.
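A minimal sketch of what that might look like, assuming a toMD5 that reads from an InputStream (the original method body isn't shown here). One caveat: javax.xml.bind.DatatypeConverter ships with Java 6 through 10 but was removed from the JDK in Java 11.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5PerThread {

    // One MessageDigest per thread, created lazily and reused across calls.
    private static final ThreadLocal<MessageDigest> DIGEST =
            ThreadLocal.withInitial(() -> {
                try {
                    return MessageDigest.getInstance("MD5");
                } catch (NoSuchAlgorithmException e) {
                    throw new AssertionError("Every JVM must support MD5", e);
                }
            });

    // One 8 KB read buffer per thread, likewise reused.
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[8192]);

    public static String toMD5(InputStream in) throws IOException {
        MessageDigest md = DIGEST.get();
        md.reset(); // defensive: clears partial state if a previous call aborted mid-stream
        byte[] buf = BUFFER.get();
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        // digest() also resets the MessageDigest ready for the next call
        return javax.xml.bind.DatatypeConverter.printHexBinary(md.digest());
    }
}
```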

Any further optimization will probably have to come from concurrency. Have one thread read file contents, and dispatch those byte[] to different threads to actually compute the MD5 checksum.
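Here is a hedged sketch of that shape using an ExecutorService; hashAll and the result-handling comment are illustrative, not from the original code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReaderAndHashers {

    public static void hashAll(List<Path> files) throws IOException, InterruptedException {
        // CPU-bound hashing pool; the for-loop below acts as the single reader thread.
        ExecutorService hashPool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        for (Path file : files) {
            byte[] contents = Files.readAllBytes(file); // sequential disk I/O
            hashPool.submit(() -> {
                try {
                    byte[] hash = MessageDigest.getInstance("MD5").digest(contents);
                    // ... hand `hash` to whatever add(...) does in the original code
                } catch (NoSuchAlgorithmException e) {
                    throw new AssertionError(e);
                }
            });
        }
        hashPool.shutdown();
        hashPool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

Note that an unbounded executor queue lets the reader outrun the hashers, so those byte[] arrays can pile up in memory, which is exactly the pressure described in the accepted fix below; a bounded work queue (e.g. a ThreadPoolExecutor with a CallerRunsPolicy) would throttle the reader.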

Use a much bigger buffer, at least 8192, or interpose a BufferedInputStream.
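For instance, a sketch of the read loop with an 8192-byte buffer (a BufferedInputStream of the same size would have a similar effect):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BiggerBuffer {

    public static byte[] md5Of(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Alternatively: new BufferedInputStream(new FileInputStream(path), 8192)
        try (InputStream in = new FileInputStream(path)) {
            byte[] buf = new byte[8192]; // far fewer read() calls than a tiny buffer
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }
}
```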

Thanks for the help, everyone. The problem was that the volume of data flowing through was so large that the GC couldn't keep up. The proof-of-concept fix was to add a Thread.sleep(1000) after every 200 photos. A full solution would be to tune the GC more aggressively and to compute the MD5 hashes in batches.
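For reference, the proof-of-concept loop might look like this; toMD5 and add stand in for the OP's (unshown) methods, and 200/1000 are the values mentioned above:

```java
// Assumes the OP's toMD5(Path) and add(Path, String) methods exist.
void hashWithPauses(List<Path> photos) throws Exception {
    int processed = 0;
    for (Path photo : photos) {
        add(photo, toMD5(photo));  // the OP's hashing and bookkeeping
        if (++processed % 200 == 0) {
            Thread.sleep(1000);    // crude pause so the GC can reclaim the last batch
        }
    }
}
```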
