DigestInputStream -> compute the hash without slowdown when the consumer part is the bottleneck

Submitted by 試著忘記壹切 on 2019-12-11 20:15:30

Question


I have an application that needs to transfer files to a service like S3.

I have an InputStream for that incoming file (not necessarily a FileInputStream). I write this InputStream to a multipart request body, represented by an OutputStream, and then I need to write the hash of the file at the end (also through the request body).

Thanks to DigestInputStream, I'm able to compute the hash on the fly: once the file body has been sent to the OutputStream, the hash becomes available and can be appended to the multipart request.
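For illustration, here is a minimal sketch of that pattern (the names copyAndHash, source and requestBody are mine, not from the original code, and the buffer size is arbitrary):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Streams the body to the request while updating the digest as a side
    // effect of each read(); the hash only becomes available after the copy.
    static byte[] copyAndHash(InputStream source, OutputStream requestBody)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream in = new DigestInputStream(source, md)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                requestBody.write(buffer, 0, n); // hashing happens inline with the write loop
            }
        }
        return md.digest(); // now appendable to the multipart request
    }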

You can check this related question: "What is the less expensive hash algorithm?"

And particularly my own benchmark answer: https://stackoverflow.com/a/19160508/82609

So it seems my own computer is capable of hashing with a MessageDigest at a throughput of 500MB/s for MD5, and nearly 200MB/s for SHA-512.
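(A rough way to reproduce that kind of figure, assuming a single-threaded loop over a reusable buffer; this is a crude measurement, not a proper benchmark harness such as JMH:)

    import java.security.MessageDigest;

    // Crude throughput estimate: hash `totalMB` megabytes of zeros and
    // report MB/s. JIT warm-up and GC noise are ignored here.
    static double hashThroughputMBps(String algorithm, int totalMB) throws Exception {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] chunk = new byte[1024 * 1024]; // 1MB chunk, contents irrelevant to cost
        long start = System.nanoTime();
        for (int i = 0; i < totalMB; i++) {
            md.update(chunk);
        }
        md.digest();
        return totalMB / ((System.nanoTime() - start) / 1e9);
    }

    // e.g. compare hashThroughputMBps("MD5", 1024) vs hashThroughputMBps("SHA-512", 1024)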

The connection to which I write the request body has a throughput of 100MB/s. If I write to the OutputStream faster than that, the OutputStream starts to block (this is intentional: we want to keep a low memory footprint and do not want bytes to accumulate in some part of the application).


I have done tests and I can clearly see the impact of the hash algorithm on the performance of my application.

I tried to upload 20 files of 50MB each (1GB total).

  • With MD5, it takes ~16sec
  • With SHA-512, it takes ~22sec

When doing a single upload, I can also see a slowdown of the same order.

So in the end there is no parallelisation between computing the hash and writing to the connection; these steps are done sequentially:

  • Request bytes from the stream
  • Hash the requested bytes
  • Send the bytes

So, since the hashing throughput is higher than the connection throughput, is there an easy way to avoid that slowdown? Does it require additional threads?

I think the next chunk of data could be read and hashed while the previous chunk is being written to the connection, right?
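One possible sketch of that idea, under my own assumptions (method name and chunk size are made up): hand each chunk to a single hashing thread and let the main thread block on the network write, so the two overlap. The single-threaded executor keeps update() calls in order, and waiting on the previous Future bounds memory to one in-flight chunk:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hash chunk N on a worker thread while the main thread blocks on
    // writing chunk N to the connection. At most one chunk is in flight,
    // so the memory footprint stays bounded.
    static byte[] copyAndHashPipelined(InputStream source, OutputStream requestBody)
            throws IOException, NoSuchAlgorithmException,
                   InterruptedException, ExecutionException {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        ExecutorService hasher = Executors.newSingleThreadExecutor();
        try {
            byte[] buffer = new byte[64 * 1024];
            Future<?> pending = null;
            int n;
            while ((n = source.read(buffer)) != -1) {
                if (pending != null) {
                    pending.get(); // wait for the previous hash before queuing another chunk
                }
                byte[] chunk = Arrays.copyOf(buffer, n); // the hasher needs its own copy
                pending = hasher.submit(() -> md.update(chunk));
                requestBody.write(chunk, 0, n); // blocks on the network while the hash runs
            }
            if (pending != null) {
                pending.get();
            }
            return md.digest(); // safe: Future.get() establishes happens-before
        } finally {
            hasher.shutdown();
        }
    }

Whether this actually helps depends on the write() really blocking long enough to hide the hashing cost; if the connection is the bottleneck, as described above, it should.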

This is not a premature optimization: we need to upload a lot of documents, and the execution time matters for our business.

Source: https://stackoverflow.com/questions/19177045/digestinputstream-compute-the-hash-without-slowdown-when-the-consumer-part-is
