问题
I am using ByteBuffer.allocateDirect() to allocate some buffer memory for reading a file into memory and then eventually hashing that files bytes and getting a file hash (SHA) out of it. The input files range greatly in size, anywhere from a few KB's to several GB's.
I have read several threads and pages (even some on SO) regarding selecting a buffer size. Some advised trying to select one that the native FileSystem uses in an attempt to minimalize chances of a read operation for a partial block,etc. Such as buffer of 4100 bytes and NTFS defaults to 4096, so the extra 4 bits would require a separate read operation, being extremely wasteful.
So sticking with the powers of 2, 1024, 2048, 4096, 8192, etc. I have seen some recommend buffers the size of 32KB's, and other recommend making the buffer the size of the input file (probably fine for small files, but what about large files?).
How important is it to stick to native block sized buffers? Modernly speaking (assuming modern SATA drive or better with at least 8Mb of on drive cache, and other modern OS "magic" to optimize I/O) how critical is the buffer size and how should I best determine what size to set mine to? I could statically set it, or dynamically determine it? Thanks for any insight.
回答1:
To answer your direct question: (1) filesystems tend to use powers of 2, so you want to do the same. (2) the larger your working buffer, the less effect any mis-sizing will have.
As you say, if you allocate 4100 and the actual block size is 4096, you'll need two reads to fill the buffer. If, instead, you have a 1,000,000 byte buffer, then being one block high or low doesn't matter (because it takes 245 4096-byte blocks to fill that buffer). Moreover, the larger buffer means that the OS has a better chance to order the reads.
That said, I wouldn't use NIO for this. Instead, I'd use a simple BufferedInputStream, with maybe a 1k buffer for my read()s.
The main benefit of NIO is keeping data out of the Java heap. If you're reading and writing a file, for example, using an InputStream means that the OS reads the data into a JVM-managed buffer, the JVM copies that into an on-heap buffer, then copies it again to an off-heap buffer, then the OS reads that off-heap buffer to write the actual disk blocks (and typically adds its own buffers). In this case, NIO will eliminate that native-heap copies.
However, to compute a hash, you need to have the data in the Java heap, and the Mac SPI will move it there. So you don't get the benefit of NBI keeping the data off-heap, and IMO the "old IO" is easier to write.
Just don't forget that InputStream.read() is not guaranteed to read all the bytes you ask for.
来源:https://stackoverflow.com/questions/16067199/determining-appropriate-buffer-size