Explanation for tiny reads (overlapped, buffered) outperforming large contiguous reads?

问题

(apologies for the somewhat lengthy intro)

During development of an application which prefaults an entire large file (>400MB) into the buffer cache for speeding up the actual run later, I tested whether reading 4MB at a time still had any noticeable benefits over reading only 1MB chunks at a time. Surprisingly, the smaller requests actually turned out to be faster. This seemed counter-intuitive, so I ran a more extensive test.

The buffer cache was purged before running the tests (just for laughs, I did one run with the file in the buffers, too. The buffer cache delivers upwards of 2GB/s regardless of request size, though with a surprising +/- 30% random variance).
All reads used overlapped ReadFile with the same target buffer (the handle was opened with FILE_FLAG_OVERLAPPED and without FILE_FLAG_NO_BUFFERING). The harddisk used is somewhat elderly but fully functional, NTFS has a cluster size of 8kB. The disk was defragmented after an initial run (6 fragments vs. unfragmented, zero difference). For better figures, I used a larger file too, below numbers are for reading 1GB.

The results were really surprising:

4MB x 256    : 5ms per request,    completion 25.8s @ ~40 MB/s
1MB x 1024   : 11.7ms per request, completion 23.3s @ ~43 MB/s
32kB x 32768 : 12.6ms per request, completion 15.5s @ ~66 MB/s
16kB x 65536 : 12.8ms per request, completion 13.5s @ ~75 MB/s

So, this suggests that submitting ten thousands of requests two clusters in length is actually better than submitting a few hundred large, contiguous reads. The submit time (time before ReadFile returns) does go up substantially as the number of requests goes up, but asynchronous completion time nearly halves.
Kernel CPU time is around 5-6% in every case (on a quadcore, so one should really say 20-30%) while the asynchronous reads are completing, which is a surprising amount of CPU -- apparently the OS does some non-neglegible amount of busy waiting, too. 30% CPU for 25 seconds at 2.6 GHz, that's quite a few cycles for doing "nothing".

Any idea how this can be explained? Maybe someone here has a deeper insight of the inner workings of Windows overlapped IO? Or, is there something substantially wrong with the idea that you can use ReadFile for reading a megabyte of data?

I can see how an IO scheduler would be able to optimize multiple requests by minimizing seeks, especially when requests are random access (which they aren't!). I can also see how a harddisk would be able to perform a similar optimization given a few requests in the NCQ.
However, we're talking about ridiculous numbers of ridiculously small requests -- which nevertheless outperform what appears to be sensible by a factor of 2.

Sidenote: The clear winner is memory mapping. I'm almost inclined to add "unsurprisingly" because I am a big fan of memory mapping, but in this case, it actually does surprise me, as the "requests" are even smaller and the OS should be even less able to predict and schedule the IO. I didn't test memory mapping at first because it seemed counter-intuitive that it might be able to compete even remotely. So much for your intuition, heh.

Mapping/unmapping a view repeatedly at different offsets takes practically zero time. Using a 16MB view and faulting every page with a simple for() loop reading a single byte per page completes in 9.2 secs @ ~111 MB/s. CPU usage is under 3% (one core) at all times. Same computer, same disk, same everything.

It also appears that Windows loads 8 pages into the buffer cache at a time, although only one page is actually created. Faulting every 8th page runs at the same speed and loads the same amount of data from disk, but shows lower "physical memory" and "system cache" metrics and only 1/8 of the page faults. Subsequent reads prove that the pages are nevertheless definitively in the buffer cache (no delay, no disk activity).

(Possibly very, very distantly related to Memory-Mapped File is Faster on Huge Sequential Read?)

To make it a bit more illustrative:

Update:

Using FILE_FLAG_SEQUENTIAL_SCAN seems to somewhat "balance" reads of 128k, improving performance by 100%. On the other hand, it severely impacts reads of 512k and 256k (you have to wonder why?) and has no real effect on anything else. The MB/s graph of the smaller blocks sizes arguably seems a little more "even", but there is no difference in runtime.

I may have found an explanation for smaller block sizes performing better, too. As you know, asynchronous requests may run synchronously if the OS can serve the request immediately, i.e. from the buffers (and for a variety of version-specific technical limitations).

When accounting for actual asynchronous vs. "immediate" asyncronous reads, one notices that upwards of 256k, Windows runs every asynchronous request asynchronously. The smaller the blocksize, the more requests are being served "immediately", even when they are not available immediately (i.e. ReadFile simply runs synchronously). I cannot make out a clear pattern (such as "the first 100 requests" or "more than 1000 requests"), but there seems to be an inverse correlation between request size and synchronicity. At a blocksize of 8k, every asynchronous request is served synchronously.
Buffered synchronous transfers are, for some reason, twice as fast as asynchronous transfers (no idea why), hence the smaller the request sizes, the faster the overall transfer, because more transfers are done synchronously.

For memory mapped prefaulting, FILE_FLAG_SEQUENTIAL_SCAN causes a slightly different shape of the performance graph (there is a "notch" which is moved a bit backwards), but the total time taken is exactly identical (again, this is surprising, but I can't help it).

Update 2:

Unbuffered IO makes the performance graphs for the 1M, 4M, and 512k request testcases somewhat higher and more "spiky" with maximums in the 90s of GB/s, but with harsh minumums too, the overall runtime for 1GB is within +/- 0.5s of the buffered run (the requests with smaller buffer sizes complete significantly faster, however, that is because with more than 2558 in-flight requests, ERROR_WORKING_SET_QUOTA is returned). Measured CPU usage is zero in all unbuffered cases, which is unsurprising, since any IO that happens runs via DMA.

Another very interesting observation with FILE_FLAG_NO_BUFFERING is that it significantly changes API behaviour. CancelIO does not work any more, at least not in a sense of cancelling IO. With unbuffered in-flight requests, CancelIO will simply block until all requests have finished. A lawyer would probably argue that the function cannot be held liable for neglecting its duty, because there are no more in-flight requests left when it returns, so in some way it has done what was asked -- but my understanding of "cancel" is somewhat different.
With buffered, overlapped IO, CancelIO will simply cut the rope, all in-flight operations terminate immediately, as one would expect.

Yet another funny thing is that the process is unkillable until all requests have finished or failed. This kind of makes sense if the OS is doing DMA into that address space, but it's a stunning "feature" nevertheless.

回答1:

I'm not a filesystem expert but I think there are a couple of things going on here. First off. w.r.t. your comment about memory mapping being the winner. This isn't totally surprising since the NT cache manager is based on memory mapping - by doing the memory mapping yourself, you're duplicating the cache manager's behavior without the additional memory copies.

When you read sequentially from the file, the cache manager attempts to pre-fetch the data for you - so it's likely that you are seeing the effect of readahead in the cache manager. At some point the cache manager stops prefetching reads (or rather at some point the prefetched data isn't sufficient to satisfy your reads and so the cache manager has to stall). That may account for the slowdown on larger I/Os that you're seeing.

Have you tried adding FILE_FLAG_SEQUENTIAL_SCAN to your CreateFile flags? That instructs the prefetcher to be even more aggressive.

This may be counter-intuitive, but traditionally the fastest way to read data off the disk is to use asynchronous I/O and FILE_FLAG_NO_BUFFERING. When you do that, the I/O goes directly from the disk driver into your I/O buffers with nothing to get in the way (assuming that the segments of the file are contiguous - if they're not, the filesystem will have to issue several disk reads to satisfy the application read request). Of course it also means that you lose the built-in prefetch logic and have to roll your own. But with FILE_FLAG_NO_BUFFERING you have complete control of your I/O pipeline.

One other thing to remember: When you're doing asynchronous I/O, it's important to ensure that you always have an I/O request oustanding - otherwise you lose potential time between when the last I/O completes and the next I/O is started.

来源：https://stackoverflow.com/questions/5909345/explanation-for-tiny-reads-overlapped-buffered-outperforming-large-contiguous

标签

windows

winapi

overlapped-io

buffered