asynchronous IO io_submit latency in Ubuntu Linux


Linux AIO (sometimes referred to as KAIO) is something of a black art where experienced practitioners know the gotchas but for some reason it's taboo to talk to someone about the gotchas they don't already know. From scratching around on the web and experience I've come up with a few examples where Linux's asynchronous I/O submission may become (silently) synchronous (thereby turning io_submit() into a blocking call):

  1. You're submitting buffered (aka non-direct) I/O. You're at the mercy of Linux's caching and your submit can go synchronous when
    • What you're requesting isn't already in the "read cache"
    • The "write cache" is full and the new request can't be accepted until some existing writeback has been completed
  2. You're asking for direct I/O to a file in a filesystem but for whatever reason the filesystem decides to ignore the O_DIRECT "hint" (e.g. how you submitted the I/O didn't meet O_DIRECT alignment constraints, or the filesystem or that particular filesystem's configuration doesn't support O_DIRECT) and silently performs buffered I/O instead, resulting in the case above (a sketch below the list shows how to satisfy these constraints).
  3. You're doing direct I/O to a file in a filesystem but the filesystem has to do a synchronous operation (such as reading/updating metadata) in order to fulfill the I/O. Some filesystems such as XFS try harder to provide good AIO behaviour in comparison to others but even there a user still has to be very careful so as to avoid operations that will trigger synchrony.
  4. You're submitting too much outstanding I/O. Your disk/disk controller will have a maximum number of I/O requests that can be processed at the same time, and there are maximum AIO request queue sizes for each specific device within the kernel (see the /sys/block/[disk]/queue/nr_requests documentation and the un(der)documented /sys/block/[disk]/device/queue_depth). Letting I/O requests back up and exceed the size of the kernel queues leads to blocking.
    • If you submit I/Os that are "too large" (i.e. bigger than /sys/block/[disk]/queue/max_sectors_kb) they will be split up within the kernel and go on to chew up more than one request...
    • The system global maximum number of concurrent AIO requests (see the /proc/sys/fs/aio-max-nr documentation) can also have an impact but the result will be seen in io_setup() rather than io_submit()
  5. A layer in the Linux block device stack between you and the submission to the disk has to block. For example, things like Linux software RAID (md) can make I/O requests passing through it stall while it updates its RAID 1 metadata on individual disks.
  6. Your submission causes the kernel to wait because:
    • It needs to take a particular lock that is in use.
    • It needs to allocate some extra memory or page something in.

The list above is not exhaustive.
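
To make cases 1 and 2 above concrete, here is a minimal sketch of a libaio read that tries to stay asynchronous: the file is opened with O_DIRECT and the buffer, length and offset are all 4096-byte aligned. It assumes libaio is installed (link with -laio), that 4096 bytes satisfies the device's alignment constraints, and that the path /tmp/testfile is purely illustrative; error handling is abbreviated.

    /* Sketch: O_DIRECT read via libaio with a 4096-byte-aligned buffer, length and
       offset. Assumes libaio is available (link with -laio) and that 4096 bytes
       satisfies the device's alignment constraints; the path is illustrative. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;                     /* a multiple of the alignment */
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, len)) return 1;   /* aligned buffer */

        io_context_t ctx = 0;
        int ret = io_setup(1, &ctx);                 /* returns negative errno on failure */
        if (ret < 0) { fprintf(stderr, "io_setup: %d\n", ret); return 1; }

        struct iocb cb;
        struct iocb *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, len, 0);         /* aligned offset (0) */

        ret = io_submit(ctx, 1, cbs);
        if (ret != 1) { fprintf(stderr, "io_submit: %d\n", ret); return 1; }

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);          /* wait for the completion */
        printf("res = %ld\n", (long)ev.res);         /* bytes read, or negative errno */

        io_destroy(ctx);
        free(buf);
        close(fd);
        return 0;
    }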

With modern kernels there's a bit more visibility of blocking thanks to the introduction of the RWF_NOWAIT flag in the 4.14 kernel. Some of the blocking scenarios above (e.g. using buffered I/O and trying to read data not yet in the page cache) can be made noisy by the user - the RWF_NOWAIT flag causes AIO submission to fail with EAGAIN in certain scenarios where blocking would occur. Obviously you would still (a) need a 4.14 (or later) kernel that supports this flag and (b) have to be aware of the cases it doesn't cover (but I notice there are patches that have been accepted, or are being proposed, to return EAGAIN in more scenarios that would otherwise block).
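
As a rough sketch of what using that flag might look like (assuming a 4.14+ kernel and libaio headers new enough to expose the iocb's aio_rw_flags field; with older headers you would have to fill in the raw struct iocb from <linux/aio_abi.h> yourself):

    /* Sketch: ask the kernel to fail with EAGAIN rather than block. Assumes a
       kernel >= 4.14 and libaio headers recent enough to expose aio_rw_flags. */
    #include <libaio.h>
    #include <linux/fs.h>     /* RWF_NOWAIT */
    #include <errno.h>
    #include <stdio.h>

    static int submit_nowait(io_context_t ctx, struct iocb *cb)
    {
        cb->aio_rw_flags = RWF_NOWAIT;       /* fail instead of blocking */
        struct iocb *cbs[1] = { cb };

        int ret = io_submit(ctx, 1, cbs);
        if (ret == -EAGAIN) {
            /* The submission itself would have blocked. */
            fprintf(stderr, "io_submit would block\n");
            return -EAGAIN;
        }
        return ret;
    }

    /* When reaping completions, the same condition may instead show up as a
       completion event whose res field is -EAGAIN, so check both places:
       if ((long)ev.res == -EAGAIN) { retry later, or resubmit without RWF_NOWAIT } */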

Hopefully this post helps someone (and if it does help, could you upvote it? Thanks!).

I speak as an author of proposed Boost.AFIO here.

Firstly, Linux KAIO (io_submit) is almost always blocking unless O_DIRECT is on and no extent allocation is required, and if O_DIRECT is on you need to be reading and writing in 4KB multiples on 4KB-aligned boundaries, else you force the device to do a read-modify-write. You will therefore gain nothing from Linux KAIO unless you rearchitect your application to be O_DIRECT and 4KB-aligned-i/o friendly.
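
As a deliberately conservative illustration, assuming a 4096-byte requirement (the real constraint is the logical block size of the device and filesystem underneath), the submission-side alignment check looks something like this:

    /* Sketch: conservative O_DIRECT alignment check, assuming a 4096-byte
       requirement; the actual constraint depends on the device/filesystem. */
    #include <stdbool.h>
    #include <stdint.h>

    static bool odirect_ok(const void *buf, uint64_t offset, uint64_t len)
    {
        const uint64_t align = 4096;
        return ((uintptr_t)buf % align == 0) &&   /* buffer address aligned  */
               (offset % align == 0) &&           /* file offset aligned     */
               (len % align == 0);                /* length a whole multiple */
    }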

Secondly, never ever extend an output file during a write: doing so forces an extent allocation and possibly a metadata flush. Instead, fallocate the file's maximum extent to some suitably large value up front, and keep an internal atomic counter of the end of file. That should reduce the problem to just extent allocation, which for ext4 is batched and lazy - more importantly, you won't be forcing a metadata flush. That should mean KAIO on ext4 will be async most of the time, but will unpredictably synchronise as it flushes delayed allocations to the journal.
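
A minimal sketch of that approach, with a hypothetical path and size, using Linux's fallocate(2) to reserve the extents once and a C11 atomic as the internal end-of-file counter:

    /* Sketch: preallocate the file's maximum extent once, then hand out write
       offsets from an atomic counter instead of growing the file during writes.
       The path and size are hypothetical. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <unistd.h>

    static _Atomic uint64_t logical_eof;     /* our internal notion of end of file */

    int open_preallocated(const char *path, off_t max_bytes)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) return -1;
        /* Reserve all extents up front so later writes never extend the file. */
        if (fallocate(fd, 0, 0, max_bytes) < 0) { close(fd); return -1; }
        atomic_store(&logical_eof, 0);
        return fd;
    }

    /* Reserve a region to write into; a real implementation would also check the
       result against the preallocated size. The offset can then be handed to
       pwrite() or io_prep_pwrite(). */
    uint64_t reserve(uint64_t nbytes)
    {
        return atomic_fetch_add(&logical_eof, nbytes);
    }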

Thirdly, the way I'd probably approach your problem is to use atomic append (O_APPEND) without O_DIRECT or O_SYNC: you append updates to an ever-growing file in the kernel's page cache, which is very fast and concurrency safe. You then, from time to time, garbage collect the data in the log file that is stale and deallocate its extents using fallocate(FALLOC_FL_PUNCH_HOLE), so physical storage doesn't grow forever. This pushes the problem of coalescing writes to storage onto the kernel, where much effort has been spent on making this fast, and because it's an always-forward-progress write you will see writes hit physical storage in an order fairly close to the sequence they were written, which makes power-loss recovery straightforward. This latter option is how databases, and indeed journalling filesystems, do it, and despite the substantial redesign of your software it likely requires, this algorithm has proven to be the best balance of latency and durability for the general-purpose case.
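
A rough sketch of that pattern, with a hypothetical log path, record format and "stale" region, using O_APPEND for the writes and FALLOC_FL_PUNCH_HOLE (together with FALLOC_FL_KEEP_SIZE, which hole punching requires) for the garbage collection:

    /* Sketch: append-only log plus periodic hole punching. The path, record and
       the notion of which region is stale are hypothetical. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>   /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/update.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_APPEND makes the offset update atomic, so concurrent appenders
           don't overwrite one another's records. */
        const char rec[] = "record\n";
        if (write(fd, rec, sizeof rec - 1) < 0) { perror("write"); return 1; }

        /* Later, once the records in some leading region are known to be stale,
           return their physical storage without shrinking the logical file. */
        off_t stale_off = 0, stale_len = 1 << 20;   /* hypothetical stale region */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      stale_off, stale_len) < 0)
            perror("fallocate(PUNCH_HOLE)");

        close(fd);
        return 0;
    }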

In case all the above seems like a lot of work, the OS already provides all three techniques rolled together into a highly tuned implementation better known as memory maps: 4KB-aligned i/o, O_DIRECT, never extending the file, all async i/o. On a 64-bit system, simply fallocate the file to a very large size and mmap it into memory. Read and write as you see fit. If your i/o patterns confuse the kernel's paging algorithms, which can happen, you may need a touch of madvise() here and there to encourage better behaviour. Less is more with madvise(), trust me.
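
A bare-bones sketch of that setup, with a hypothetical path and size (the madvise() hint shown is only an example and should be dropped unless your access pattern genuinely warrants it):

    /* Sketch: fallocate a large file, map it, and let the page cache do the i/o.
       The path and size are hypothetical; error handling is abbreviated. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t size = 1ULL << 32;       /* 4 GiB of address space; 64-bit only */
        int fd = open("/tmp/data.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (fallocate(fd, 0, 0, size) < 0) { perror("fallocate"); return 1; }

        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Optional hint; use sparingly and only if the access pattern warrants it. */
        madvise(p, size, MADV_RANDOM);

        memcpy(p + 12345, "hello", 5);       /* "write": dirty a page              */
        char c = p[12345];                   /* "read": fault the page in if needed */
        (void)c;

        /* msync() only if you need durability at a specific point. */
        munmap(p, size);
        close(fd);
        return 0;
    }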

An awful lot of people try to duplicate mmaps using various O_DIRECT algorithms without realising that mmaps can already do everything you need. I'd suggest exploring those first; if Linux won't behave, try FreeBSD, which has a much more predictable file i/o model, and only then delve into the realm of rolling your own i/o solution. Speaking as someone who does these all day long, I'd strongly recommend you avoid them whenever possible; filing systems are pits of devilishly quirky and weird behaviour. Leave the never-ending debugging to someone else.
