How does Linux kernel flush_write_buffers() work on x86?

Question

The following code is from include/asm-i386/io.h, and it is invoked from dma_map_single(). My understanding is that flush_write_buffers() is supposed to flush CPU memory cache before mapping the memory for DMA. But how does this assembly code flush CPU cache?

static inline void flush_write_buffers(void)
{
    __asm__ __volatile__ ("lock; addl $0,0(%%esp)": : :"memory");
}

Answer 1:

The Intel Pentium Pro processors had a bug wherein a store to a memory location of type UC could be reordered with earlier memory accesses to locations of type WC, which violates the x86 memory consistency model. As a workaround, a correctly implemented memory serializing instruction can be used just before the UC store. On the Pentium Pro processors, any of the following would do the job: (1) cpuid, (2) a UC load, or (3) a lock-prefixed instruction.

The flush_write_buffers function in the Linux kernel uses a lock-prefixed instruction for precisely this purpose. cpuid is the most expensive of the three and does more than is needed here. A UC load requires a memory location of type UC, which is a little inconvenient in general. Hence the choice of a lock-prefixed instruction.

As the name of the function indicates, its purpose is to wait until all pending writes in the write buffer (a.k.a. store buffer, in this context) become globally observable. The caches are not affected.

This bug only affects the Pentium Pro, and the kernel had to be compiled with CONFIG_X86_PPRO_FENCE for the workaround to be enabled. It was difficult, though, to be sure that the workaround was used in all the places in the kernel where it was supposed to be used. Moreover, CONFIG_X86_PPRO_FENCE didn't only affect the operation of flush_write_buffers, but also other constructs, so it could cause significant performance degradation. Eventually, it was dropped from the kernel starting with v4.16-rc7.
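The compile-time gating described above can be sketched roughly as follows. This is an illustration, not a copy of the kernel source: the #define stands in for the real Kconfig option, the x86-64 branch is an assumption added only so the sketch builds on 64-bit hosts, and prepare_buffer_demo is a hypothetical caller.

```c
#include <assert.h>

/* Stand-in for the Kconfig option. In a real kernel build this symbol comes
 * from the kernel configuration, not from a #define in the source file. */
#define CONFIG_X86_PPRO_FENCE 1

static inline void flush_write_buffers(void)
{
#if defined(CONFIG_X86_PPRO_FENCE) && defined(__i386__)
    /* Locked add of 0 to the top of the stack: memory is unchanged, but the
     * lock prefix forces pending stores to become globally observable. */
    __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory");
#elif defined(CONFIG_X86_PPRO_FENCE) && defined(__x86_64__)
    /* Assumption: 64-bit equivalent, present only so the sketch compiles. */
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" : : : "memory");
#endif
    /* Without the option (or on other architectures) the body is empty. */
}

/* Hypothetical caller: fill a buffer, then drain the store buffer before
 * handing the buffer to a device. Returns the value that was written. */
static int prepare_buffer_demo(int *buf, int value)
{
    *buf = value;
    flush_write_buffers();
    return *buf;
}
```

With the option undefined, the function body is empty and the call costs nothing, which is why kernels built without the option never paid for the workaround.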

Answer 2:

What you are seeing is a memory fence. That instruction guarantees that all preceding load and store instructions become globally visible before any following load or store instruction executes.

A fence acts as a barrier, with the effect of flushing CPU buffers (note: buffers, not cache, that's a different thing) because data that was waiting to be written needs to be made globally available right away before continuing, in order to ensure that successive instructions will fetch the correct data.

This function was introduced to get around a hardware problem in an old family of Intel CPUs, namely the Pentium Pro (1995-98), which under specific circumstances caused memory access operations to be executed in the wrong order.

Nowadays the canonical way of applying a fence on x86 is through the mfence, lfence or sfence instructions (depending on the type of fence needed), but those were only added later (sfence with SSE, lfence and mfence with SSE2). On the Pentium Pro, no such instructions were available.
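For comparison, here is a minimal user-space sketch of the three modern fences using the compiler intrinsics _mm_mfence, _mm_sfence and _mm_lfence. The wrapper names (full_fence, store_fence, load_fence, publish_and_read) are hypothetical, and the non-SSE2 fallback to __atomic_thread_fence is an assumption made so the code builds on other targets.

```c
#include <assert.h>
#if defined(__SSE2__)
#include <xmmintrin.h>  /* _mm_sfence (SSE) */
#include <emmintrin.h>  /* _mm_mfence, _mm_lfence (SSE2) */
#endif

/* Full load+store fence: mfence on SSE2-capable x86. */
static inline void full_fence(void)
{
#if defined(__SSE2__)
    _mm_mfence();
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST); /* assumed fallback */
#endif
}

/* Store fence: sfence orders earlier stores before later stores. */
static inline void store_fence(void)
{
#if defined(__SSE2__)
    _mm_sfence();
#else
    __atomic_thread_fence(__ATOMIC_RELEASE); /* assumed fallback */
#endif
}

/* Load fence: lfence orders earlier loads before later loads. */
static inline void load_fence(void)
{
#if defined(__SSE2__)
    _mm_lfence();
#else
    __atomic_thread_fence(__ATOMIC_ACQUIRE); /* assumed fallback */
#endif
}

/* Hypothetical demo: store, fence, then read the value back. */
static int publish_and_read(volatile int *slot, int value)
{
    *slot = value;
    full_fence();
    return *slot;
}
```

None of these instructions exist on the Pentium Pro, which is why the kernel workaround relied on a lock-prefixed instruction instead.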

The lock instruction is really just an instruction prefix, so this:

lock
addl $0,0(%esp)

is actually a single instruction: a "locked add".

The lock prefix is used with opcodes that perform a read-modify-write operation, to make them atomic. When lock addl $0,0(%esp) executes, a load+store fence is implicitly applied so that the instruction is atomic and its result is immediately globally visible. The top of the stack is always readable and writable, and adding 0 is a no-op, so there's no need to pass a valid address to the function. This workaround therefore permits the correct serialization of memory accesses, and it's the fastest type of instruction to accomplish the goal on the Intel Pentium Pro.
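That a locked add of zero leaves memory untouched can be checked with a small user-space sketch. The helper locked_add_zero is hypothetical (it is not a kernel function); it applies the same instruction to an arbitrary int instead of the stack top, and on non-x86 targets it falls back to the GCC/Clang builtin __sync_synchronize, which is an assumption made only for portability.

```c
#include <assert.h>

/* Hypothetical helper: "lock; addl $0" on *p. The stored value does not
 * change; on x86 the lock prefix gives the instruction full-barrier
 * semantics with respect to earlier and later loads and stores. */
static inline void locked_add_zero(int *p)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,%0" : "+m"(*p) : : "memory", "cc");
#else
    __sync_synchronize(); /* assumed fallback: compiler-provided full barrier */
#endif
}

/* Returns x after the locked add; this should equal the input. */
static int add_zero_roundtrip(int x)
{
    locked_add_zero(&x);
    return x;
}

/* Small self-check: the value survives the locked add unchanged. */
static void locked_add_demo(void)
{
    int v = 42;
    locked_add_zero(&v);
    assert(v == 42);
}
```

The kernel version targets 0(%esp) rather than a caller-supplied pointer precisely so that no argument is needed: the stack top is always mapped and writable.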