WC vs WB memory? Other types of memory on x86_64?

Question

Could you describe the meanings of, and the differences between, WC and WB memory on x86_64? For completeness, please describe the other types of memory on x86_64, if any.

Answer 1:

I will start with writeback caching (WB), since it is easier to understand.

Writeback caching

As the name implies, this caching strategy tries to delay writes to system memory for as long as possible.
Ideally, only the cache is used.

However, the cache is finite and smaller than memory, and its internal organization (see Wikipedia's Cache article for an introduction) lets different addresses alias to the same place and conflict, so occasionally a cache line must be evicted to memory.
This is a writeback event. The other source of writeback events is the cache coherency mechanism (see MESI).

This is the basic idea of WB. More technically, a caching strategy is defined by the actions taken upon four main events.

Event              | Action
-------------------+----------------------------------------------
Read hit           | Read from the cache line
-------------------+----------------------------------------------
Read miss          | Fill the line then read from the cache line
-------------------+----------------------------------------------
Write hit          | Write to the cache line
-------------------+----------------------------------------------
Write miss         | Fill the line then write to the cache line*

* Since P6, we will assume this is always the case

As you can see, none of these actions writes to memory directly. Writes to memory happen only as a side effect of the limited size of the caches, which forces evictions.
Other caching attributes, like the replacement policy, coherency, and the speculative behaviour allowed (and how all this influences visibility and ordering of the memory accesses), are left out of this answer as they are orthogonal to the caching strategy.
Intel sets some of these attributes implicitly for each strategy, but that's just to save some configuration bits.
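To make the four WB actions concrete, here is a minimal sketch of a toy write-back cache in C. It is not how any real CPU is built: the direct-mapped layout, the sizes and every name (wb_cache_t, wb_read, wb_write, ...) are assumptions made up for illustration. What it shows is that a store only touches the cache line, and that memory is written only when a dirty line is evicted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64u    /* bytes per cache line (assumption) */
#define NUM_LINES 4u     /* tiny direct-mapped cache (assumption) */
#define MEM_SIZE  1024u  /* toy "system memory" in bytes */

typedef struct {
    int      valid;
    int      dirty;
    uint32_t tag;                /* line-aligned address held by this line */
    uint8_t  data[LINE_SIZE];
} line_t;

typedef struct {
    uint8_t  mem[MEM_SIZE];
    line_t   lines[NUM_LINES];
    unsigned mem_writes;         /* number of lines written back to memory */
    unsigned line_fills;         /* number of line fills */
} wb_cache_t;

void wb_init(wb_cache_t *c) { memset(c, 0, sizeof *c); }

/* Return the line holding addr; on a miss, evict the old line and fill. */
line_t *wb_lookup(wb_cache_t *c, uint32_t addr)
{
    assert(addr < MEM_SIZE);
    uint32_t base = addr & ~(LINE_SIZE - 1u);
    line_t *l = &c->lines[(base / LINE_SIZE) % NUM_LINES];
    if (l->valid && l->tag == base)
        return l;                                 /* hit */
    if (l->valid && l->dirty) {                   /* writeback event */
        memcpy(&c->mem[l->tag], l->data, LINE_SIZE);
        c->mem_writes++;
    }
    memcpy(l->data, &c->mem[base], LINE_SIZE);    /* line fill */
    c->line_fills++;
    l->valid = 1;
    l->dirty = 0;
    l->tag = base;
    return l;
}

/* Read hit: read the line. Read miss: fill, then read the line. */
uint8_t wb_read(wb_cache_t *c, uint32_t addr)
{
    return wb_lookup(c, addr)->data[addr % LINE_SIZE];
}

/* Write hit: write the line. Write miss: fill, then write the line. */
void wb_write(wb_cache_t *c, uint32_t addr, uint8_t value)
{
    line_t *l = wb_lookup(c, addr);
    l->data[addr % LINE_SIZE] = value;
    l->dirty = 1;
}
```

With these sizes, addresses 0 and 256 map to the same line, so writing to address 0 and then reading address 256 forces the only write to memory in this model.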

For completeness, here is a comparison table with other caching strategies.
Write combining (WC) is intentionally left out.

Legend
   LF = Line Fill;         LR = Read from line;   MR = Read from memory
   LW = Write to line;     MW = Write to memory

Event              | WB      | UC  | WT       | WP
-------------------+---------+-----+----------+--------
Read hit           | LR      | MR* | LR       | LR
-------------------+---------+-----+----------+--------
Read miss          | LF, LR  | MR  | LF, LR   | LF, LR
-------------------+---------+-----+----------+--------
Write hit          | LW      | MW  | LW, MW** | MW***
-------------------+---------+-----+----------+--------
Write miss         | LF, LW  | MW* | MW       | MW

* A hit on a UC region can happen if the cache type has been
  changed without invalidating the cache. In truth for UC talking about
  hits is a bit misleading. Caching is bypassed, so it is actually a
  N.A. case.   

** The line can be invalidated as the result of the eviction operation
   used.

*** This can never happen, WP doesn't use the cache for writes so this
    is actually an N.A. case.  
    Writing directly to memory by bypassing the caches also invalidates
    all the copies of the affected line in other processors.

Remark: The difference between WP and WT is that the latter can be seen as "piercing" the caches while the former goes "around" the caches.
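The comparison table can also be read as a small lookup structure. The following is only a sketch that transcribes the table into C, using the abbreviations from the legend as bit flags; the names policy, actions and store_reaches_memory are made up for this illustration and do not correspond to anything in the Intel manuals.

```c
#include <assert.h>

/* Action bits, named after the legend of the comparison table */
enum { LF = 1, LR = 2, MR = 4, LW = 8, MW = 16 };

typedef enum { WB, UC, WT, WP, NUM_TYPES } mem_type_t;
typedef enum { READ_HIT, READ_MISS, WRITE_HIT, WRITE_MISS, NUM_EVENTS } event_t;

/* One row per memory type, one column per event, copied from the table.
   The footnoted cells (UC hits, WP write hit) are kept as written there. */
static const unsigned policy[NUM_TYPES][NUM_EVENTS] = {
    [WB] = { LR, LF | LR, LW,      LF | LW },
    [UC] = { MR, MR,      MW,      MW      },
    [WT] = { LR, LF | LR, LW | MW, MW      },
    [WP] = { LR, LF | LR, MW,      MW      },
};

/* Set of actions taken by a memory type on a given event */
unsigned actions(mem_type_t type, event_t event)
{
    assert(type < NUM_TYPES && event < NUM_EVENTS);
    return policy[type][event];
}

/* Nonzero if the event makes a write go to memory right away */
int store_reaches_memory(mem_type_t type, event_t event)
{
    return (actions(type, event) & MW) != 0;
}
```

For example, store_reaches_memory(WB, WRITE_HIT) is 0 while store_reaches_memory(WT, WRITE_HIT) is nonzero, which is the whole difference between writeback and write-through on a hit.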

Write combining

WC is not really a caching strategy, but since it is tightly coupled to the caching strategies it is discussed alongside them.
I've found the Intel SDMs quite confusing on this topic; this is my interpretation of the matter so far.
Corrections are welcomed!

There is an old paper from Intel on WC. Its content has been included in section 11.3.1 of the Intel SDM volume 3, but in doing that Intel lost some context and structure that was available in the original document.

The idea of WC is to coalesce writes before or during a long-lasting operation like a bus transaction or a cache coherency transaction.

In order to achieve that, the processor has a number of WC buffers - don't confuse these with the store buffer of the cache lines.
A WC buffer is a separate entity altogether!

The number of WC buffers is limited: in P6 (and consequently in Pentium M) there were 6 WC buffers, the Pentium 4 had 8, and since Nehalem there are 10 WC buffers (see section 3.6.10 of the Intel Optimisation manual).

The size of each WC buffer is not architecturally defined, but up to now it has always been the same size as a cache line (so 32 bytes for P6, 64 bytes for the others).

The WC buffers sit before the caches.

Now the twist: WC really means two things.

1. Before the CPU is allowed to write to a cache line, in the case of a write miss, it must execute an RFO (Read For Ownership) to inform the other CPUs that their cached data, if any, is now invalid.
   Since this can take a long time, the CPU parks the store in the WC buffer and goes on with its work. The cache subsystem will then autonomously move the WC buffer to the appropriate cache.
   If successive stores to the same line, for which an RFO is being performed, arrive, they are all sunk in the WC buffer.

2. If the memory type involves no caching, then the WC buffer acts as a parking lot for the stores before they are transferred to the bus all together.
   Starting a bus transaction takes some overhead, but once it is started, data can be transferred in successive bursts (actually 8 bytes each) quite fast.
   Writing less data than the memory bus width is a waste of memory bandwidth, pretty much like making a bus (a real bus, the one for people) run half empty. So WC attempts to exploit the full memory bus width.

Remark: The UC memory type bypasses the whole cache subsystem, including the WC buffers. The WC memory type bypasses the caches but not the WC buffers.

In the first case, WC allows the CPU to continue its work while a long-lasting operation is in progress; in the second case, it allows for an efficient transfer of data.
Intel uses the same term, WC, for both cases, leading to confusion in my opinion.
The Intel SDM reads that WC is allowed only on WC memory type regions, yet later it claims that it is used for all memory types and reports it in the description of each cache strategy.
I believe this is because Intel is referring to the two WCs described above.
I believe the second type was the original one and that WC buffers have been reused when MESI became popular.
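From software, the second kind of WC is most commonly met through non-temporal stores, which Intel documents as following the WC protocol even on ordinary WB memory. This is not part of the answer above, just a hedged sketch of how one might fill a buffer so that stores to the same 64-byte line can be combined before reaching memory. The function name fill_streaming is made up; the intrinsics come from <emmintrin.h>, and the plain memset fallback is only there so the code still builds where SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#endif

/* Fill len bytes at dst with value, using non-temporal stores if possible.
   Assumption for this sketch: dst is 16-byte aligned, len is a multiple of 16. */
void fill_streaming(void *dst, uint8_t value, size_t len)
{
    assert(((uintptr_t)dst % 16u) == 0 && (len % 16u) == 0);
#if defined(__SSE2__)
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    /* Four consecutive 16-byte stores cover one 64-byte line, so they
       have a chance to be combined before the line goes out. */
    for (size_t i = 0; i < len / 16u; i++)
        _mm_stream_si128(&p[i], v);   /* movntdq */
    /* Non-temporal stores are weakly ordered: fence before other code
       relies on the data being globally visible. */
    _mm_sfence();
#else
    memset(dst, value, len);          /* portable fallback, no WC involved */
#endif
}
```

Writing whole lines in ascending order, as the loop does, is the pattern that gives the hardware the best chance of combining the stores; scattered partial writes tend to defeat it.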

EDIT

I forgot to mention that the WC memory type is uncacheable (yet speculative; speculation has been intentionally left out).
Peter Cordes summed it up perfectly in his comment:

WC is short-hand for USWC: "Uncacheable Speculative Write-Combining".
The key point is that it's an uncacheable memory type, which is why movntdqa loads are useful (to load a [store buffer entry, that is as wide as a] full cache line, instead of doing separate accesses to DRAM for separate loads [residing in an address range that could be cached] from the same cache line).

I just added the text in the brackets because I believe he used the term "cache line" as a measurement unit and as a distinguishing property of memory addresses.

Only four WC buffers can be active at a given moment.
The Intel optimisation manual also reports that only two can be used to combine writes to caches.

When using WC to combine stores to memory no coherence is enforced by the processor.
WC buffers are not snooped. Further, no order is enforced between currently active WC buffers (i.e. which one is spilt first).
When a WC buffer must be evicted, this is done either by performing a single bus transaction, if the WC buffer is all dirty, or by performing up to 8 (64 / 8) or 4 (32 / 8) bus transactions, if the WC buffer contains partial data.
Again when spilling a WC buffer with partial data no ordering is enforced between the chunks.
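The eviction rule just described can be sketched as a small model. The assumptions are mine: a 64-byte buffer, one dirty bit per byte, and one bus transaction per 8-byte chunk that contains at least one dirty byte when the buffer is only partially written. The names wc_buffer_t, wc_store and wc_evict_transactions are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define WC_LINE  64u  /* buffer size in bytes (it was 32 on P6) */
#define WC_CHUNK  8u  /* bytes moved per partial bus transaction */

typedef struct {
    uint64_t dirty;   /* bit i set means byte i of the buffer was written */
} wc_buffer_t;

/* Record a store of len bytes at the given offset inside the buffer */
void wc_store(wc_buffer_t *b, unsigned offset, unsigned len)
{
    assert(len > 0 && offset + len <= WC_LINE);
    for (unsigned i = offset; i < offset + len; i++)
        b->dirty |= (uint64_t)1 << i;
}

/* Number of bus transactions needed to evict the buffer in this model */
unsigned wc_evict_transactions(const wc_buffer_t *b)
{
    if (b->dirty == UINT64_MAX)
        return 1;                     /* all dirty: a single transaction */
    unsigned n = 0;
    for (unsigned c = 0; c < WC_LINE / WC_CHUNK; c++)
        if ((b->dirty >> (c * WC_CHUNK)) & 0xFFu)
            n++;                      /* one transaction per touched chunk */
    return n;
}
```

In this model, a 4-byte store at offset 0 followed by an 8-byte store at offset 12 touches three chunks and costs three transactions, while writing all 64 bytes costs one. That is the practical reason to fill a whole WC buffer before it gets evicted.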

The P6 evicted a WC buffer as soon as the software wrote outside the range covered by the buffer. The exact algorithm is implementation defined; there is usually a time window to allow the software to coalesce the writes.
It's possible to force an eviction by accessing a UC memory type region or upon specific events (see the paper linked at the beginning of this section).

Source: https://stackoverflow.com/questions/45623007/wc-vs-wb-memory-other-types-of-memory-on-x86-64

Tags

assembly

memory

x86-64

cpu-cache

memory-mapping