Where is the Write-Combining Buffer located? x86

心在旅途 2020-12-05 12:20

How is the Write-Combine buffer physically hooked up? I have seen block diagrams illustrating a number of variants:

  • Between L1 and Memory controller
  • B
3 Answers
  •  粉色の甜心
    2020-12-05 12:58

    This patent states that a WC buffer is indeed any line fill buffer that gets marked with 'WC'.

    The currently preferred embodiment uses a structure that already exists in the Intel™ Architecture microprocessor, the fill buffers. The fill buffers are a set of several cache lines with byte granularity valid and dirty bits, used by the out-of-order microprocessor to create a non-blocking cache. The WC buffer is a single fill buffer marked to permit WC stores to be merged. When evicted, the WC fill buffer waits until normal fill buffer eviction. In the currently preferred embodiment, only one write combining buffer is implemented. Physically, any fill buffer can [be] used as the write combining buffer. Since only one logical write combining buffer is provided, when a second write combining buffer is needed, an eviction process is initiated.

    It then goes on to say that the WC buffer can be of a WB type as well as a USWC type. It could be using 'write combining buffer' to mean 'line fill buffer' here, but I don't think so, because the preceding sentence uses the term to refer specifically to the WC buffer.

    This leads me to believe that WC is not talking about USWC memory, but that WC is just a property of a line fill buffer. In this case I'd imagine it's saying that one LFB can be used to combine writes from the store buffer (which may be of WB or USWC type), while the other LFBs are used for evictions, prefetches, etc. between L1 and L2 and do not allow stores to hit.

    The x86-64 optimisation manual states: 'Write combining buffers are used for stores of all memory types' and 'Starting with Intel microarchitecture code name Nehalem, there are 10 buffers available for write combining'. We know Nehalem has 10 LFBs, so this says to me that all 10 can be marked as WC, as shown in figure 3 of the patent (which just happens to outline a scenario where only one LFB can be a WC buffer at a time).

    It also states 'On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line. When a write to a write combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO). If a subsequent write happens to another write-combining buffer, a separate RFO may be caused for that cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been serviced to guarantee properly ordered visibility of the writes. If the memory type for the writes is write-combining, there will be no RFO since the line is not cached, and there is no such delay'.

    A write combining buffer seems to be a special use case of an LFB which is used to combine writes while an RFO (*) is taking place, so the stores can be completed and store buffer entries can be freed up (possibly multiple, if they all write to the same cache line). The valid bits indicate the bytes to merge into the cache line when it arrives in the E state. My interpretation of the next part is that if a write to a 2nd cache line occurs, then in order to write to the first line again, it needs to wait until the 1st and 2nd LFB are written (sequentially) to the L1d cache. This is so as to maintain the correct order of global visibility of writes. I presume that the LFB is dumped to cache as soon as the line is present in cache, and all writes to the line after that write directly to the cache line.

    If the memory type is USWC then an RFO does not need to be performed, but the writes are allocated to the buffer regardless.

    Because PATs operate on virtual addresses, aliasing can occur: the same physical page can have multiple different cache policies. If a streaming store (meaning a USWC write, opcode WCiL(F)) hits in the L3 cache, it causes a QPI WbMtoI of that line, sending it to the correct home agent based on SAD interleave rules, before the USWC store can occur. Presumably the L1/L2 cache also does this as the store passes through, although it might be left to the L3 to evict and write back the line if only one core has a copy. As for USWC loads, I don't actually know. There doesn't seem to be a separate opcode for this, so it may set a flag in a DRd request to indicate it is a non-temporal load. I'm not sure whether the L3 cache can forward aliased cache lines to the USWC read request or whether they have to be written back and the read request has to be satisfied from DRAM (I say DRAM, but the memory controller also probably has a store-to-load forwarding mechanism, so I should say home agent).

    I'm not sure how the 'non-temporal hint' stores/loads work. The Intel Volume 1 manual seems to say that the hint in the store buffer forces all stores other than WP and UC(-) to be interpreted by the L1d controller as USWC, whereas the hint does not change the policy for loads, i.e. it does nothing. Maybe the hint has an extra benefit in the store buffer: the memory scheduler does not know the cache policy of the load/store until the data is returned by the L1d controller, so the hint tells it that weak ordering applies and they can be dispatched more efficiently; I think non-temporal writes can be reordered with other writes.

    (*) I don't know whether an S->E request results in a line fill buffer allocation for a write or whether it can be written to the cache immediately. I'm going to say it does allocate an LFB, because the data could be lost if it were stored in the cache line temporarily while the S->E request was in flight: an invalidate request from L3, on behalf of another core, could come in first. I say 'S->E request' because I don't know what this is called. It could be encapsulated as an RFO packet with a flag indicating the read isn't necessary, or it could be the so-called ItoM, which has conflicting definitions. Some sources call it an RFO where a full cache line write is intended, meaning that the line doesn't need to be read if it's in the I state. This may potentially be used for S->E transitions as well. Instead of being called S/I->E it's called ItoM to indicate the intent to write to the line, but I don't know why ItoE wouldn't also mean this. Funnily enough, there are actually 2 different UPI opcodes for multisocket cache coherency, InvItoE and InvItoM, both with the same description except that the latter adds 'with the intent of performing a writeback soon afterward'.
