Where is the Write-Combining Buffer located? x86

心在旅途 2020-12-05 12:20

How is the Write-Combining buffer physically hooked up? I have seen block diagrams illustrating a number of variants:

  • Between L1 and Memory controller
  • B
3 Answers
  •  温柔的废话 2020-12-05 12:56

    In modern Intel CPUs, write-combining is done by the LFBs (line-fill-buffers), also used for other pending transfers from L1 <-> L2. Each core has 10 of these (since Nehalem). (Transfers between L2 and L3 use different buffers, called the "superqueue").
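
    This isn't from the answer, just a rough sketch of our own: one way the LFB count shows up in software is memory-level parallelism. Each outstanding L1d miss occupies one LFB, so chasing K independent pointer chains should speed up roughly with K until K approaches the LFB count, then flatten. All names and sizes below are ours, and the timing/shuffling is deliberately crude.

        /* Crude MLP probe (a sketch, not a rigorous benchmark): CHAINS
         * independent pointer chases keep CHAINS cache misses in flight,
         * each occupying one LFB while it waits for data. */
        #define _POSIX_C_SOURCE 199309L
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define CHAINS 8           /* try 1..16: gains should fade near ~10 */
        #define NODES (1u << 20)   /* 8 MiB per chain: larger than typical L2 */

        int main(void)
        {
            size_t *chain[CHAINS];
            for (int c = 0; c < CHAINS; c++) {
                chain[c] = malloc(NODES * sizeof(size_t));
                for (size_t i = 0; i < NODES; i++) chain[c][i] = i;
                /* Sattolo's algorithm: one full-length cycle, so p = chain[p]
                 * visits every node and defeats the hardware prefetchers. */
                for (size_t i = NODES - 1; i > 0; i--) {
                    size_t j = (size_t)rand() % i;
                    size_t t = chain[c][i]; chain[c][i] = chain[c][j]; chain[c][j] = t;
                }
            }
            size_t p[CHAINS] = {0};
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (size_t step = 0; step < NODES; step++)
                for (int c = 0; c < CHAINS; c++)
                    p[c] = chain[c][p[c]];   /* CHAINS independent misses in flight */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("%.1f ns per round of %d loads (p[0]=%zu)\n",
                   ns / NODES, CHAINS, p[0]);
            return 0;
        }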

    That's why Intel recommends avoiding too much other traffic when doing NT stores, to avoid early flushes of partially-filled LFBs caused by demand loads allocating LFBs. https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers
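
    As a concrete illustration (a sketch of our own, not code from the linked article): group NT stores so each 64-byte line is written back-to-back, giving the write-combining LFB a chance to fill completely before anything forces an early flush. This assumes dst is 16-byte aligned (required by _mm_stream_si128) and n is a multiple of 64.

        #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
        #include <stddef.h>

        /* dst assumed 16-byte aligned, n a multiple of 64 (one cache line). */
        void copy_nt(void *dst, const void *src, size_t n)
        {
            __m128i *d = (__m128i *)dst;
            const __m128i *s = (const __m128i *)src;
            for (size_t i = 0; i < n / 16; i += 4) {
                /* Four back-to-back 16-byte NT stores cover one 64-byte line,
                 * so the LFB doing the write-combining fills and closes in
                 * one burst instead of sitting half-full. */
                _mm_stream_si128(d + i + 0, _mm_loadu_si128(s + i + 0));
                _mm_stream_si128(d + i + 1, _mm_loadu_si128(s + i + 1));
                _mm_stream_si128(d + i + 2, _mm_loadu_si128(s + i + 2));
                _mm_stream_si128(d + i + 3, _mm_loadu_si128(s + i + 3));
            }
            _mm_sfence();  /* flush any WC buffers; order NT stores before later stores */
        }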

    The "inside" of the LFBs has connections to L1d, the store buffer, and load ports.

    The "outside" of the LFBs can talk to L2 or (probably with L2's help) go over the ring bus / mesh to memory controllers, or to L3 for NT prefetch. Going off-core is probably not very different for L3 vs. memory: just a different type of message to send on the ring / mesh interconnect between cores. In Intel CPUs, the memory controllers are just another stop on the ring bus (in the "system agent"), like other cores with their slices of L3.

    @BeeOnRope suggests that L1 LFBs aren't really directly connected to the ring bus, and that requests that don't put data into L2 probably still go through the L2 superqueue buffers to the ring bus / mesh. This seems likely, so each core only needs one point of presence on the ring bus, and arbitration for it between L2 and L1 happens inside the core.


    NT store data enters an LFB directly from the store buffer, as well as probing L1d to see if it needs to evict that line first.
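
    For example (a minimal sketch of our own): a scalar NT store followed by sfence. The store moves from the store buffer into an LFB, evicting the line from L1d first if it was cached, and the fence ensures the (possibly partially filled) WC buffer is flushed and ordered before any later stores.

        #include <emmintrin.h>   /* SSE2: _mm_stream_si32, _mm_sfence */

        void publish(int *slot)
        {
            _mm_stream_si32(slot, 1); /* NT store: store buffer -> LFB, bypassing L1d */
            _mm_sfence();             /* flush the WC buffer; order before later stores */
        }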

    Normal store data enters an LFB when it's evicted from L1d, either to make room for a new line being allocated or in response to an RFO (Read For Ownership) from another core that wants to write that line.

    Normal loads (and stores) that miss in L1d need the cache to fetch that line, which also allocates an LFB to track the incoming line (and the request to L2). When data arrives, it's sent straight to a load buffer that's waiting for it, in parallel with placing it in L1d. (In CPU architecture terms, see "early restart" and "critical word first": the cache miss only blocks until the needed data arrives, the rest of the cache line arrives "in the background".) You (and the CPU architects at Intel) definitely don't want L2 hit latency to include placing the data in L1d and getting it back out again.

    NT loads from WC memory (movntdqa) read directly from an LFB; the data never enters cache at all. LFBs already have a connection to load ports for early restart of normal loads, so SSE4 was able to add movntdqa without a lot of extra cost in silicon, I think. It is special, though, in that a miss fills an LFB directly from memory, bypassing L3/L2/L1. NT stores already need the LFBs to be able to talk to memory controllers.
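
    A sketch of such a loop (our example, assuming src points into a WC-mapped region such as a mapped frame buffer and is 16-byte aligned; on ordinary WB memory movntdqa just behaves like a normal load):

        #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */
        #include <stddef.h>

        /* Copy n_vecs 16-byte vectors out of WC memory via LFBs. Consecutive
         * loads from the same 64-byte line can share one LFB, so read whole
         * lines at a time. */
        void read_wc(__m128i *dst, const __m128i *src, size_t n_vecs)
        {
            for (size_t i = 0; i < n_vecs; i++)
                dst[i] = _mm_stream_load_si128((__m128i *)(src + i)); /* cast: the intrinsic takes a non-const pointer */
        }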
