About the RIDL vulnerabilities and the “replaying” of loads

温柔的废话 · 2020-12-19 16:55

I'm trying to understand the RIDL class of vulnerability.

This is a class of vulnerabilities that can read stale data from various micro-architectural buffers.

2 Answers
  •  星月不相逢 · 2020-12-19 17:21

    Replay = being dispatched again from the RS (scheduler). (This isn't a complete answer to your whole question, just to the part about what replays are. Although I think this covers most of it, including unblocking dependent uops.)

    Caveat: parts of this answer are based on a misunderstanding about load replays.

    See discussion in chat - uops dependent on a split or cache-miss load get replayed, but not the load itself. (Unless the load depends on itself in a loop, like I had been doing for testing >.<). TODO: fix the rest of this answer and others.


    It turns out that a cache-miss load doesn't just sit around in a load buffer and wake up dependent uops when the data arrives. The scheduler has to re-dispatch the load uop to actually read the data and write the result back to a physical register. (And put it on the forwarding network where dependent uops can read it in the next cycle.)

    So L1 miss / L2 hit will result in 2x as many load uops dispatched. (The scheduler is optimistic, and L2 is on-core so the expected latency of an L2 hit is fixed, unlike time for an off-core response. IDK if the scheduler continues to be optimistic about data arriving at a certain time from L3.)
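
    To make that concrete, here's a toy model in C of an RS entry's lifecycle (my own sketch for illustration, not Intel's actual logic; the struct and function names are invented): the entry stays allocated and keeps getting re-dispatched until a dispatch finds the data ready and signals success, so an L1-miss / L2-hit load is dispatched twice.

        // Toy model (illustrative only): an RS entry for a load stays allocated
        // and is re-dispatched until the execution unit signals success.
        #include <stdbool.h>
        #include <stdio.h>

        struct rs_entry { bool allocated; int dispatch_count; };

        // One dispatch attempt: returns true when the data was actually there
        // (an L1d hit, or an earlier miss whose data has now arrived) and the
        // result was written back / put on the forwarding network.
        static bool dispatch_load(struct rs_entry *e, bool data_ready) {
            e->dispatch_count++;
            return data_ready;
        }

        int main(void) {
            struct rs_entry e = { .allocated = true, .dispatch_count = 0 };
            bool ready[] = { false, true };   // attempt 1: L1 miss; attempt 2: L2 hit
            for (int i = 0; e.allocated; i++) {
                if (dispatch_load(&e, ready[i]))
                    e.allocated = false;      // success signalled: entry freed
                // otherwise the entry stays in the RS and is dispatched again
            }
            printf("load uop dispatched %d times\n", e.dispatch_count);
            return 0;
        }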


    The RIDL paper provides some interesting evidence that load uops do actually directly interact with LFBs, not waiting for incoming data to be placed in L1d and just reading it from there.


    We can observe replays in practice most easily for cache-line-split loads, because causing that repeatedly is even more trivial than cache misses, taking less code. The counts for uops_dispatched_port.port_2 and port_3 will be about twice as high for a loop that does only split loads. (I've verified this in practice on Skylake, using essentially the same loop and testing procedure as in How can I accurately benchmark unaligned access speed on x86_64)
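
    For reference, here's a minimal standalone sketch of that kind of test (not the exact loop from that Q&A; the perf event names assume a Skylake-family CPU):

        // Every load crosses the cache-line boundary at buf+64, so each one
        // should dispatch twice on the load ports.  Build with gcc -O2, run:
        //   perf stat -e uops_issued.any,uops_dispatched_port.port_2,uops_dispatched_port.port_3 ./split
        // and compare against the same loop with OFFSET 0 (aligned loads).
        #include <stdint.h>
        #include <string.h>

        #define OFFSET 60   // an 8-byte load at line offset 60 splits across lines

        int main(void) {
            _Alignas(64) static char buf[128];   // two cache lines, stays hot in L1d
            uint64_t sum = 0;
            for (long i = 0; i < 100000000; i++) {
                uint64_t v;
                memcpy(&v, buf + OFFSET, sizeof v);  // compiles to one 8-byte mov
                sum += v;                            // keep the load live
            }
            return (int)sum;   // defeat dead-code elimination
        }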

    Instead of signalling successful completion back to the RS, a load that detects a split (only possible after address-calculation) will do the load for the first part of the data, putting this result in a split buffer¹ to be joined with the data from the 2nd cache line the 2nd time the uop dispatches. (Assuming that neither time is a cache miss, otherwise it will take replays for that, too.)
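
    As a conceptual model of those two dispatches (my own sketch of the idea, not Intel's implementation; the function names are invented):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define LINE 64

        static uint8_t split_buf[8];   // stand-in for one split-buffer entry

        // First dispatch: the split is detected after address calculation; the
        // bytes from the first cache line go into the split buffer, and the uop
        // does NOT signal successful completion.  Returns how many bytes landed.
        static int dispatch1(const uint8_t *mem, size_t addr) {
            int first = (int)(LINE - addr % LINE);
            memcpy(split_buf, mem + addr, (size_t)first);
            return first;
        }

        // Second dispatch (the replay): read the rest from the second cache
        // line, merge with the split buffer, and write back the joined result.
        static uint64_t dispatch2(const uint8_t *mem, size_t addr, int first) {
            memcpy(split_buf + first, mem + addr + first,
                   sizeof split_buf - (size_t)first);
            uint64_t result;
            memcpy(&result, split_buf, sizeof result);
            return result;
        }

        int main(void) {
            uint8_t mem[2 * LINE];
            for (int i = 0; i < 2 * LINE; i++) mem[i] = (uint8_t)i;
            size_t addr = 60;   // 8-byte load crossing the boundary at 64
            int first = dispatch1(mem, addr);
            printf("split load result: %#llx\n",
                   (unsigned long long)dispatch2(mem, addr, first));
            return 0;
        }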


    When a load uop dispatches, the scheduler anticipates that it will hit in L1d and dispatches dependent uops so they can read the result from the forwarding network in the cycle the load puts it on that bus.

    If that didn't happen (because the load data wasn't ready), the dependent uops will have to be replayed as well. Again, IIRC this is observable with the perf counters for dispatch to ports.
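
    A sketch of how one might look for that (my own construction, with Skylake event names): a pointer chase whose loads miss L1d, with one ALU uop fed by each load. If dependents are woken optimistically and then replayed, the summed ALU-port dispatch counts come out higher than one add per iteration plus loop overhead.

        //   gcc -O2 dep.c -o dep
        //   perf stat -e uops_issued.any,uops_dispatched_port.port_0,uops_dispatched_port.port_1,uops_dispatched_port.port_5,uops_dispatched_port.port_6 ./dep
        #include <stdint.h>
        #include <stdlib.h>

        #define N (256 * 1024 / sizeof(void *))   // fits in L2, not in 32 KiB L1d

        int main(void) {
            void **ring = malloc(N * sizeof *ring);
            size_t *order = malloc(N * sizeof *order);
            for (size_t i = 0; i < N; i++) order[i] = i;
            for (size_t i = N - 1; i > 0; i--) {   // shuffle so prefetchers can't follow
                size_t j = (size_t)rand() % (i + 1);
                size_t t = order[i]; order[i] = order[j]; order[j] = t;
            }
            for (size_t i = 0; i < N; i++)         // link entries into one big cycle
                ring[order[i]] = &ring[order[(i + 1) % N]];

            void **p = &ring[order[0]];
            uintptr_t sum = 0;
            for (long i = 0; i < 50000000; i++) {
                p = *p;                 // load: mostly L1d miss / L2 hit
                sum += (uintptr_t)p;    // dependent ALU uop, candidate for replay
            }
            free(order); free(ring);
            return (int)(sum & 1);
        }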


    Existing Q&As with evidence of uop replays on Intel CPUs:

    • Why does the number of uops per iteration increase with the stride of streaming loads?
    • Weird performance effects from nearby dependent stores in a pointer-chasing loop on IvyBridge. Adding an extra load speeds it up?
    • How can I accurately benchmark unaligned access speed on x86_64 and Is there a penalty when base+offset is in a different page than the base?
    • Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths points out that the possibility of replay means the RS needs to hold on to a uop until an execution unit signals successful completion back to the RS. It can't drop a uop on first dispatch (like I guessed when I first wrote that answer).

    Footnote 1:

    We know there are a limited number of split buffers; there's a ld_blocks.no_sr counter for loads that stall for lack of one. I infer they're in the load port because that makes sense. Re-dispatching the same load uop will send it to the same load port because uops are assigned to ports at issue/rename time. Although maybe there's a shared pool of split buffers.


    RIDL:

    Optimistic scheduling is part of the mechanism that creates a problem. The more obvious problem is letting execution of later uops see a "garbage" internal value from an LFB, like in Meltdown.

    http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/ even shows that Meltdown loads on PPro expose various bits of microarchitectural state, exactly like this vulnerability that still exists in the latest processors:

    The Pentium Pro takes the “load value is a don’t-care” quite literally. For all of the forbidden loads, the load unit completes and produces a value, and that value appears to be various values taken from various parts of the processor. The value varies and can be non-deterministic. None of the returned values appear to be the memory data, so the Pentium Pro does not appear to be vulnerable to Meltdown.

    The recognizable values include the PTE for the load (which, at least in recent years, is itself considered privileged information), the 12th-most-recent stored value (the store queue has 12 entries), and rarely, a segment descriptor from somewhere.

    (Later CPUs, starting with Core 2, expose the value from L1d cache; this is the Meltdown vulnerability itself. But PPro / PII / PIII isn't vulnerable to Meltdown. It apparently is vulnerable to RIDL attacks in that case instead.)

    So it's the same Intel design philosophy that's exposing bits of microarchitectural state to speculative execution.

    Squashing that to 0 in hardware should be an easy fix; the load port already knows it wasn't successful so masking the load data according to success/fail should hopefully only add a couple extra gate delays, and be possible without limiting clock speed. (Unless the last pipeline stage in the load port was already the critical path for CPU frequency.)
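
    In C terms, the proposed gating is just an AND with a mask derived from the success signal (a conceptual sketch of the idea, not actual hardware):

        #include <stdint.h>
        #include <stdio.h>

        // All-ones mask on success, all-zeros on failure: one AND per result
        // bit, i.e. roughly the "couple extra gate delays" estimated above.
        static uint64_t forward(uint64_t load_data, int success) {
            uint64_t mask = 0 - (uint64_t)(success != 0);
            return load_data & mask;
        }

        int main(void) {
            printf("%#llx\n", (unsigned long long)forward(0xdeadbeef, 1)); // the data
            printf("%#llx\n", (unsigned long long)forward(0xdeadbeef, 0)); // zero
            return 0;
        }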

    So probably an easy and cheap fix in hardware for future CPUs, but very hard to mitigate with microcode / software on existing CPUs.
