How should I approach finding the number of pipeline stages in my laptop's CPU?

浪尽此生 submitted on 2020-12-23 08:20:25

Question


I want to look into how the latest processors differ from a standard RISC-V implementation (RISC-V being commonly taught with a classic 5-stage pipeline: fetch, decode, execute/ALU, memory access, write-back), but I'm not able to figure out how to start approaching the problem of finding out how pipelining is implemented in a current processor.

I tried referring to Intel's documentation for the i7-4510U, but it was not much help.


Answer 1:


Haswell's pipeline length is reportedly 14 stages (on a uop-cache hit), or 19 stages when fetching from L1i for legacy decode. The only viable approach to finding it is to look it up in articles about that microarchitecture. You can't exactly measure it.


A lot of what we know about Intel and AMD CPU internals is based on presentations at chip conferences by the vendors, their optimization manuals, and their patents. You can't truly measure it with a benchmark, but it's related to the branch mispredict penalty. Note that pipelined execution units each have their own pipelines, and the memory pipeline is also kinda separate.

Your CPU's cores are Intel's Haswell microarchitecture. See David Kanter's deep dive on its internals: https://www.realworldtech.com/haswell-cpu/.

It's a superscalar out-of-order exec design, not a simple in-order design like the classic RISC you're thinking of. Required background reading: Modern Microprocessors: A 90-Minute Guide!, which covers the evolution of CPU architecture from simple non-pipelined, to pipelined, superscalar, and out-of-order execution.

It has sizeable buffers between some pipeline stages, not just a simple latch; its branch prediction works well enough that it's usually worthwhile to fetch ahead and buffer multiple bytes of machine code to hide fetch bubbles. With no stalls anywhere, the issue/rename stage is the narrowest point in the pipeline, so front-end buffers between stages will tend to fill up. (In Haswell, uop-cache fetch is reportedly only 4 uops per clock, too. Skylake widened that to 6, up to a whole uop cache line read into the IDQ.)


https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client) reports the pipeline length as "14-19" stages, which counts from uop-cache fetch or from L1i cache fetch. (Uop cache hits shorten the effective length of the pipeline, cutting out decode.) https://www.anandtech.com/show/6355/intels-haswell-architecture/6 says the same thing.

Also, https://www.7-cpu.com/cpu/Haswell.html measured the mispredict penalty at 15.0 cycles for a uop-cache hit, 18-20 cycles for a uop-cache miss (L1i cache hit). That's correlated with the length of part of the pipeline.

Note that the actual execution units in the back-end each have their own pipeline, e.g. the AVX FMA units on ports 0 and 1 are each 5 stages long. (vmulps / vfma...ps latency of 5 cycles on Haswell.) I don't know whether that 14 - 19 cycle length of the whole pipeline is counting execution as 1 cycle, because typical integer ALU instructions like add have only 1 cycle latency. (And 4/clock throughput.) Slower integer ALU instructions like imul, popcnt, and bsf can only execute on port 1, where they have 3 cycle latency.
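
If you want to see those per-instruction latencies yourself, the usual trick is a serial dependency chain, where each instruction has to wait for the previous result. Here is a minimal sketch of that idea (my own, not from any of the sources above), assuming GCC or Clang on x86-64 and timing with __rdtsc; note that the TSC counts reference cycles, not core clock cycles, so the numbers are only exact if the core runs at the reference frequency.

```c
// Minimal latency-microbenchmark sketch (assumes GCC/Clang on x86-64).
// A serial chain of imul exposes its latency (~3 cycles on Haswell) because
// each multiply must wait for the previous result; loop overhead runs in
// parallel and is hidden by the chain.
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc

int main(void) {
    const uint64_t iters = 100000000;
    uint64_t x = 1;

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        // imul chain: the result of one multiply feeds the next
        asm volatile("imul %1, %0" : "+r"(x) : "r"(x));
    }
    uint64_t t1 = __rdtsc();

    printf("~%.2f reference cycles per dependent imul (x=%llu)\n",
           (double)(t1 - t0) / iters, (unsigned long long)x);
    return 0;
}
```

Compiled with something like gcc -O2, this should print roughly 3 cycles per iteration on Haswell; swapping the imul for an add should drop it to about 1.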

The store buffer also entirely decouples store commit to L1d cache from execution of store instructions. This can have an impact on interrupt latency if the store buffer is full of a bunch of retired cache-miss stores. Being retired from the ROB, they can't be discarded, and have to definitely happen. So they'll block any store done by the interrupt handler from committing until they drain. Or block any serializing instruction (including iret) from retiring; x86 "serializing" instructions are defined as emptying the whole pipeline.

Haswell's store buffer is 42 entries large, and can commit to L1d cache at 1 store per clock assuming no cache misses (or much more slowly when stores miss in cache). Of course, the store buffer isn't a "pipeline"; physically it's likely a circular buffer that's read by some logic that tries to commit the head to L1d cache. This logic is fully separate from the store execution units (which write the address and data into the store buffer). So the size of the store buffer affects how long it can take to drain "the pipeline" in a general sense, but in terms of a pipeline of connected stages from fetch to retirement it's not really part of that.

Even the out-of-order execution back end can have a very long dependency chain in flight that would take a long time to wait for, e.g. a chain of sqrtsd instructions might be the slowest thing you could queue up (maximum latency per uop; see the sketch below), like in this Meltdown exploit example that needs to create a long shadow of speculative execution after a fault. So the time to drain the back-end can be much longer than the "pipeline length". (But unlike the store buffer, these uops can simply be discarded on an interrupt, rolling back to the consistent retirement state.)

(Also related to long dep chains: Are loads and stores the only instructions that gets reordered? and Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths)
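
As a rough illustration of such a long chain (my own sketch, assuming GCC/Clang on x86-64 with SSE2 intrinsics; the iteration count is arbitrary), the loop below strings dependent sqrtsd operations together, so the back-end has a serial chain in flight whose drain time is far longer than any nominal pipeline length:

```c
// Sketch: a serial chain of sqrtsd keeps the out-of-order back-end occupied
// for a long time, because each sqrt depends on the previous result.
// (Assumes GCC/Clang on x86-64; numbers are illustrative, not measured.)
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>   // _mm_set_sd, _mm_sqrt_sd, _mm_cvtsd_f64
#include <x86intrin.h>   // __rdtsc

int main(void) {
    __m128d x = _mm_set_sd(1.2345);

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < 1000; i++) {
        // Latency, not throughput, limits this loop: a long "shadow" of
        // in-flight work that takes many cycles to drain.
        x = _mm_sqrt_sd(x, x);
    }
    uint64_t t1 = __rdtsc();

    printf("chain of 1000 dependent sqrtsd: ~%llu reference cycles (%f)\n",
           (unsigned long long)(t1 - t0), _mm_cvtsd_f64(x));
    return 0;
}
```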


There isn't a simple way to tell from microbenchmarking.

Pipeline length is not really directly meaningful. The performance-relevant characteristic that's related to pipeline length is the branch mispredict penalty. See What exactly happens when a skylake CPU mispredicts a branch?. (And I guess also part of the I-cache miss penalty: how long after data arrives from off-core the back end can start executing anything.) Thanks to out-of-order execution and fast recovery, the branch misprediction penalty can sometimes be partly overlapped with slow "real work" in the back-end; see also Avoid stalling pipeline by calculating conditional early.

What people generally try to measure in practice is the branch mispredict penalty. If you're curious, https://www.7-cpu.com/ is open source; you could have a look at their test code.
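
For a flavour of what such a test can look like, here is a rough sketch of my own (not 7-cpu's actual methodology; it assumes GCC/Clang on x86-64): time a data-dependent branch on random versus all-zero input. With random data the branch mispredicts about half the time, so the per-iteration difference is very roughly half the mispredict penalty. Caveat: check the generated asm, because the compiler may turn the if into a branchless cmov and hide the effect entirely.

```c
// Rough sketch: expose the branch mispredict penalty by timing a data-dependent
// branch on unpredictable vs. predictable input. (Assumes GCC/Clang, x86-64.)
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>   // __rdtsc

#define N 1000000

static uint64_t time_branches(const uint8_t *data) {
    uint64_t sum = 0;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < N; i++) {
        if (data[i])        // the branch under test
            sum += i;
    }
    uint64_t t1 = __rdtsc();
    asm volatile("" :: "r"(sum));   // keep sum alive so the loop isn't removed
    return t1 - t0;
}

int main(void) {
    static uint8_t predictable[N];             // all zeros: never taken
    static uint8_t unpredictable[N];
    for (int i = 0; i < N; i++)
        unpredictable[i] = rand() & 1;         // ~50% taken, no pattern

    uint64_t tp = time_branches(predictable);
    uint64_t tu = time_branches(unpredictable);
    printf("predictable:   %.2f cycles/iter\n", (double)tp / N);
    printf("unpredictable: %.2f cycles/iter\n", (double)tu / N);
    printf("difference ~= 0.5 * mispredict penalty (very roughly)\n");
    return 0;
}
```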

lfence to drain the out-of-order back-end has unknown amounts of overhead beyond just the length of the pipeline, so you can't just use that. You could make a big block of back-to-back lfence instructions to measure lfence throughput, but with nothing between lfences we get about 1 per 4.0 cycles; I guess because it doesn't have to serialize the front-end, which is already in-order. https://www.uops.info/table.html.
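
As a quick sanity check of that "about 1 per 4 cycles" figure, a sketch like the following (my own, assuming GCC/Clang on x86-64 and the _mm_lfence intrinsic) times a lightly unrolled block of back-to-back lfences:

```c
// Sketch: measure back-to-back lfence throughput (assumes GCC/Clang, x86-64).
// On Haswell this comes out around 4 cycles per lfence; note that this is a
// throughput number, not the pipeline length.
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, _mm_lfence

int main(void) {
    const int iters = 1000000;

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
        _mm_lfence();
        _mm_lfence();
        _mm_lfence();
        _mm_lfence();    // unrolled x4 so loop overhead is negligible
    }
    uint64_t t1 = __rdtsc();

    printf("~%.2f reference cycles per lfence\n",
           (double)(t1 - t0) / (iters * 4.0));
    return 0;
}
```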

And rdtsc itself is pretty slow, which makes writing microbenchmarks an extra challenge. Often you have to put stuff in a loop or unrolled block and run it many times so timing overhead becomes negligible.
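
You can estimate that timing overhead itself with back-to-back __rdtsc reads, taking the minimum over many tries; a small sketch of my own (assuming GCC/Clang on x86-64):

```c
// Sketch: estimate rdtsc's own overhead, so you know how large a measured
// region must be before the timing cost becomes negligible.
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc

int main(void) {
    uint64_t min_delta = UINT64_MAX;
    for (int i = 0; i < 1000; i++) {
        uint64_t t0 = __rdtsc();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < min_delta)
            min_delta = t1 - t0;
    }
    printf("back-to-back rdtsc: ~%llu reference cycles\n",
           (unsigned long long)min_delta);
    return 0;
}
```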


RISC-V doesn't have to be 5-stage

The standard RISC-V implementations include an unpipelined core; 2-, 3-, and 5-stage pipelined cores; and an out-of-order implementation (https://riscv.org//wp-content/uploads/2017/05/riscv-spec-v2.2.pdf).

It doesn't have to be implemented as a classic 5-stage RISC, although that would make it very much like classic MIPS and would be normal for teaching CPU-architecture and pipelining.

Note that the classic-RISC pipeline (with 1 mem stage, and address calculation done in EX) requires an L1d access latency of 1 cycle, so that's not a great fit for modern high-performance designs with high clocks and large caches. e.g. Haswell has L1d load latency of 4 or 5 cycles. (See Is there a penalty when base+offset is in a different page than the base? for more about the 4-cycle special case shortcut where it guesses the final address to start TLB lookup in parallel with address-generation.)
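
That load-use latency is one of the few pipeline-related numbers you can actually measure yourself, with the classic pointer-chasing trick: each load's address depends on the previous load's result, so iterations can't overlap and cycles per iteration approximate the L1d load-use latency. A minimal sketch of my own (assuming GCC/Clang on x86-64; the empty asm is only there to stop the optimizer from folding the loads away):

```c
// Sketch: classic pointer-chasing test for L1d load-use latency (assumes
// GCC/Clang on x86-64). Expect roughly 4-5 cycles per dependent load on
// Haswell for a simple addressing mode that hits in L1d.
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>   // __rdtsc

int main(void) {
    // A one-element "linked list" whose node points to itself, so every
    // load's address depends on the previous load's result.
    void *cell = &cell;
    void **p = &cell;
    // Hide the stored value from the optimizer so the loads can't be folded.
    asm volatile("" : : "r"(&cell) : "memory");

    const uint64_t iters = 100000000;
    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        p = (void **)*p;           // serial chain of dependent L1d loads
    uint64_t t1 = __rdtsc();

    printf("~%.2f reference cycles per dependent load (p=%p)\n",
           (double)(t1 - t0) / iters, (void *)p);
    return 0;
}
```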



Source: https://stackoverflow.com/questions/64623260/how-should-i-approach-to-find-number-of-pipeline-stages-in-my-laptops-cpu
