The Intel Optimization Reference, under Section 3.5.1, advises:
\"Favor single-micro-operation instructions.\"
\"Avoid using complex instructions (for example, e
Agner Fog's insn tables show which port each micro-op runs on, which is all that matters for performance. They don't show exactly what each uop does (i.e. which execution unit it uses on that port), because that's not something you can reverse-engineer.
It's easy to guess in some cases, though: haddps on Haswell is 1 uop for port 1 and 2 uops for port 5. That's pretty obviously 2 shuffles (port 5) and an FP add (port 1). There are lots of other execution units on port 5, e.g. vector booleans, SIMD integer add, and lots of scalar integer stuff, but given that haddps needs multiple uops at all, it's pretty obvious that Intel implements it with shuffles and a regular "vertical" add uop.
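
To make that guess concrete, here's a C intrinsics sketch (my own illustration, not anything Intel documents) that reproduces the haddps result with exactly that shape: two 2-input shufps-style shuffles feeding one ordinary vertical add.

```c
#include <immintrin.h>   /* SSE/SSE3 intrinsics; compile with -msse3 for the check in main */
#include <stdio.h>

/* Guessing at the uop breakdown: two 2-input shuffles (port 5) feeding one
   ordinary vertical addps (port 1).  This reproduces the result of
   haddps a,b; whether Intel's internal uops look exactly like this is a
   guess, not documented behaviour. */
static __m128 haddps_guess(__m128 a, __m128 b) {
    __m128 evens = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0)); /* a0 a2 b0 b2 */
    __m128 odds  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1)); /* a1 a3 b1 b3 */
    return _mm_add_ps(evens, odds);   /* a0+a1, a2+a3, b0+b1, b2+b3 */
}

int main(void) {
    __m128 a = _mm_setr_ps(1, 2, 3, 4), b = _mm_setr_ps(10, 20, 30, 40);
    float r[4], h[4];
    _mm_storeu_ps(r, haddps_guess(a, b));
    _mm_storeu_ps(h, _mm_hadd_ps(a, b));      /* the real instruction, for comparison */
    printf("guess: %g %g %g %g\n", r[0], r[1], r[2], r[3]);
    printf("hadd : %g %g %g %g\n", h[0], h[1], h[2], h[3]);
    return 0;
}
```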
It might be possible to figure out something about the dependency relationship between those uops (e.g. is it 2 shufps-style shuffles feeding an FP add, or is it shuffle-add-shuffle?). We also can't tell whether the shuffles are independent of each other or not: Haswell only has one shuffle port, so the resource conflict would give us 5c total latency either way (1c + 1c for the serialized shuffles plus 3c for the add, instead of 1c + 3c = 4c), because the shuffles couldn't run in parallel even if they were independent.
Both shuffle uops probably need both inputs, so even if they're independent of each other, having one input ready sooner than the other doesn't shorten the critical path (from the later-arriving input to the output).
If it were possible to implement HADDPS with 2 independent one-input shuffles, that would mean HADDPS xmm0, xmm1 in a loop where xmm1 was a constant would only add 4c of latency to the dep chain involving xmm0. I haven't measured, but I think that's unlikely; almost certainly it's two independent 2-input shuffles feeding an ADDPS uop.
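
If you wanted to test it, something like this would do (a rough sketch I haven't run, with an arbitrary iteration count): chain haddps through xmm0 with a loop-invariant second operand and look at cycles per iteration. Around 4c/iter would mean the constant input is off the critical path; around 5c/iter matches the 2-input-shuffle guess.

```c
/* Compile with e.g. gcc -O2 -msse3.  Loop latency test for haddps with a
   loop-invariant second operand: the only loop-carried dep chain is x. */
#include <immintrin.h>
#include <x86intrin.h>   /* __rdtsc */
#include <stdio.h>

int main(void) {
    __m128 x = _mm_set1_ps(1.0f);
    const __m128 c = _mm_set1_ps(2.0f);   /* loop-invariant second operand */
    const long iters = 100000000;

    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < iters; i++)
        x = _mm_hadd_ps(x, c);            /* dep chain runs through x only */
    unsigned long long t1 = __rdtsc();

    volatile float sink = _mm_cvtss_f32(x);   /* keep the chain from being optimized away */
    (void)sink;

    /* rdtsc counts reference cycles, not core clocks; use perf stat and a
       pinned core frequency for real numbers. */
    printf("~%.2f ref cycles per haddps\n", (double)(t1 - t0) / iters);
    return 0;
}
```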