Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

Asked by 滥情空心 on 2020-12-18 13:42 · 3 answers

In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this:

[the question's code block was not preserved in this mirror]
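The question's code block did not survive the page conversion. Based on the description, it presumably did four scalar broadcast-loads; a plausible sketch (the array name `b`, the `out` spill array, and the function name are illustrative assumptions, not from the question):

```cpp
#include <immintrin.h>
#include <cassert>

// Illustrative sketch: broadcast 4 contiguous doubles into 4 YMM vectors.
// Each _mm256_broadcast_sd compiles to vbroadcastsd ymm, m64 with AVX.
// The target("avx") attribute lets this compile without -mavx on GCC/Clang;
// the CPU running it must still support AVX. The stores are only there so
// the results can be inspected.
__attribute__((target("avx")))
void broadcast4(const double *b, double out[4][4]) {
    for (int i = 0; i < 4; ++i) {
        __m256d v = _mm256_broadcast_sd(&b[i]);  // all 4 lanes = b[i]
        _mm256_storeu_pd(out[i], v);             // spill for inspection
    }
}
```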
3 Answers
  •  眼角桃花
     2020-12-18 13:57

    TL;DR: It's almost always best to just do four broadcast-loads using _mm256_set1_pd(). That's very good on Haswell and later, where vbroadcastsd ymm, [mem] doesn't require an ALU shuffle uop, and it's usually also the best option on Sandybridge/Ivybridge (where it's a 2-uop load + shuffle instruction).

    It also means you don't need to care about alignment at all, beyond natural alignment for a double.

    The first vector is ready sooner than if you did a two-step load + shuffle, so out-of-order execution can potentially get started on the code using these vectors while the first one is still loading. AVX512 can even fold broadcast-loads into memory operands for ALU instructions, so doing it this way will allow a recompile to take slight advantage of AVX512 with 256b vectors.


    (It's usually best to use set1(x), not _mm256_broadcast_sd(&x). If the AVX2-only register-source form of vbroadcastsd isn't available, the compiler can choose between a store -> broadcast-load or two shuffles. You never know when inlining will mean your code will run on inputs that are already in registers.)
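    To make the contrast concrete, here is a minimal sketch (helper names are made up): set1 leaves the strategy to the compiler, while broadcast_sd pins the source to memory.

```cpp
#include <immintrin.h>
#include <cassert>

// target("avx") lets these compile without -mavx on GCC/Clang; the host
// CPU must still support AVX. Stores are only for inspecting the result.

// set1: compiler picks vbroadcastsd from memory, or shuffles if x is
// already live in a register after inlining.
__attribute__((target("avx")))
void splat_set1(double x, double out[4]) {
    _mm256_storeu_pd(out, _mm256_set1_pd(x));
}

// broadcast_sd: always expressed as a broadcast from a memory address.
__attribute__((target("avx")))
void splat_mem(const double *x, double out[4]) {
    _mm256_storeu_pd(out, _mm256_broadcast_sd(x));
}
```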

    If you're really bottlenecked on load-port resource-conflicts or throughput, not total uops or ALU / shuffle resources, it might help to replace a pair of 64->256b broadcasts with a 16B->32B broadcast-load (vbroadcastf128/_mm256_broadcast_pd) and two in-lane shuffles (vpermilpd or vunpckl/hpd (_mm256_shuffle_pd)).
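    In intrinsics, that trade looks roughly like this sketch (function and parameter names are assumptions): one 16B->32B broadcast-load feeding two in-lane unpacks produces two broadcast vectors with only one load uop.

```cpp
#include <immintrin.h>
#include <cassert>

// Trade two 64->256b broadcast-loads for one vbroadcastf128 plus two
// in-lane shuffles (vunpcklpd / vunpckhpd). target("avx") lets this
// compile without -mavx on GCC/Clang; the CPU must support AVX.
__attribute__((target("avx")))
void broadcast_pair(const double *p, double lo[4], double hi[4]) {
    __m256d both = _mm256_broadcast_pd((const __m128d *)p); // p0 p1 p0 p1
    __m256d b0 = _mm256_unpacklo_pd(both, both);            // p0 in all 4 lanes
    __m256d b1 = _mm256_unpackhi_pd(both, both);            // p1 in all 4 lanes
    _mm256_storeu_pd(lo, b0);   // stores only for inspection
    _mm256_storeu_pd(hi, b1);
}
```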

    Or with AVX2: load 32B and use 4 _mm256_permute4x64_pd shuffles to broadcast each element into a separate vector.
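    A sketch of that AVX2 variant (function name is an assumption): one 32B load, then four lane-crossing shuffles, each selecting one element into all four lanes.

```cpp
#include <immintrin.h>
#include <cassert>

// AVX2: one 32B load plus 4x vpermpd. Each immediate repeats one 2-bit
// element index four times (e.g. 0x55 = 0b01010101 selects element 1).
// target("avx2") lets this compile without -mavx2 on GCC/Clang; the host
// CPU must support AVX2. Stores are only for inspecting the results.
__attribute__((target("avx2")))
void broadcast4_avx2(const double *p, double out[4][4]) {
    __m256d v = _mm256_loadu_pd(p);                           // p0 p1 p2 p3
    _mm256_storeu_pd(out[0], _mm256_permute4x64_pd(v, 0x00)); // p0 x4
    _mm256_storeu_pd(out[1], _mm256_permute4x64_pd(v, 0x55)); // p1 x4
    _mm256_storeu_pd(out[2], _mm256_permute4x64_pd(v, 0xAA)); // p2 x4
    _mm256_storeu_pd(out[3], _mm256_permute4x64_pd(v, 0xFF)); // p3 x4
}
```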


    Source: Agner Fog's instruction tables (and microarch PDF):

    Intel Haswell and later:

    vbroadcastsd ymm,[mem] and other broadcast-load insns are 1uop instructions that are handled entirely by a load port (the broadcast happens "for free").

    The total cost of doing four broadcast-loads this way is 4 instructions. Fused-domain: 4 uops. Unfused-domain: 4 uops for p2/p3. Throughput: two vectors per cycle.

    Haswell only has one shuffle unit, on port5. Doing all your broadcast-loads with load+shuffle will bottleneck on p5.

    Maximum broadcast throughput is probably with a mix of vbroadcastsd ymm,m64 and shuffles:

    ## Haswell maximum broadcast throughput with AVX1
    vbroadcastsd    ymm0, [rsi]
    vbroadcastsd    ymm1, [rsi+8]
    vbroadcastf128  ymm2, [rsi+16]     # p23 only on Haswell, also p5 on SnB/IvB
    vunpckhpd       ymm3, ymm2,ymm2
    vunpcklpd       ymm2, ymm2,ymm2
    vbroadcastsd    ymm4, [rsi+32]     # or vaddpd ymm0, [rdx+something]
    #add             rsi, 40
    

    Any of these addressing modes can be two-register indexed addressing modes, because they don't need to micro-fuse to be a single uop.

    AVX1: 5 vectors per 2 cycles, saturating p2/p3 and p5. (Ignoring cache-line splits on the 16B load). 6 fused-domain uops, leaving only 2 uops per 2 cycles to use the 5 vectors... Real code would probably use some of the load throughput to load something else (e.g. a non-broadcast 32B load from another array, maybe as a memory operand to an ALU instruction), or to leave room for stores to steal p23 instead of using p7.

    ## Haswell maximum broadcast throughput with AVX2
    vmovups         ymm3, [rsi]
    vbroadcastsd    ymm0, xmm3         # special-case for the low element; compilers should generate this from _mm256_permute4x64_pd(v, 0)
    vpermpd         ymm1, ymm3, 0b01_01_01_01   # NASM syntax for 0x55
    vpermpd         ymm2, ymm3, 0b10_10_10_10
    vpermpd         ymm3, ymm3, 0b11_11_11_11
    vbroadcastsd    ymm4, [rsi+32]
    vbroadcastsd    ymm5, [rsi+40]
    vbroadcastsd    ymm6, [rsi+48]
    vbroadcastsd    ymm7, [rsi+56]
    vbroadcastsd    ymm8, [rsi+64]
    vbroadcastsd    ymm9, [rsi+72]
    vbroadcastsd    ymm10,[rsi+80]    # or vaddpd ymm0, [rdx + whatever]
    #add             rsi, 88
    

    AVX2: 11 vectors per 4 cycles, saturating p23 and p5. (Ignoring cache-line splits for the 32B load...). Fused-domain: 12 uops, leaving 2 uops per 4 cycles beyond this.

    I think 32B unaligned loads are a bit more fragile in terms of performance than unaligned 16B loads like vbroadcastf128.


    Intel SnB/IvB:

    vbroadcastsd ymm, m64 is 2 fused-domain uops: p5 (shuffle) and p23 (load).

    vbroadcastss xmm, m32 and movddup xmm, m64 are single-uop load-port-only. Interestingly, vmovddup ymm, m256 is also a single-uop load-port-only instruction, but like all 256b loads, it occupies a load port for 2 cycles. It can still generate a store-address in the 2nd cycle. This uarch doesn't deal well with cache-line splits for unaligned 32B-loads, though. gcc defaults to using movups / vinsertf128 for unaligned 32B loads with -mtune=sandybridge / -mtune=ivybridge.

    4x broadcast-load: 8 fused-domain uops: 4 p5 and 4 p23. Throughput: 4 vectors per 4 cycles, bottlenecking on port 5. Multiple loads from the same cache line in the same cycle don't cause a cache-bank conflict, so this is nowhere near saturating the load ports (also needed for store-address generation). That only happens on the same bank of two different cache lines in the same cycle.

    Multiple 2-uop instructions with no other instructions between is the worst case for the decoders if the uop-cache is cold, but a good compiler would mix in single-uop instructions between them.

    SnB has 2 shuffle units, but only the one on p5 can handle shuffles that have a 256b version in AVX. Using a p1 integer-shuffle uop to broadcast a double to both elements of an xmm register doesn't get us anywhere, since vinsertf128 ymm,ymm,xmm,i takes a p5 shuffle uop.

    ## Sandybridge maximum broadcast throughput: AVX1
    vbroadcastsd    ymm0, [rsi]
    add             rsi, 8
    

    One per clock, saturating p5 but using only half the capacity of p23.

    We can save one load uop at the cost of 2 more shuffle uops, throughput = two results per 3 clocks:

    vbroadcastf128  ymm2, [rsi+16]     # 2 uops: p23 + p5 on SnB/IvB
    vunpckhpd       ymm3, ymm2,ymm2    # 1 uop: p5
    vunpcklpd       ymm2, ymm2,ymm2    # 1 uop: p5
    

    Doing a 32B load and unpacking it with 2x vperm2f128 -> 4x vunpckh/lpd might help if stores are part of what's competing for p23.
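    A sketch of that one-load variant (function name is an assumption): 2x vperm2f128 duplicates each 128b half across both lanes, then 4 in-lane unpacks finish the broadcasts, using a single load uop and 6 shuffle uops.

```cpp
#include <immintrin.h>
#include <cassert>

// One 32B load, 2x vperm2f128, then 4x vunpckl/hpd. Minimizes p23
// pressure at the cost of extra p5 shuffles. target("avx") lets this
// compile without -mavx on GCC/Clang; the CPU must support AVX.
// Stores are only for inspecting the results.
__attribute__((target("avx")))
void broadcast4_oneload(const double *p, double out[4][4]) {
    __m256d v  = _mm256_loadu_pd(p);                  // p0 p1 p2 p3
    __m256d lo = _mm256_permute2f128_pd(v, v, 0x00);  // p0 p1 p0 p1
    __m256d hi = _mm256_permute2f128_pd(v, v, 0x11);  // p2 p3 p2 p3
    _mm256_storeu_pd(out[0], _mm256_unpacklo_pd(lo, lo)); // p0 x4
    _mm256_storeu_pd(out[1], _mm256_unpackhi_pd(lo, lo)); // p1 x4
    _mm256_storeu_pd(out[2], _mm256_unpacklo_pd(hi, hi)); // p2 x4
    _mm256_storeu_pd(out[3], _mm256_unpackhi_pd(hi, hi)); // p3 x4
}
```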
