Packing two DWORDs into a QWORD to save store bandwidth

Asked by 梦毁少年i on 2020-12-11 22:21

Imagine a load-store loop like the following which loads DWORDs from non-contiguous locations and stores them contiguously:

top:
    mov eax, DWORD [rsi]       ; load from one location
    mov DWORD [rdi], eax       ; first DWORD store
    mov eax, DWORD [rdx]       ; load from another, non-contiguous location
    mov DWORD [rdi+4], eax     ; second DWORD store, contiguous with the first
    ; ... pointer updates and loop branch back to top
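The goal, combining two DWORD loads into a single QWORD store, can be modeled in plain C (function and variable names are illustrative):

```c
#include <stdint.h>
#include <string.h>

/* Pack two 32-bit values and write them with one 64-bit store.
   On little-endian x86, 'a' ends up at dst[0..3] and 'b' at dst[4..7],
   the same layout as two contiguous DWORD stores. */
static void pack_store(uint8_t *dst, uint32_t a, uint32_t b) {
    uint64_t q = (uint64_t)a | ((uint64_t)b << 32);
    memcpy(dst, &q, 8);   /* one QWORD store instead of two DWORD stores */
}
```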
1 Answer
  • Answered 2020-12-11 22:53

    > It also seems like we could maybe get the shift for free by making the second load a QWORD load offset by -4, but then we are left clearing out garbage in the loaded DWORD.

    If wider loads are OK for correctness and performance (cache-line splits...), we can use shld:

    top:
        mov eax, DWORD [rsi]
        mov rbx, QWORD [rdx-4]     ; unaligned(?) 64-bit load
    
        shld rax, rbx, 32          ; 1 uop on Intel SnB-family, 0.5c recip throughput
        mov QWORD [rdi], rax
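To see why the offset load works, here is a plain-C model (names are illustrative): shld rax, rbx, 32 computes (rax << 32) | (rbx >> 32), and on little-endian x86 the high 32 bits of the QWORD loaded at [rdx-4] are exactly the DWORD at [rdx]:

```c
#include <stdint.h>

/* Model of:  mov eax, [rsi]  /  mov rbx, [rdx-4]  /  shld rax, rbx, 32
   shld dst, src, 32 computes (dst << 32) | (src >> 32), so the garbage
   low half of the offset QWORD load is shifted out for free. */
static uint64_t pack_via_shld(uint32_t dword_at_rsi, uint64_t qword_at_rdx_minus_4) {
    uint64_t rax = dword_at_rsi;                 /* zero-extending DWORD load */
    return (rax << 32) | (qword_at_rdx_minus_4 >> 32);
}
```

Note the resulting order: the DWORD from [rdx] lands in the low half of the stored QWORD and the one from [rsi] in the high half.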
    

    MMX punpckldq mm0, [mem] micro-fuses on SnB-family (including Skylake).

    top:
        movd       mm0, DWORD [rsi]
        punpckldq  mm0, QWORD [rdx]     ; 1 micro-fused uop on Intel SnB-family
    
        movq       QWORD [rdi], mm0
    
     ; required after the loop, making it only worth-while for long-running loops
     emms
    

    punpckl instructions unfortunately have a vector-width memory operand, not half-width. This often spoils them for uses where they'd otherwise be perfect (especially the SSE2 version where the 16B memory operand must be aligned). But note that the MMX versions (with only a qword memory operand) don't have an alignment requirement.

    You could also use the 128-bit AVX version, but that's even more likely to cross a cache line boundary and be slow. (Skylake does not optimize by loading only the required 8 bytes; a loop with an aligned mov + vpunpckldq xmm1, xmm0, [cache_line-8] runs at 1 iter per 2 clocks vs. 1 iter per clock for aligned.) The AVX version is required to fault if the 16-byte load crosses into an unmapped page, so it couldn't just use a narrower load without extra support from the load port. :/

    Such a frustrating and useless design decision (presumably made before load ports could zero-extend for free, and not fixed with AVX). At least we have movhps as a replacement for memory-source punpcklqdq, but narrower widths that actually shuffle can't be replaced.


    To avoid cache-line splits, you could also use a separate movd load and punpckldq, or SSE4.1 pinsrd. With this, there's no reason to use MMX.

    top:
        movd       xmm0, DWORD [rsi]
    
        movd       xmm1, DWORD [rdx]           ; SSE2
        punpckldq  xmm0, xmm1
        ; or pinsrd  xmm0, DWORD [rdx], 1      ; 2 uops not micro-fused
    
        movq       QWORD [rdi], xmm0
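The same sequence can be written with SSE2 intrinsics; a sketch (function and pointer names are illustrative), which compilers typically lower to the movd / punpckldq / movq sequence above:

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 */

/* Two separate DWORD loads, interleave the low dwords, one QWORD store.
   Avoids the cache-line-split risk of a wide memory-source punpckldq. */
static void pack2(uint32_t *dst, const uint32_t *srca, const uint32_t *srcb) {
    __m128i a = _mm_cvtsi32_si128((int)*srca);    /* movd xmm0, [rsi] */
    __m128i b = _mm_cvtsi32_si128((int)*srcb);    /* movd xmm1, [rdx] */
    __m128i packed = _mm_unpacklo_epi32(a, b);    /* punpckldq xmm0, xmm1 */
    _mm_storel_epi64((__m128i *)dst, packed);     /* movq [rdi], xmm0 */
}
```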
    

    Obviously AVX2 vpgatherdd is a possibility, and may perform well on Skylake.
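For reference, vpgatherdd performs eight independent DWORD loads from base + index*4 addresses in one instruction; a scalar C model of the semantics (names are illustrative):

```c
#include <stdint.h>

/* Scalar model of vpgatherdd ymm, [base + ymm_idx*4]:
   eight independent DWORD loads gathered into one 32-byte result,
   which can then be written back with a single contiguous store. */
static void gather8_dwords(uint32_t *dst, const uint32_t *base, const int32_t *idx) {
    for (int i = 0; i < 8; i++)
        dst[i] = base[idx[i]];
}
```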
