SSE2 8x8 byte-matrix transpose code twice as slow on Haswell+ as on Ivy Bridge

TL;DR: use punpckldq / punpckhdq to save a massive number of shuffles in the dword-rearranging step, exactly like the transpose code in A better 8x8 bytes matrix transpose with SSE?

Your memory layout requires storing the low / high 8 bytes of each vector result separately, which you can do efficiently with movq [rdx], xmm / movhps [rdx+r9], xmm.


The code is almost twice as slow on Haswell/Skylake as on IvB

Your code bottlenecks heavily on shuffle throughput.

Haswell only has one shuffle execution unit, on port 5. SnB/IvB have 2 integer shuffle units (but still only one FP shuffle unit). See Agner Fog's instruction tables and optimization guide / microarch guide.

I see you already found David Kanter's excellent Haswell microarch write-up.

It's very easy to bottleneck on shuffle (or port5 in general) throughput for code like this, and it often gets worse with AVX / AVX2 because many shuffles are in-lane only. AVX for 128-bit ops might help, but I don't think you'll gain anything from shuffling into 256b vectors and then shuffling them apart again into 64-bit chunks. If you could load or store contiguous 256b chunks, it would be worth trying.


You have some simple missed-optimizations, even before we think about major changes:

  MOVHLPS  xmm6,xmm5
  movq    [rdx+r13],xmm6

should be movhps [rdx+r13],xmm5, which stores the high qword of xmm5 directly. On Sandybridge and Haswell, movhps is a pure store uop, with no shuffle uop required.

pextrd eax,xmm3,0 is always worse than movd eax, xmm3; never use pextrd with an immediate 0. (Also, pextrd directly to memory might be a win. You could do a 64-bit movq store and then overwrite half of it with a 32-bit pextrd. You might then bottleneck on store throughput instead. Also, on Sandybridge, indexed addressing modes don't stay micro-fused, so extra stores would hurt your total fused-domain uop throughput there. Haswell doesn't have that problem for stores, only for some indexed loads, depending on the instruction.) If you end up using more stores in some places and more shuffles in others, use the extra stores for the accesses with single-register addressing modes.
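
For example, a minimal sketch of the movq-then-overwrite idea for the [ a i ] output row from the register diagram further down (addresses as in your code, otherwise placeholders):

   movq    [rdx], xmm1        ; store [ a b ]
   pextrd  [rdx+4], xmm3, 0   ; overwrite the b with i: memory now holds [ a i ]
                              ; (for element 0 a plain movd store works too)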


The source and destination formats are not a degree of freedom in image manipulation.

Depends what you're doing. x264 (the open-source h.264 video encoder) copies 8x8 blocks into contiguous buffers before working with them repeatedly, so the stride between rows is an assemble-time constant.

This saves passing a stride in a register and doing stuff like you're doing with [rcx+2*r8] / [rcx+r8]. It also lets you load two rows with one movdqa. And it gives you good memory locality for accessing 8x8 blocks.
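
For example (a sketch, not x264's actual code; rsi is just a placeholder pointer), with an 8x8 byte block copied into a contiguous 64-byte buffer the stride is the constant 8, so each 16-byte load brings in two rows:

   movdqa  xmm0, [rsi]        ; rows 0 and 1
   movdqa  xmm1, [rsi+16]     ; rows 2 and 3
   movdqa  xmm2, [rsi+32]     ; rows 4 and 5
   movdqa  xmm3, [rsi+48]     ; rows 6 and 7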

Of course it's probably not a win to spend time copying in/out of this format if this rotation is all you're doing with an 8x8 block of pixels. FFmpeg's h.264 decoder (which uses many of the same asm primitives as x264) doesn't use this, but IDK if that's because nobody ever bothered to port the updated x264 asm or if it's just not worth it.


  // now we join one "low" dword from XMM1:xmm2 with one "high" dword
  // from XMM3:xmm4

extract/insert to/from integer is not very efficient; pinsrd and pextrd are 2 uops each, and one of those uops is a shuffle. You might even come out ahead of your current code just using pextrd to memory in 32-bit chunks.
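
A sketch of what "pextrd to memory in 32-bit chunks" could look like for the [ b j ] row from the example below: twice as many stores, but you drop the movdqa / pinsrd, and each pextrd-to-memory costs just one shuffle uop plus the store.

   pextrd  [rdx+r9], xmm1, 1      ; store b
   pextrd  [rdx+r9+4], xmm3, 1    ; store j next to it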

Also consider SSSE3 pshufb, which can put your data anywhere it needs to be and zero the other elements. This can set you up for merging with por. (You might also use pshufb instead of punpcklbw.)
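
A minimal sketch of that pattern, producing [ a i b j ] from the registers in the diagram below (mask_lo_even / mask_lo_odd are hypothetical 16-byte constants; a control byte with its high bit set zeroes that output byte):

   movdqa  xmm5, xmm1
   pshufb  xmm5, [mask_lo_even]   ; 0,1,2,3, -1 x4, 4,5,6,7, -1 x4  ->  xmm5 = [ a 0 b 0 ]
   movdqa  xmm6, xmm3
   pshufb  xmm6, [mask_lo_odd]    ; -1 x4, 0,1,2,3, -1 x4, 4,5,6,7  ->  xmm6 = [ 0 i 0 j ]
   por     xmm5, xmm6             ; xmm5 = [ a i b j ]

For a plain dword interleave like this, punpckldq is still cheaper; pshufb pays off when the byte arrangement you need isn't a simple unpack.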


Another option is to use shufps to combine data from two sources. You might need another shuffle after that. Or use punpckldq.
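
For instance (a sketch, with the same register contents as the diagram below): shufps can pull two dwords from each source, but they come out grouped as [ a b i j ], so it takes a pshufd to finish the interleave:

   movaps  xmm5, xmm1
   shufps  xmm5, xmm3, 0x44    ; xmm5 = [ a b i j ]  (low two dwords of each source)
   pshufd  xmm5, xmm5, 0xD8    ; xmm5 = [ a i b j ]

That's two shuffle uops per result, so for this layout punpckldq wins; shufps is more useful when the dwords you need aren't a simple low/high interleave.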

// "low" dwords from XMM1:xmm2
//  high dwords from XMM3:xmm4

;  xmm1:  [ a b c d ]   xmm2: [ e f g h ]
;  xmm3:  [ i j k l ]   xmm4: [ m n o p ]

; want: [ a i b j ] / [ c k d l ] / ... I think.

;; original code (replaced below):
;  movdqa  xmm5,xmm1     ; xmm5 = [ a b c d ]
;  pextrd  eax,xmm3,0    ; eax = i
;  pinsrd  xmm5,eax,1    ; xmm5 = [ a i ... ]
;  movq    [rdx],xmm5

;  movdqa  xmm5,xmm3       ; xmm5 = [ i j k l ]
;  pextrd  eax,xmm1,1      ; eax = b
;  pinsrd  xmm5,eax,0      ; xmm5 = [ b j ... ]
;  movq    [rdx+r9],xmm5

Replace with this:

   movdqa    xmm5, xmm1
   punpckldq xmm5, xmm3     ; xmm5 = [ a i b j ]
   movq     [rdx], xmm5
   movhps   [rdx+r9], xmm5  ; still a pure store, doesn't cost a shuffle

So we've replaced 4 shuffle uops with 1, and lowered the total uop count from 12 fused-domain uops (Haswell) to 4. (Or on Sandybridge, from 13 to 5 because the indexed store doesn't stay micro-fused).

Use punpckhdq for [ c k d l ], where it's even better because we're replacing movhlps as well.

 ;  movdqa  xmm5,xmm1       ; xmm5 = [ a b c d ]
 ; pextrd  eax,xmm3,2      ; eax = k
 ; pinsrd  xmm5,eax,3      ; xmm5 = [ a b c k ]
 ; MOVHLPS  xmm6,xmm5      ; xmm6 = [ c k ? ? ]  (false dependency on old xmm6)
 ; movq   [rdx+2*r9],xmm6
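
The replacement, a sketch assuming the [ d l ] half goes to [rdx+r13] like the movhlps / movq pair quoted earlier:

   movdqa    xmm6, xmm1
   punpckhdq xmm6, xmm3          ; xmm6 = [ c k d l ]: one shuffle instead of pextrd + pinsrd + movhlps
   movq      [rdx+2*r9], xmm6    ; store [ c k ]
   movhps    [rdx+r13], xmm6     ; store [ d l ]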

Then unpack lo/hi for xmm2 and xmm4.
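
Something like this (a sketch; where each movq / movhps half goes depends on your destination row layout):

   movdqa    xmm5, xmm2
   punpckldq xmm5, xmm4      ; xmm5 = [ e m f n ]  ->  movq / movhps to the matching rows
   movdqa    xmm6, xmm2
   punpckhdq xmm6, xmm4      ; xmm6 = [ g o h p ]  ->  likewise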

Using AVX or AVX2 would let you skip the movdqa, because you can unpack into a new destination register instead of copy + destroy.
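
With VEX encodings the interleave is non-destructive, e.g. (a sketch):

   vpunpckldq xmm5, xmm1, xmm3   ; xmm5 = [ a i b j ], xmm1 untouched: no movdqa copy
   vpunpckhdq xmm6, xmm1, xmm3   ; xmm6 = [ c k d l ]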

Aki Suihkonen

Inserting a dword is more (if not most) efficiently done with a combination of pshufd and an immediate blend.

 pshufd  xmm5, xmm3, 0x55 * slot          ; broadcast source dword `slot` to all four positions
 pblendw xmm1, xmm5, 3 << (dst_slot * 2)  ; blend the two words that make up destination dword `dst_slot`

pblendw is SSE4.1, but of course it's available on Haswell. Unfortunately it only runs on port 5 on Haswell/Skylake, so it still competes with shuffles.
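
For example, a sketch of one concrete instance, replacing the pextrd eax,xmm3,2 / pinsrd xmm5,eax,3 pair from earlier (blend into a scratch copy instead if xmm1 is still needed):

 pshufd  xmm5, xmm3, 0xAA     ; 0x55 * 2: broadcast dword 2 (k) to all four slots
 pblendw xmm1, xmm5, 0xC0     ; 3 << (3*2): words 6-7 = dword 3, so xmm1 = [ a b c k ]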

AVX2 vpblendd runs on any vector-ALU port (p0/p1/p5) on Haswell/Skylake, so is much more efficient than word-granularity pblendw / vpblendw.

If you need to avoid AVX2, consider using SSE4.1 blendps to blend 32-bit elements with an immediate control. It runs on any port on Haswell (or p0/p5 on Sandybridge vs. p1/p5 for shuffles), and the latency penalty for using it on integer data should be irrelevant for your case.
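
The blendps version of the same insert, as a sketch: the control is per-dword, so no mask arithmetic is needed.

 pshufd  xmm5, xmm3, 0xAA     ; broadcast dword 2 (k)
 blendps xmm1, xmm5, 8        ; bit 3 set: take dword 3 from xmm5  ->  xmm1 = [ a b c k ]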
