SSE2 8x8 byte-matrix transpose code twice as slow on Haswell+ as on Ivy Bridge

TL;DR: use punpckldq / punpckhdq to save a massive number of shuffles in the dword-rearranging step, exactly like the transpose code in A better 8x8 bytes matrix transpose with SSE?

Your memory layout requires storing the low / high 8 bytes of each vector result separately, which you can do efficiently with movq [rdx], xmm / movhps [rdx+r9], xmm.


The code is almost twice as slow on Haswell/Skylake as on IvB

Your code bottlenecks heavily on shuffle throughput.

Haswell only has one shuffle execution unit, on port 5. SnB/IvB have 2 integer shuffle units (but still only one FP shuffle unit). See Agner Fog's instruction tables and optimization guide / microarch guide.

I see you already found David Kanter's excellent Haswell microarch write-up.

It's very easy to bottleneck on shuffle (or port5 in general) throughput for code like this, and it often gets worse with AVX / AVX2 because many shuffles are in-lane only. AVX for 128-bit ops might help, but I don't think you'll gain anything from shuffling into 256b vectors and then shuffling them apart again into 64-bit chunks. If you could load or store contiguous 256b chunks, it would be worth trying.


You have some simple missed-optimizations, even before we think about major changes:

  MOVHLPS  xmm6,xmm5
  movq    [rdx+r13],xmm6

should be movhps [rdx+r13],xmm5, which stores the high qword of xmm5 directly. On Sandybridge and Haswell, movhps is a pure store uop, with no shuffle uop required.

pextrd eax,xmm3,0 is always worse than movd eax, xmm3; never use pextrd with an immediate 0. (Also, pextrd directly to memory might be a win. You could do a 64-bit movq store and then overwrite half of it with a 32-bit pextrd. You might then bottleneck on store throughput instead. Also, on Sandybridge, indexed addressing modes don't stay micro-fused, so extra stores would hurt your total fused-domain uop throughput there. Haswell doesn't have that problem for stores, only for some indexed loads, depending on the instruction.) If you end up using more stores in some places and more shuffles in others, use the extra stores for the accesses with single-register addressing modes.
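
For example, a minimal sketch of the movq-then-overwrite idea for the [ a i ] output row from the register diagram further down (addresses as in your code, otherwise placeholders):

   movq    [rdx], xmm1        ; store [ a b ]
   pextrd  [rdx+4], xmm3, 0   ; overwrite the b with i: memory now holds [ a i ]
                              ; (for element 0 a plain movd store works too)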


The source and destination formats are not a degree of freedom in image manipulation.

Depends what you're doing. x264 (the open-source h.264 video encoder) copies 8x8 blocks into contiguous buffers before working with them repeatedly, so the stride between rows is an assemble-time constant.

This saves passing a stride in a register and doing stuff like you're doing with [rcx+2*r8] / [rcx+r8]. It also lets you load two rows with one movdqa. And it gives you good memory locality for accessing 8x8 blocks.
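
For example (a sketch, not x264's actual code; rsi is just a placeholder pointer), with an 8x8 byte block copied into a contiguous 64-byte buffer the stride is the constant 8, so each 16-byte load brings in two rows:

   movdqa  xmm0, [rsi]        ; rows 0 and 1
   movdqa  xmm1, [rsi+16]     ; rows 2 and 3
   movdqa  xmm2, [rsi+32]     ; rows 4 and 5
   movdqa  xmm3, [rsi+48]     ; rows 6 and 7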

Of course it's probably not a win to spend time copying in/out of this format if this rotation is all you're doing with an 8x8 block of pixels. FFmpeg's h.264 decoder (which uses many of the same asm primitives as x264) doesn't use this, but IDK if that's because nobody ever bothered to port the updated x264 asm or if it's just not worth it.


  // now we join one "low" dword from XMM1:xmm2 with one "high" dword
  // from XMM3:xmm4

extract/insert to/from integer is not very efficient; pinsrd and pextrd are 2 uops each, and one of those uops is a shuffle. You might even come out ahead of your current code just using pextrd to memory in 32-bit chunks.
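
A sketch of what "pextrd to memory in 32-bit chunks" could look like for the [ b j ] row from the example below: twice as many stores, but you drop the movdqa / pinsrd, and each pextrd-to-memory costs just one shuffle uop plus the store.

   pextrd  [rdx+r9], xmm1, 1      ; store b
   pextrd  [rdx+r9+4], xmm3, 1    ; store j next to it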

Also consider SSSE3 pshufb, which can put your data anywhere it needs to be and zero the other elements. This can set you up for merging with por. (You might also use pshufb instead of punpcklbw.)
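
A minimal sketch of that pattern, producing [ a i b j ] from the registers in the diagram below (mask_lo_even / mask_lo_odd are hypothetical 16-byte constants; a control byte with its high bit set zeroes that output byte):

   movdqa  xmm5, xmm1
   pshufb  xmm5, [mask_lo_even]   ; 0,1,2,3, -1 x4, 4,5,6,7, -1 x4  ->  xmm5 = [ a 0 b 0 ]
   movdqa  xmm6, xmm3
   pshufb  xmm6, [mask_lo_odd]    ; -1 x4, 0,1,2,3, -1 x4, 4,5,6,7  ->  xmm6 = [ 0 i 0 j ]
   por     xmm5, xmm6             ; xmm5 = [ a i b j ]

For a plain dword interleave like this, punpckldq is still cheaper; pshufb pays off when the byte arrangement you need isn't a simple unpack.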


Another option is to use shufps to combine data from two sources. You might need another shuffle after that. Or use punpckldq.
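
For instance (a sketch, with the same register contents as the diagram below): shufps can pull two dwords from each source, but they come out grouped as [ a b i j ], so it takes a pshufd to finish the interleave:

   movaps  xmm5, xmm1
   shufps  xmm5, xmm3, 0x44    ; xmm5 = [ a b i j ]  (low two dwords of each source)
   pshufd  xmm5, xmm5, 0xD8    ; xmm5 = [ a i b j ]

That's two shuffle uops per result, so for this layout punpckldq wins; shufps is more useful when the dwords you need aren't a simple low/high interleave.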

// "low" dwords from XMM1:xmm2
//  high dwords from XMM3:xmm4

;  xmm1:  [ a b c d ]   xmm2: [ e f g h ]
;  xmm3:  [ i j k l ]   xmm4: [ m n o p ]

; want: [ a i b j ] / [ c k d l ] / ... I think.

;; original code (replaced below):
;  movdqa  xmm5,xmm1     ; xmm5 = [ a b c d ]
;  pextrd  eax,xmm3,0    ; eax = i
;  pinsrd  xmm5,eax,1    ; xmm5 = [ a i ... ]
;  movq    [rdx],xmm5

;  movdqa  xmm5,xmm3       ; xmm5 = [ i j k l ]
;  pextrd  eax,xmm1,1      ; eax = b
;  pinsrd  xmm5,eax,0      ; xmm5 = [ b j ... ]
;  movq    [rdx+r9],xmm5

Replace with this:

   movdqa    xmm5, xmm1
   punpckldq xmm5, xmm3     ; xmm5 = [ a i b j ]
   movq     [rdx], xmm5
   movhps   [rdx+r9], xmm5  ; still a pure store, doesn't cost a shuffle

So we've replaced 4 shuffle uops with 1, and lowered the total uop count from 12 fused-domain uops (Haswell) to 4. (Or on Sandybridge, from 13 to 5 because the indexed store doesn't stay micro-fused).

Use punpckhdq for [ c k d l ], where it's even better because we're replacing movhlps as well.

 ;  movdqa  xmm5,xmm1       ; xmm5 = [ a b c d ]
 ; pextrd  eax,xmm3,2      ; eax = k
 ; pinsrd  xmm5,eax,3      ; xmm5 = [ a b c k ]
 ; MOVHLPS  xmm6,xmm5      ; xmm6 = [ c k ? ? ]  (false dependency on old xmm6)
 ; movq   [rdx+2*r9],xmm6
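
The replacement, a sketch assuming the [ d l ] half goes to [rdx+r13] like the movhlps / movq pair quoted earlier:

   movdqa    xmm6, xmm1
   punpckhdq xmm6, xmm3          ; xmm6 = [ c k d l ]: one shuffle instead of pextrd + pinsrd + movhlps
   movq      [rdx+2*r9], xmm6    ; store [ c k ]
   movhps    [rdx+r13], xmm6     ; store [ d l ]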

Then unpack lo/hi for xmm2 and xmm4.
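
Something like this (a sketch; where each movq / movhps half goes depends on your destination row layout):

   movdqa    xmm5, xmm2
   punpckldq xmm5, xmm4      ; xmm5 = [ e m f n ]  ->  movq / movhps to the matching rows
   movdqa    xmm6, xmm2
   punpckhdq xmm6, xmm4      ; xmm6 = [ g o h p ]  ->  likewise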

Using AVX or AVX2 would let you skip the movdqa, because you can unpack into a new destination register instead of copy + destroy.
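
With VEX encodings the interleave is non-destructive, e.g. (a sketch):

   vpunpckldq xmm5, xmm1, xmm3   ; xmm5 = [ a i b j ], xmm1 untouched: no movdqa copy
   vpunpckhdq xmm6, xmm1, xmm3   ; xmm6 = [ c k d l ]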

Aki Suihkonen

Inserting a dword is more (if not most) efficiently done with a combination of pshufd and an immediate blend.

 pshufd  xmm5, xmm3, 0x55 * slot          ; broadcast source dword `slot` to all four positions
 pblendw xmm1, xmm5, 3 << (dst_slot * 2)  ; blend the two words that make up destination dword `dst_slot`

pblendw is SSE4.1, but of course it's available on Haswell. Unfortunately it only runs on port 5 on Haswell/Skylake, so it still competes with shuffles.
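
For example, a sketch of one concrete instance, replacing the pextrd eax,xmm3,2 / pinsrd xmm5,eax,3 pair from earlier (blend into a scratch copy instead if xmm1 is still needed):

 pshufd  xmm5, xmm3, 0xAA     ; 0x55 * 2: broadcast dword 2 (k) to all four slots
 pblendw xmm1, xmm5, 0xC0     ; 3 << (3*2): words 6-7 = dword 3, so xmm1 = [ a b c k ]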

AVX2 vpblendd runs on any vector-ALU port (p0/p1/p5) on Haswell/Skylake, so is much more efficient than word-granularity pblendw / vpblendw.

If you need to avoid AVX2, consider using SSE4.1 blendps to blend 32-bit elements with an immediate control. It runs on any port on Haswell (or p0/p5 on Sandybridge vs. p1/p5 for shuffles), and the latency penalty for using it on integer data should be irrelevant for your case.
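
The blendps version of the same insert, as a sketch: the control is per-dword, so no mask arithmetic is needed.

 pshufd  xmm5, xmm3, 0xAA     ; broadcast dword 2 (k)
 blendps xmm1, xmm5, 8        ; bit 3 set: take dword 3 from xmm5  ->  xmm1 = [ a b c k ]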
