Faster way to test if xmm/ymm register is zero?

Submitted by Deadly on 2019-12-10 16:48:16

Question


It's fortunate that PTEST does not only set the (rather awkward) ZF, but affects both CF and ZF.

I've come up with the following sequence to test a large number of values, but I'm unhappy with the poor running time.

              Latency / rThroughput
setup:
  xor eax,eax           ; na
  vpxor xmm0,xmm0,xmm0  ; na       ;all-zero operand for the andn part of ptest
work:
  vptest xmm0,xmm4  ; 3   1    ;CF=1 iff xmm4 is all zero
  adc eax,eax       ; 1   1    ;move first bit into eax
  vptest xmm0,xmm5  ; 3   1    ;CF=1 iff xmm5 is all zero
  adc eax,eax       ; 1   1    ;move consecutive bits into eax

I want to have a bitmap of all the non-zero registers in eax (obviously I can combine multiple bitmaps in multiple registers).

So every test has a latency of 3+1 = 4 cycles.
Some of this can run in parallel by alternating between eax,ecx etc.
But it's still quite slow.
Is there a faster way of doing this?

I need to test 8 xmm/ymm registers in a row: one bit per register, packed into a one-byte bitmap.
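For concreteness, a minimal sketch of the full 8-register sequence under the scheme above (the choice of xmm4..xmm11 as the registers under test is just for illustration; xmm0 stays all-zero throughout):

setup:
  xor eax,eax           ;clear the bitmap, CF=0
  vpxor xmm0,xmm0,xmm0  ;all-zero operand for ptest
work:
  vptest xmm0,xmm4      ;CF=1 iff xmm4 is all zero
  adc eax,eax           ;shift CF into the bitmap
  vptest xmm0,xmm5
  adc eax,eax
  vptest xmm0,xmm6
  adc eax,eax
  vptest xmm0,xmm7
  adc eax,eax
  vptest xmm0,xmm8
  adc eax,eax
  vptest xmm0,xmm9
  adc eax,eax
  vptest xmm0,xmm10
  adc eax,eax
  vptest xmm0,xmm11
  adc eax,eax           ;bit 7 = xmm4 ... bit 0 = xmm11, set bit = register was zero
  xor eax,0xFF          ;flip the low 8 bits so a set bit means non-zero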


Answer 1:


Rather than being "quite slow", your existing approach is actually reasonable.

Sure, each individual test has a latency of 4 cycles1, but if you want the result in a general purpose register you are usually going to pay a 3-cycle latency for that move anyway (e.g., movmskb also has a latency of 3). In any case, you want to test 8 registers, and you don't simply add the latencies because each test is mostly independent, so uop count and port use will likely end up being more important than the latency to test a single register, as most of the latencies will overlap with other work.

An approach that is likely to be a bit faster on Intel hardware is to use successive PCMPEQ instructions to test several vectors, and then fold the results together (e.g., if you use PCMPEQQ you effectively have 2 quadword results per register and need to and-fold them into 1). You can either fold before or after the PCMPEQ, but it would help to know more about how/where you want the results to come up with something better. Here's an untested sketch for 8 registers, xmm1-8, with xmm0 assumed zero and xmm15 being the pblendvb mask that selects alternate bytes, used in the last blend instruction.

# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm1, xmm0
vpcmpeqq xmm12, xmm3, xmm0
vpcmpeqq xmm13, xmm5, xmm0
vpcmpeqq xmm14, xmm7, xmm0

# blend the results down into xmm10   word origin
vpblendw xmm10, xmm11, xmm12, 0xAA   # 3131 3131
vpblendw xmm13, xmm13, xmm14, 0xAA   # 7575 7575
vpblendw xmm10, xmm10, xmm13, 0xCC   # 7531 7531

# test the 2 qwords in each vector against zero
vpcmpeqq xmm11, xmm2, xmm0
vpcmpeqq xmm12, xmm4, xmm0
vpcmpeqq xmm13, xmm6, xmm0
vpcmpeqq xmm14, xmm8, xmm0

# blend the results down into xmm11   word origin
vpblendw xmm11, xmm11, xmm12, 0xAA   # 4242 4242
vpblendw xmm13, xmm13, xmm14, 0xAA   # 8686 8686
vpblendw xmm11, xmm11, xmm13, 0xCC   # 8642 8642

# blend xmm10 and xmm11 together into xmm10, byte-wise
#         origin bytes
# xmm10 77553311 77553311
# xmm11 88664422 88664422
# res   87654321 87654321 
vpblendvb xmm10, xmm10, xmm11, xmm15

# move the mask bits into eax
vpmovmskb eax, xmm10
and al, ah

The intuition is that you test each QWORD in each xmm against zero, giving 16 results for the 8 registers, and then you blend the results together into xmm10, ending up with one result per byte, in order (with all the high-QWORD results before all the low-QWORD results). Then you move those 16 byte masks as 16 bits into eax with movmskb, and finally combine the high and low QWORD bits for each register inside eax.

That looks to me like 16 uops total for 8 registers, so about 2 uops per register. The total latency is reasonable since it is largely a "reduce"-type parallel tree. A limiting factor would be the 6 vpblendw operations, which all go only to port 5 on modern Intel. It would be better to replace 4 of those with VPBLENDD, which is the one "blessed" blend that goes to any of p015. That should be straightforward.
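As a sketch of that swap (my reading, not part of the original: the two 0xCC blends select whole dwords, so they can be rewritten one-for-one; the 0xAA blends mix words within a dword, so moving more of the tree onto vpblendd would need a slightly reordered fold):

# vpblendd runs on any of p015, unlike vpblendw (p5 only); imm 0xA selects
# dwords 1 and 3, the same bytes vpblendw selects with imm 0xCC
vpblendd xmm10, xmm10, xmm13, 0xA    # 7531 7531, as before
vpblendd xmm11, xmm11, xmm13, 0xA    # 8642 8642, as before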

All the ops are simple and fast. The final and al, ah is a partial-register write, but if you mov it afterwards into eax there is perhaps no penalty. You could also do that last line a couple of different ways if that's an issue...
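One of those different ways, as a sketch: replace the final two lines with a shift-and-mask fold through a scratch register (ecx here is an arbitrary choice), which avoids partial-register writes entirely:

vpmovmskb eax, xmm10
mov ecx, eax
shr ecx, 8          # bring the high-qword mask bits down
and eax, ecx        # bit i of al set iff both qwords of register i were zero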

This approach also scales naturally to ymm registers, with slightly different folding in eax at the end.
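A sketch of that ymm ending, assuming the same blend tree run on ymm registers (the blends operate per 128-bit lane, so each lane ends up with the 87654321 byte pattern for one pair of qwords):

vpmovmskb eax, ymm10   # 32 bits: four 8-bit groups, one per qword position
mov ecx, eax
shr ecx, 16
and eax, ecx           # fold the two 128-bit lanes together
mov ecx, eax
shr ecx, 8
and eax, ecx           # fold the qword halves; per-register mask in al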

EDIT

A slightly faster ending uses packed shifts to avoid two expensive instructions:

;combine bytes of xmm10 and xmm11 together into xmm10, byte wise
; xmm10 77553311 77553311
; xmm11 88664422 88664422   before shift
; xmm10 07050301 07050301
; xmm11 80604020 80604020   after shift
;result 87654321 87654321   combined
vpsrlw xmm10,xmm10,8
vpsllw xmm11,xmm11,8
vpor xmm10,xmm10,xmm11

;combine the low and high qwords to make sure both are zero
vpsrldq xmm12,xmm10,8      ;shift the high 8 bytes down (vpsrldq counts bytes)
vpand xmm10,xmm10,xmm12
vpmovmskb eax,xmm10        ;al = one bit per register (1 = all zero), ah = 0

This saves 2 cycles by avoiding the 2-cycle vpblendvb and the partial-register penalty of and al, ah. It also removes the dependency on the slow vpmovmskb if you don't need to use the result of that instruction right away.


1 Actually, it seems to be only on Skylake that PTEST has a latency of three cycles; before that it seems to be 2. I'm also not sure about the 1-cycle latency you listed for rcl eax, 1: according to Agner, it seems to be 3 uops and 2 cycles latency/reciprocal throughput on modern Intel.



Source: https://stackoverflow.com/questions/42317528/faster-way-to-test-if-xmm-ymm-register-is-zero
