instrinsic _mm512_round_ps is missing for AVX512

问题

I'm missing the intrinsic _mm512_round_ps for AVX512 (it is only available for KNC). Any idea why this is not available?

What would be a good workaround?

apply _mm256_round_ps to upper and lower half and fuse the results?
use _mm512_add_round_ps with one argument being zero?

Thanks!

回答1:

TL:DR: AVX512F

__m512 nearest_integer = _mm512_roundscale_ps(input_vec, _MM_FROUND_TO_NEAREST_INT|_MM_FROUND_NO_EXC);

related: AVX512DQ _mm512_reduce_pd or _ps will subtract the integer part (and a specified number of leading fraction bits), range-reducing your input to only the fractional part. asm docs for vreducepd have the most detail.

The EVEX prefix allows overriding the default rounding direction {er} and setting suppress-all-exceptions {sae}, for FP instructions. (This is what the ..._round_ps() versions of intrinsics are for.) But it doesn't have a "round to integer" option; you still need a separate asm instruction for that.

vroundps xy, xy/mem, imm8 didn't get upgraded to AVX512. Actually it did: the same opcode has a new mnemonic for the EVEX version, using the high 4 bits of the immediate that are reserved in the SSE and VEX encodings.

vrndscaleps xyz, xyz/mem/m32broadcast, imm8 is available in ss/sd/ps/pd flavours. The high 4 bits of the imm8 specify the number of fraction bits to round to. In these terms, rounding to the nearest integer is rounding to 0 fraction bits. Rounding to nearest 0.5 would be rounding to 1 fraction bit. It's the same as scaling by 2^M, rounding to nearest integer, then scaling back down (done without overflow).

I think the field is unsigned, so you can't use M=-1 to round to an even number. The ISA ref manual doesn't mention signedness, so I'm leaning towards unsigned being the most likely.

The low 4 bits of the field specify the rounding mode like with roundps. As usual, the PD version of the instruction has the diagram (because it's alphabetically first).

With the upper 4 bits = 0, it behaves the same as roundps: they use the same encoding for the low 4 bits. It's not a coincidence that the instructions have the same opcode, just different prefixes.

(I'm curious if SSE or VEX roundpd on an AVX512 CPU would actually scale based on the upper 4 bits; it says they're "reserved" not "ignored". But probably not.)

__m512 _mm512_roundscale_ps( __m512 a, int imm); is the no-frills intrinsic. See Intel's intrinsic finder

The merge-masking + SAE-override version is __m512 _mm512_mask_roundscale_round_ps(__m512 s, __mmask16 k, __m512 a, int imm, int sae);. There's nothing you can do with the sae operand that roundscale can't already do with its imm8, though, so it's a bit pointless.

You can use the _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC and so on constants documented for _mm_round_pd / _mm256_round_pd, to round up, down, or truncate towards zero, or the usual nearest with even-as-tiebreak that's the IEEE default rounding mode. Or _MM_FROUND_CUR_DIRECTION to use whatever the current mode is. _MM_FROUND_NO_EXC suppresses setting the inexact exception bit in the MXCSR.

You might be wondering why vrndscaleps needs any immediate bits to specify rounding direction when you could just use the EVEX prefix to override the rounding direction with vrndscaleps zmm0 {k1}, zmm1, {rz-sae} (Or whatever the right syntax is; NASM doesn't seem to be accepting any of the examples I found.)

The answer is that explicit rounding is only available with 512-bit vectors or with scalars, and only for register operands. (It repurposes 3 EVEX bits used to set vector length (if AVX512VL is supported), and to distinguish between broadcast memory operands vs. vector. EVEX bits are overloaded based on context to pack more functionality into limited space.)

So having the rounding-control in the imm8 makes it possible to do vrndscaleps zmm0{k1}, [rdi]{m32bcst}, imm8 to broadcast a float from memory, round it, and merge that into an existing register according to mask register k1. All in a single instruction which decodes to probably 3 uops on SKX, assuming it's the same as vroundps. (http://agner.org/optimize/).

来源：https://stackoverflow.com/questions/50854991/instrinsic-mm512-round-ps-is-missing-for-avx512

标签

avx512