avx512

In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?

守給你的承諾、 submitted on 2019-11-26 17:33:39
Question: While trying to answer "Embedded broadcasts with intrinsics and assembly", I was trying to do something like this:
    __m512 mul_broad(__m512 a, float b) {
        int scratch = 0;
        asm(
            "vbroadcastss %k[scalar], %q[scalar]\n\t"  // want vbr.. %xmm0, %zmm0
            "vmulps %q[scalar], %[vec], %[vec]\n\t"
            // how it's done for integer registers
            "movw symbol(%q[inttmp]), %w[inttmp]\n\t"  // movw symbol(%rax), %ax
            "movsbl %h[inttmp], %k[inttmp]\n\t"  // movsx %ah, %eax
            : [vec] "+x" (a), [scalar] "+x" (b), [inttmp] "=r"

Which versions of Windows support/require which CPU multimedia extensions? [closed]

試著忘記壹切 submitted on 2019-11-26 17:15:45
Question: So far I have managed to find out that: SSE and SSE2 are mandatory for Windows 8 and later (and of course for any 64-bit OS); AVX is only supported by Windows 7 SP1 or later. Are there any caveats regarding using SSE3, SSSE3, SSE4.1, SSE4.2, AVX2 and AVX-512 on Windows? Some clarification: I need this to

How to convert a number to hex?

被刻印的时光 ゝ submitted on 2019-11-26 12:31:21
Question: Given a number in a register (a binary integer), how do you convert it to a string of hexadecimal ASCII digits? Digits can be stored in memory or printed on the fly, but storing them in memory and printing all at once is usually more efficient. (You can modify a loop that stores digits so that it prints them one at a time instead.) Can we efficiently handle all the nibbles in parallel with SIMD (SSE2 or later)?
Answer 1: 16 is a power of 2. Unlike decimal (How do I print an integer in Assembly Level Programming without printf
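Because 16 is a power of 2, each hex digit corresponds to exactly one 4-bit nibble, so the conversion needs only shifts and masks, with no division. Below is a minimal scalar sketch in C rather than assembly. The function name `u32_to_hex` and the lookup-table approach are illustrative assumptions, not taken from the original answer, and it does not attempt the SIMD variant the question asks about.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the 8 hex digits of `value`, most significant
   nibble first, into `out`, followed by a terminating NUL.
   `out` must have room for at least 9 bytes. */
static void u32_to_hex(uint32_t value, char out[9])
{
    static const char digits[] = "0123456789abcdef";

    for (size_t i = 0; i < 8; i++) {
        /* Nibble i (counting from the top) sits at bit offset 28 - 4*i. */
        unsigned shift = 28u - 4u * (unsigned)i;
        out[i] = digits[(value >> shift) & 0xFu];
    }
    out[8] = '\0';
}
```

Storing all eight digits into a buffer and printing it once matches the question's remark that storing to memory and printing all at once is usually more efficient than printing digit by digit.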

Per-element atomicity of vector load/store and gather/scatter?

放肆的年华 submitted on 2019-11-26 11:27:25
Question: Consider an array like atomic<int32_t> shared_array[]. What if you want to SIMD vectorize for(...) sum += shared_array[i].load(memory_order_relaxed)? Or to search an array for the first non-zero element, or zero a range of it? It's probably rare, but consider any use-case where tearing within an element is not allowed, but reordering between elements is fine. (Perhaps a search to find a candidate for a CAS). I think x86 aligned vector loads/stores would be safe in practice to use on for
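For reference, the scalar loops the question wants to vectorize can be sketched in C11 as follows. This uses `<stdatomic.h>` instead of the C++ `atomic<int32_t>` from the question, and the function names `sum_relaxed` and `first_nonzero` are hypothetical. Each element is read with its own relaxed atomic load, so no element can tear, while nothing is guaranteed about ordering between elements. The sketch makes no claim about whether a vector load preserves that per-element guarantee.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar baseline: one relaxed atomic load per element.
   The sum is widened to 64 bits so it cannot overflow for realistic n. */
static int64_t sum_relaxed(_Atomic int32_t *arr, size_t n)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += atomic_load_explicit(&arr[i], memory_order_relaxed);
    return sum;
}

/* Return the index of the first element observed as non-zero,
   or n if every element read as zero. */
static size_t first_nonzero(_Atomic int32_t *arr, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (atomic_load_explicit(&arr[i], memory_order_relaxed) != 0)
            return i;
    }
    return n;
}
```

A result from `first_nonzero` is only a snapshot: another thread may change the element afterwards, which is why the question suggests following such a search with a CAS on the candidate.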