avx512 | 易学教程

In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?


Question: While trying to answer "Embedded broadcasts with intrinsics and assembly", I was trying to do something like this:

    __m512 mul_broad(__m512 a, float b) {
        int scratch = 0;
        asm(
            "vbroadcastss %k[scalar], %q[scalar]\n\t"  // want vbr.. %xmm0, %zmm0
            "vmulps %q[scalar], %[vec], %[vec]\n\t"
            // how it's done for integer registers
            "movw symbol(%q[inttmp]), %w[inttmp]\n\t"  // movw symbol(%rax), %ax
            "movsbl %h[inttmp], %k[inttmp]\n\t"        // movsx %ah, %eax
            : [vec] "+x" (a), [scalar] "+x" (b), [inttmp] "=r"
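To make the integer-register half of the snippet concrete, here is a minimal sketch of the `%h` and `%k` operand modifiers in isolation. The function name `sign_extend_second_byte` is hypothetical, and the `"Q"` constraint is my choice rather than the question's: it restricts both operands to the a/b/c/d registers, because a high-byte register such as `%ah` cannot be combined with registers that need a REX prefix. On non-x86 targets the sketch falls back to plain C++. For the vector case the question asks about, GCC's x86 operand-modifier documentation lists `x`, `t` and `g` as printing the xmm, ymm and zmm names of a register, though that is not exercised here.

```cpp
#include <cstdint>

// Sign-extend bits 8..15 of x to a 32-bit integer.
std::int32_t sign_extend_second_byte(std::uint32_t x) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    std::int32_t result;
    // %h[src] prints the high-byte name (e.g. %ah) of the register holding x;
    // %k[dst] prints the 32-bit name (e.g. %eax) of the destination register.
    asm("movsbl %h[src], %k[dst]"
        : [dst] "=Q"(result)
        : [src] "Q"(x));
    return result;
#else
    // Portable fallback with the same result.
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(x >> 8));
#endif
}
```

Calling `sign_extend_second_byte(0x8000u)` takes the byte 0x80 from bits 8..15 and sign-extends it to -128.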

Which versions of Windows support/require which CPU multimedia extensions? [closed]


Question: So far I have managed to find out that:

- SSE and SSE2 are mandatory for Windows 8 and later (and of course for any 64-bit OS)
- AVX is only supported by Windows 7 SP1 or later

Are there any caveats regarding using SSE3, SSSE3, SSE4.1, SSE4.2, AVX2 and AVX-512 on Windows? Some clarification: I need this to

How to convert a number to hex?


Question: Given a number in a register (a binary integer), how do you convert it to a string of hexadecimal ASCII digits? Digits can be stored in memory or printed on the fly, but storing in memory and printing all at once is usually more efficient. (You can modify a loop that stores to instead print one at a time.)

Can we efficiently handle all the nibbles in parallel with SIMD? (SSE2 or later?)

Answer 1: 16 is a power of 2. Unlike decimal (How do I print an integer in Assembly Level Programming without printf
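Because 16 is a power of 2, each hex digit corresponds to exactly one 4-bit nibble, so the conversion needs only shifts and masks, with no division. The following is a minimal scalar sketch in C++ rather than the assembly or SIMD version the question asks about. The function name `to_hex`, the fixed 32-bit width and the lowercase digits are assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>

// Convert a 32-bit integer to 8 lowercase hex digits, most significant first.
std::string to_hex(std::uint32_t value) {
    static const char digits[] = "0123456789abcdef";
    std::string out(8, '0');
    for (int i = 7; i >= 0; --i) {
        out[i] = digits[value & 0xF];  // low nibble -> one ASCII digit
        value >>= 4;                   // bring the next nibble into the low bits
    }
    return out;
}
```

The whole string is built in memory first, matching the question's point that storing all digits and printing once is usually more efficient than printing one digit at a time.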

Per-element atomicity of vector load/store and gather/scatter?


Question: Consider an array like atomic<int32_t> shared_array[]. What if you want to SIMD-vectorize for(...) sum += shared_array[i].load(memory_order_relaxed)? Or to search an array for the first non-zero element, or zero a range of it? It's probably rare, but consider any use case where tearing within an element is not allowed, but reordering between elements is fine. (Perhaps a search to find a candidate for a CAS.)

I think x86 aligned vector loads/stores would be safe in practice to use on for
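For reference, the scalar loops the question wants to vectorize can be sketched as below. This is only the baseline: each element is read with its own relaxed atomic load, so no element can tear, while ordering between elements is left unconstrained. The function names `relaxed_sum` and `first_nonzero` and the 64-bit accumulator are assumptions for illustration. The sketch says nothing about whether a vector load is safe to use in their place.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Sum all elements, reading each one with a separate relaxed atomic load.
std::int64_t relaxed_sum(const std::atomic<std::int32_t>* data, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i].load(std::memory_order_relaxed);
    }
    return sum;
}

// Return the index of the first non-zero element, or n if every load saw zero.
std::size_t first_nonzero(const std::atomic<std::int32_t>* data, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (data[i].load(std::memory_order_relaxed) != 0) {
            return i;
        }
    }
    return n;
}
```

A result from `first_nonzero` is only a candidate, since another thread may change the element afterwards. That fits the question's example of searching for a slot to attempt a CAS on.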