How to load an AVX-512 zmm register from an ioremap() address?

Submitted by 跟風遠走 on 2020-04-16 02:58:10

Question


My goal is to create a PCIe transaction with a payload wider than 64 bits. For that I need to read from an ioremap() address.

For 128-bit and 256-bit reads I can use xmm and ymm registers respectively, and that works as expected.

Now I'd like to do the same with 512-bit zmm registers (using them as memory-like storage?!).

Code under a license that I'm not allowed to show here uses assembly for the 256-bit case:

void __iomem *addr;
uint8_t datareg[32];
[...]
// Read from the IO address into ymm1 (256 bits at once):
asm volatile("vmovdqa %0,%%ymm1" : : "m"(*(volatile uint8_t * __force) addr));
// Copy ymm1 to the stack buffer (so the data can be used by gcc-generated code):
asm volatile("vmovdqa %%ymm1,%0" :"=m"(datareg): :"memory");

This is to be used in a kernel module compiled with EXTRA_CFLAGS += -mavx2 -mavx512f to enable AVX-512 support. Edit: that is, to check at compile time whether __AVX512F__ and __AVX2__ are defined (see the sketch below).
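A minimal sketch of such a compile-time check, using the macros GCC predefines when those flags are in effect:

/* Sketch: fail the build if the module wasn't compiled with the flags above. */
#if !defined(__AVX2__) || !defined(__AVX512F__)
#error "build with EXTRA_CFLAGS += -mavx2 -mavx512f"
#endif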

  1. Why does this example use ymm1 and not one of the other registers (ymm0, ymm2, ..., ymm15)?
  2. How can I read an address to a 512b zmm register?
  3. How can I be sure the register won't be overwritten between the two asm lines?

Simply replacing ymm with zmm makes gcc report Error: operand size mismatch for `vmovdqa'.

If that code isn't correct or isn't best practice, let's sort that out first, since I've only just started digging into this.


Answer 1:


You need vmovdqa32 because AVX512 has per-element masking; all instructions need a SIMD element size. See below for a version that should be safe. You would have seen this if you read the manual for vmovdqa; vmovdqa32 for ZMM is documented in the same entry.


(3): Kernel code is compiled with SSE/AVX disabled so the compiler won't ever generate instructions that touch xmm/ymm/zmm registers. (For most kernels, e.g. Linux). That's what makes this code "safe" from having the register modified between asm statements. It's still a bad idea to make them separate statements for this use-case though, despite the fact that Linux md-raid code does that. OTOH letting the compiler schedule some other instructions between the store and load is not a bad thing.

Ordering between asm statements is provided by them both being volatile - compilers can't reorder volatile operations with other volatile operations, only with plain operations.

In Linux for example, it's only safe to use FP / SIMD instructions between calls to kernel_fpu_begin() and kernel_fpu_end() (which are slow: begin saves the whole SIMD state on the spot, and end restores it or at least marks it as needing to happen before return to user-space). If you get this wrong, your code will silently corrupt user-space vector registers!!
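A minimal sketch of that bracketing (assuming a recent x86 Linux kernel, where kernel_fpu_begin()/kernel_fpu_end() come from <asm/fpu/api.h>; read_zmm_from_io() is a hypothetical wrapper around the inline-asm copy shown further down):

#include <asm/fpu/api.h>   /* kernel_fpu_begin() / kernel_fpu_end() on x86 */

/* hypothetical helper: the 64-byte inline-asm copy shown later in this answer */
void read_zmm_from_io(void *dst, const volatile void __iomem *src);

static void grab_block(void *dst, const volatile void __iomem *src)
{
    kernel_fpu_begin();          /* save the current task's FPU/SIMD state */
    read_zmm_from_io(dst, src);  /* all xmm/ymm/zmm use must happen in here */
    kernel_fpu_end();            /* restore (or schedule restoring) that state */
}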

This is to be used in a kernel module compiled with EXTRA_CFLAGS += -mavx2 -mavx512f to support AVX-512.

You must not do that. Letting the compiler emit its own AVX / AVX512 instructions in kernel code could be disastrous because you can't stop it from trashing a vector reg before kernel_fpu_begin(). Only use vector regs via inline asm.


Also note that using ZMM registers at all temporarily reduces max turbo clock speed for that core (or on a "client" chip, for all cores because their clock speeds are locked together). See SIMD instructions lowering CPU frequency

I'd like to use 512b zmm* registers as memory-like storage.

With fast L1d cache and store-forwarding, are you sure you'd even gain anything from using ZMM registers as fast "memory like" (thread-local) storage? Especially when you can only get data out of SIMD registers and back into integer regs via store/reload from an array (or more inline asm to shuffle...). A few places in Linux (like md RAID5/RAID6) use SIMD ALU instructions for block XOR or raid6 parity, and there it is worth the overhead of kernel_fpu_begin(). But if you're just loading / storing to use ZMM / YMM state as storage that can't cache-miss, not looping over big buffers, it's probably not worth it.

(Edit: turns out you actually want to use 64-byte copies to generate PCIe transactions, which is a totally separate use-case than keeping data around in registers long-term.)


If you just wanted to copy 64 bytes with a one-instruction load

Like you apparently actually do, to get a 64-byte PCIe transaction.

It would be better to make this a single asm statement, because otherwise the only connection between the two asm statements is the ordering forced by both being asm volatile. (If you were doing this with AVX instructions enabled for the compiler's use, you'd simply use intrinsics, not "=x" / "x" outputs / inputs to connect separate asm statements.)
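For reference, outside the kernel (where the compiler is allowed to emit AVX itself) a 32-byte copy would just be intrinsics, roughly like this sketch:

#include <immintrin.h>

/* User-space sketch only: do NOT enable AVX code-gen inside a kernel module. */
static void copy32(void *dst, const void *src)   /* src must be 32-byte aligned */
{
    __m256i v = _mm256_load_si256((const __m256i *)src);  /* 32-byte aligned load */
    _mm256_storeu_si256((__m256i *)dst, v);                /* 32-byte store, any alignment */
}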

Why did the example choose ymm1? It's as good as any other choice from ymm0..7, which allow a 2-byte VEX prefix (ymm8..15 can need extra code size on those instructions). With AVX code-gen disabled, there's no way to ask the compiler to pick a convenient register for you with a dummy output operand.

uint8_t datareg[32]; is broken; it needs to be alignas(32) uint8_t datareg[32]; to ensure that a vmovdqa store won't fault.

The "memory" clobber on the output is useless; the whole array is already an output operand because you named an array variable as the output, not just a pointer. (In fact, casting to pointer-to-array is how you tell the compiler that a plain dereferenced-pointer input or output is actually wider, e.g. for asm that contains loops or in this case for asm that uses SIMD when we can't tell the compiler about the vectors. How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

The asm statement is volatile, so it won't be optimized away to reuse the same output. The only C object touched by the asm statement is the array object, which is an output operand, so the compiler already knows about that effect.
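Putting those points together for the original 256-bit case, a single-statement version could look like this sketch (same pointer-to-array trick as the ZMM version below; __force stands in for the kernel's sparse annotation, and the function name is made up):

#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define __force   /* stand-in for the kernel's sparse annotation */

static void read_256(void *dst, void *addr)   /* addr must be 32-byte aligned */
{
    alignas(32) uint8_t datareg[32];     /* aligned so the vmovdqa store can't fault */

    asm volatile(
        "vmovdqa %1, %%ymm1\n\t"         /* 32-byte aligned load from the IO mapping */
        "vmovdqa %%ymm1, %0"             /* 32-byte aligned store to the local buffer */
        : "=m"(datareg)                                        /* the whole array is the output */
        : "m"(*(const volatile uint8_t (* __force)[32]) addr)  /* the whole 32 bytes are the input */
    );

    memcpy(dst, datareg, 32);            /* hand the data to compiler-generated code */
}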


AVX512 version:

AVX512 has per-element masking as part of any instruction, including loads/stores. That means there's vmovdqa32 and vmovdqa64 for different masking granularity. (And vmovdqu8/16/32/64 if you include AVX512BW). FP versions of instructions already have ps or pd baked in to the mnemonic so the mnemonic stays the same for ZMM vectors there. You'd see this right away if you looked at compiler-generated asm for an auto-vectorized loop with 512-bit vectors, or intrinsics.

This should be safe:

#include <stdalign.h>
#include <stdint.h>
#include <string.h>

#define __force 
int foo (void *addr) {
    alignas(16) uint8_t datareg[64];   // 16-byte alignment doesn't cost any extra code.
      // if you're only doing one load per function call
      // maybe not worth the couple extra instructions to align by 64

    asm volatile (
      "vmovdqa32  %1, %%zmm16\n\t"   // aligned
      "vmovdqu32  %%zmm16, %0"       // maybe unaligned; could increase latency but prob. doesn't hurt throughput much compared to an IO read.
        : "=m"(datareg)
        : "m" (*(volatile const char (* __force)[64]) addr)  // the whole 64 bytes are an input
     : // "memory"  not needed, except for ordering wrt. non-volatile accesses to other memory
    );

    int retval;
    memcpy(&retval, datareg+8, 4);  // memcpy can inline as long as the kernel doesn't use -fno-builtin
                    // but IIRC Linux uses -fno-strict-aliasing so you could use cast to (int*)
    return retval;
}

Compiles on the Godbolt compiler explorer with gcc -O3 -mno-sse to

foo:
        vmovdqa32  (%rdi), %zmm16
        vmovdqu32  %zmm16, -72(%rsp)
        movl    -64(%rsp), %eax
        ret

I don't know how your __force is defined; it might go in front of addr instead of as the array-pointer type. Or maybe it goes as part of the volatile const char array element type. Again, see How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more about that input cast.

Since you're reading IO memory, asm volatile is necessary; another read of the same address could read a different value. Same if you were reading memory that another CPU core could have modified asynchronously.

Otherwise I think asm volatile isn't necessary, if you're willing to let the compiler optimize away repeats of the same copy.


A "memory" clobber also isn't necessary: we tell the compiler about the full width of both the input and the output, so it has a full picture of what's going on.

If you need ordering wrt. other non-volatile memory accesses, you could use a "memory" clobber for that. But asm volatile is ordered wrt. dereferences of volatile pointers, including READ_ONCE and WRITE_ONCE which you should be using for any lock-free inter-thread communication (assuming this is the Linux kernel).


zmm16..31 don't need a vzeroupper to avoid performance problems, and EVEX encodings are always the same length (so there's no code-size reason to prefer low register numbers).

I only aligned the output buffer by 16 bytes. If there's an actual function call that doesn't get inlined for each 64-byte load, overhead of aligning RSP by 64 might be more than the cost of a cache-line-split store 3/4 of the time. Store-forwarding I think still works efficiently from that wide store to narrow reloads of chunks of that buffer on Skylake-X family CPUs.

If you're reading into a larger buffer, use that for output instead of bouncing through a 64-byte tmp array.


There are probably other ways to generate wider PCIe read transactions; if the memory is in a WC region then 4x movntdqa loads from the same aligned 64-byte block should work, too. Or 2x vmovntdqa ymm loads; I'd recommend that to avoid turbo penalties.
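A sketch of the 2x vmovntdqa variant (assuming the BAR is mapped write-combining, e.g. with ioremap_wc(), that addr and dst are aligned as noted, that this runs between kernel_fpu_begin()/kernel_fpu_end(), and that the function name is made up):

static void read_64_wc(void *dst, const volatile void __iomem *addr)
{
    /* Two 32-byte NT loads from the same aligned 64-byte block of a WC
     * mapping, then two ordinary stores into dst (assumed 32-byte aligned).
     * "r" pointer operands are used because of the +32 offsets, so a
     * "memory" clobber tells the compiler about the hidden accesses. */
    asm volatile(
        "vmovntdqa   (%[src]), %%ymm1\n\t"
        "vmovntdqa 32(%[src]), %%ymm2\n\t"
        "vmovdqa   %%ymm1,   (%[dst])\n\t"
        "vmovdqa   %%ymm2, 32(%[dst])"
        :
        : [src] "r" (addr), [dst] "r" (dst)
        : "memory"
    );
}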



Source: https://stackoverflow.com/questions/60699914/how-to-load-a-avx-512-zmm-register-from-a-ioremap-address
