Looping over arrays with inline assembly

自闭症网瘾萝莉.ら 提交于 2019-11-26 16:45:51

Avoid inline asm whenever possible: https://gcc.gnu.org/wiki/DontUseInlineAsm. It blocks many optimizations. But if you really can't hand-hold the compiler into making the asm you want, you should probably write your whole loop in asm so you can unroll and tweak it manually, instead of doing stuff like this.


You can use an r constraint for the index. Use the q modifier to get the name of the 64bit register, so you can use it in an addressing mode. When compiled for 32bit targets, the q modifier selects the name of the 32bit register, so the same code still works.

If you want to choose what kind of addressing mode is used, you'll need to do it yourself, using pointer operands with r constraints.

GNU C inline asm syntax doesn't assume that you read or write memory pointed to by pointer operands. (e.g. maybe you're using an inline-asm and on the pointer value). So you need to do something with either a "memory" clobber or memory input/output operands to let it know what memory you modify. A "memory" clobber is easy, but forces everything except locals to be spilled/reloaded. See the Clobbers section in the docs for an example of using a dummy input operand.

Specifically, a "m" (*(const float (*)[]) fptr) will tell the compiler that the entire array object is an input, arbitrary-length. i.e. the asm can't reorder with any stores that use fptr as part of the address (or that use the array it's known to point into). Also works with an "=m" or "+m" constraint (without the const, obviously).

Using a specific size like "m" (*(const float (*)[4]) fptr) lets you tell the compiler what you do/don't read. (Or write). Then it can (if otherwise permitted) sink a store to a later element past the asm statement, and combine it with another store (or do dead-store elimination) of any stores that your inline asm doesn't read.


Another huge benefit to a m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourself prevents the compiler from doing a single increment every 4 iterations or something, because every source-level value of i needs to appear in a register.


Here's my version, with some tweaks as noted in comments.

#include <immintrin.h>
void add_asm1_memclobber(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
            : "memory"
          // you can avoid a "memory" clobber with dummy input/output operands
        );
    }
}

Godbolt compiler explorer asm output for this and a couple versions below.

Your version needs to declare %xmm0 as clobbered, or you will have a bad time when this is inlined. My version uses a temporary variable as an output-only operand that's never used. This gives the compiler full freedom for register allocation.

If you want to avoid the "memory" clobber, you can use dummy memory input/output operands like "m" (*(const __m128*)&x[i]) to tell the compiler which memory is read and written by your function. This is necessary to ensure correct code-generation if you did something like x[4] = 1.0; right before running that loop. (And even if you didn't write something that simple, inlining and constant propagation can boil it down to that.) And also to make sure the compiler doesn't read from z[] before the loop runs.

In this case, we get horrible results: gcc5.x actually increments 3 extra pointers because it decides to use [reg] addressing modes instead of indexed. It doesn't know that the inline asm never actually references those memory operands using the addressing mode created by the constraint!

# gcc5.4 with dummy constraints like "=m" (*(__m128*)&z[i]) instead of "memory" clobber
.L11:
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl    $4, %eax        #, i
    addq    $16, %r10       #, ivtmp.19
    addq    $16, %r9        #, ivtmp.21
    addq    $16, %r8        #, ivtmp.22
    cmpl    %eax, %ecx      # i, n
    ja      .L11        #,

r8, r9, and r10 are the extra pointers that the inline asm block doesn't use.

You can use a constraint that tells gcc an entire array of arbitrary length is an input or an output: "m" (*(const struct {char a; char x[];} *) pStr) from @David Wohlferd's answer on an asm strlen. Since we want to use indexed addressing modes, we will have the base address of all three arrays in registers, and this form of constraint asks for the base address as an operand, rather than a pointer to the current memory being operated on.

This actually works without any extra counter increments inside the loop:

void add_asm1_dummy_whole_array(const float *restrict x, const float *restrict y,
                             float *restrict z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
             , "=m" (*(struct {float a; float x[];} *) z)
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
              , "m" (*(const struct {float a; float x[];} *) x),
                "m" (*(const struct {float a; float x[];} *) y)
        );
    }
}

This gives us the same inner loop we got with a "memory" clobber:

.L19:   # with clobbers like "m" (*(const struct {float a; float x[];} *) y)
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl    $4, %eax        #, i
    cmpl    %eax, %ecx      # i, n
    ja      .L19        #,

It tells the compiler that each asm block reads or writes the entire arrays, so it may unnecessarily stop it from interleaving with other code (e.g. after fully unrolling with low iteration count). It doesn't stop unrolling, but the requirement to have each index value in a register does make it less effective.


A version with m constraints, that gcc can unroll:

#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
           // "movaps   %[yi], %[vectmp]\n\t"
            "addps    %[xi], %[vectmp]\n\t"  // We requested that the %[yi] input be in the same register as the [vectmp] dummy output
            "movaps   %[vectmp], %[zi]\n\t"
          // ugly ugly type-punning casts; __m128 is a may_alias type so it's safe.
            : [vectmp] "=x" (vectmp), [zi] "=m" (*(__m128*)&z[i])
            : [yi] "0"  (*(__m128*)&y[i])  // or [yi] "xm" (*(__m128*)&y[i]), and uncomment the movaps load
            , [xi] "xm" (*(__m128*)&x[i])
            :  // memory clobber not needed
        );
    }
}

Using [yi] as a +x input/output operand would be simpler, but writing it this way makes a smaller change for uncommenting the load in the inline asm, instead of letting the compiler get one value into registers for us.

When I compile your add_asm2 code with gcc (4.9.2) I get:

add_asm2:
.LFB0:
        .cfi_startproc
        xorl        %eax, %eax
        xorl        %r8d, %r8d
        testl       %ecx, %ecx
        je  .L1
        .p2align 4,,10
        .p2align 3
.L5:
#APP
# 3 "add_asm2.c" 1
        movaps   (%rsi,%rax), %xmm0
addps    (%rdi,%rax), %xmm0
movaps   %xmm0, (%rdx,%rax)

# 0 "" 2
#NO_APP
        addl        $4, %r8d
        addq        $16, %rax
        cmpl        %r8d, %ecx
        ja  .L5
.L1:
        rep; ret
        .cfi_endproc

so it is not perfect (it uses a redundant register), but does use indexed loads...

gcc also has builtin vector extensions which are even cross platform:

typedef float v4sf __attribute__((vector_size(16)));
void add_vector(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i+=1) {
        *(v4sf*)(z + 4*i) = *(v4sf*)(x + 4*i) + *(v4sf*)(y + 4*i);
    }
}

On my gcc version 4.7.2 the generated assembly is:

.L28:
        movaps  (%rdi,%rax), %xmm0
        addps   (%rsi,%rax), %xmm0
        movaps  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %rcx, %rax
        jne     .L28
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!