Why am I getting these assembler errors?

Question

I have a big function that needs to convert floats to integers at one point. Without this conversion the function takes 11-12 ns/loop on my machine. With the conversion it takes about 400 ns/loop.

After some reading I found a way to speed the conversion up using a bit of inline assembly. The first iteration of my function was as follows:

inline int FISTToInt (float f)
{
    int i;
    asm("fld %1;"
        "fistp %0;"
        :"=r" ( i )
        :"r" ( f )
        :
    );
    return i;
}

When I compiled that, I got the following errors:

src/calcRunner.cpp: Assembler messages:
src/calcRunner.cpp:43: Error: operand type mismatch for `fld'
src/calcRunner.cpp:43: Error: operand type mismatch for `fistp'

A bit of thought supplied the answer: I had forgotten the instruction suffixes. So I changed the function to the following:

inline int FISTToInt (float f)
{
    int i;
    asm("flds %1;"
        "fistps %0;"
        :"=r" ( i )
        :"r" ( f )
        :
    );
    return i;
}

However, this did not fix the problem. Instead I get this:

src/calcRunner.cpp: Assembler messages:
src/calcRunner.cpp:43: Error: invalid instruction suffix for `fld'
src/calcRunner.cpp:43: Error: invalid instruction suffix for `fistp'

What is going on?

Answer 1:

This works:

int trunk(float x)
{
    int i;
    __asm__ __volatile__(
    "    flds   %1\n"
    "    fistpl %0\n"
    : "=m"(i) : "m"(x));
    return i; 
}

However, it's only (possibly) faster than the compiler-generated code if you are actually using x87 mode, and in that case it's faster because it doesn't load and store the FP control word that determines the rounding. I will get back with a couple of benchmarks...
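Skipping the control-word switch also changes the result, which is worth seeing in isolation: fistpl converts using the current x87 rounding mode (round-to-nearest by default), whereas a C cast truncates toward zero. A minimal sketch, assuming GCC or Clang; the names convert_current_mode and convert_truncate are made up for illustration, and the lrintf fallback for non-x86 targets is an addition, not part of the code above:

```c
#include <math.h>

/* Converts using the current rounding mode (round-to-nearest-even by default).
   On x86 with GCC/Clang this is the same flds/fistpl pair as in trunk();
   elsewhere it falls back to lrintf, which also honors the rounding mode. */
int convert_current_mode(float x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int i;
    __asm__ __volatile__(
        "flds   %1\n\t"
        "fistpl %0\n\t"
        : "=m"(i)
        : "m"(x));
    return i;
#else
    return (int)lrintf(x);
#endif
}

/* A plain C cast always truncates toward zero. */
int convert_truncate(float x)
{
    return (int)x;
}
```

With the default rounding mode, convert_current_mode(2.7f) returns 3 while convert_truncate(2.7f) returns 2, so the two are only interchangeable if rounding to nearest is acceptable.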

Simple benchmark:

#include <stdio.h>
#include <stdlib.h>

int trunk(float x)
{
    int i;
    __asm__ __volatile__(
    "    flds   %1\n"
    "    fistpl %0\n"
    : "=m"(i) : "m"(x));
    return i; 
}


int trunk2(float x)
{
    return (int)x;
}

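static inline long long rdtsc(void)
{
    /* rdtsc returns the low 32 bits in eax and the high 32 bits in edx;
       it does not modify ebx or ecx, so no clobbers are needed */
    unsigned int a, d;
    __asm__ __volatile__ ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((long long)d << 32);
}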


int main(void)
{
    float f[1000];
    for(int i = 0; i < 1000; i++)
    {
        f[i] = rand() / (i+1);
    }
    long long t = rdtsc();
    int sum = 0;
    for(int i = 0; i < 1000; i++)
    {
        sum = trunk(f[i]);
    }
    t = rdtsc() - t;
    /* t is a long long, so it needs %lld */
    printf("Sum=%d time=%lld\n", sum, t);

    t = rdtsc();
    sum = 0;
    for(int i = 0; i < 1000; i++)
    {
        sum = trunk2(f[i]);
    }
    t = rdtsc() - t;
    printf("Sum=%d time=%lld\n", sum, t);

    return 0;
}

Compiled with gcc -O2 -m64 -std=c99, it produces the following result:

Sum=1143565 time=30196
Sum=1143565 time=15946

In a 32-bit compile (gcc -O2 -m32 -std=c99):

Sum=1143565 time=29847
Sum=1143565 time=107618

In other words, in a 32-bit x87 build the compiler's code is a lot slower. However, if we enable SSE2 (gcc -m32 -msse2 -mfpmath=sse -O2), it gets much better:

Sum=1143565 time=30277
Sum=1143565 time=11789

Note that in each pair the first number is "your solution" (the inline assembly), and the second is the compiler's solution.

Obviously, please do measure on your system, to ensure the results do indeed match up.

Edit: After finding that I should actually add the numbers in the loop, rather than just walk through them putting them in sum, I get the following results for clang:

clang -m32 -msse2 -mfpmath=sse -O2 floatbm.c -std=c99

Sum=625049287 time=30290
Sum=625049287 time=3663

The reason "let the compiler do the job" is so much better here is that Clang 3.5 produces an unrolled loop with proper SSE SIMD for the second loop. It can't do that for the first loop, so each iteration handles one float value.

Just to show that gcc still gives the same result, I reran with gcc:

Sum=625049287 time=31612
Sum=625049287 time=15007

The only difference from before is that I use sum += trunk(f[i]); instead of sum = ....
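For reference, the accumulating form of the cast-based loop can be written as a standalone function. This is only a sketch of the loop shape a compiler is free to unroll and vectorize when SSE is enabled; the name sum_truncated is hypothetical, and whether a given compiler actually vectorizes it depends on the compiler version and flags:

```c
#include <stddef.h>

/* Sums the truncated integer values of n floats using a plain C cast.
   There is no inline assembly here, so the optimizer sees the whole loop. */
int sum_truncated(const float *values, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (int)values[i];
    }
    return sum;
}
```

An inline asm block is opaque to the optimizer, which is why the same loop built around trunk() stays at one conversion per iteration.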

Answer 2:

The x87 load and store instructions take memory operands, not general-purpose registers, so the constraints must be "m" rather than "r". You need this:

inline int FISTToInt (float f) {
    int i;
    asm("flds %1;"
        "fistpl %0;"
        :"=m" ( i )
        :"m" ( f )
        :
    );
    return i;   
}

Note that the s suffix means a 16-bit integer on integer instructions but a 32-bit single (float) on floating-point instructions, and the l suffix means a 32-bit int on integer instructions but a 64-bit double on floating-point instructions.
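To make the suffix rule concrete, here is a small sketch with two other operand-size combinations, assuming GCC or Clang with AT&T syntax. The function names double_to_int and float_to_short are made up for illustration, and the lrint/lrintf fallback for non-x86 targets is an addition so the snippet still compiles there:

```c
#include <math.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))

int double_to_int(double d)
{
    int i;
    __asm__ __volatile__(
        "fldl   %1\n\t"   /* l on a floating-point load: 64-bit double */
        "fistpl %0\n\t"   /* l on an integer store: 32-bit int */
        : "=m"(i)
        : "m"(d));
    return i;
}

short float_to_short(float f)
{
    short s;
    __asm__ __volatile__(
        "flds   %1\n\t"   /* s on a floating-point load: 32-bit single */
        "fistps %0\n\t"   /* s on an integer store: 16-bit integer */
        : "=m"(s)
        : "m"(f));
    return s;
}

#else

/* Portable fallback: lrint/lrintf also round using the current mode. */
int double_to_int(double d) { return (int)lrint(d); }
short float_to_short(float f) { return (short)lrintf(f); }

#endif
```

Both functions use the popping form (fistp), so the x87 register stack is left balanced after each call.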


Answer 3:

If you can do it faster than your compiler, throw that one as far as you possibly can, and get a decent one.

And please tell us here, so nobody else will even think of using it in earnest.

Source: https://stackoverflow.com/questions/22392413/why-am-i-getting-these-assembler-errors

Tags

c++

gcc

inline-assembly