Getting GCC to generate a PTEST instruction when using vector extensions

Posted by 拟墨画扇 on 2019-12-05 08:07:43

GCC 4.9.2 with -O3 -mavx2 (in 64-bit mode) didn't realize it could use ptest for this, with either || or |.

The | version extracts the vector elements with vmovd and vpextrd, then combines them with seven or instructions between 32-bit registers. So it's pretty bad, and it doesn't take advantage of any simplification that would still produce the same logical truth value.

The || version is just as bad: it does the same element-at-a-time extraction, but with a test / jne after every element.
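The question's code is not reproduced here, so the following is only a guess at the two forms being compared, written with GCC vector extensions. The function names any_set_bitor and any_set_logor are made up for illustration; the v8ui typedef is the one used in the code further down.

```c
#include <stdint.h>

/* Eight 32-bit lanes in one 256-bit GCC/Clang vector-extension type. */
typedef uint32_t v8ui __attribute__ ((vector_size (32)));

/* Hypothetical "|" form: OR all eight lanes together, then test once. */
int any_set_bitor(const v8ui *p)
{
    v8ui v = *p;
    return (v[0] | v[1] | v[2] | v[3] | v[4] | v[5] | v[6] | v[7]) != 0;
}

/* Hypothetical "||" form: short-circuit, one test per lane in the source. */
int any_set_logor(const v8ui *p)
{
    v8ui v = *p;
    return v[0] || v[1] || v[2] || v[3] || v[4] || v[5] || v[6] || v[7];
}
```

Both functions return 1 exactly when some bit of the vector is set, which is the condition a single vptest of the vector against itself could answer. Compiling with gcc -O3 -mavx2 -S and reading the assembly shows what a given GCC version actually makes of them.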

So at this point, you can't count on GCC recognizing tests like this and doing anything remotely efficient. (pcmpeq / movmsk / test is another sequence that wouldn't be bad, but gcc doesn't generate that either.)
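For comparison, here is a minimal sketch of the two instruction sequences mentioned above, written by hand with intrinsics. The names all_zero_ptest, all_zero_movemask and all_zero are hypothetical. The target("avx2") attribute and the __builtin_cpu_supports check are my own additions so that the sketch builds without -mavx2 and falls back to a scalar loop on CPUs without AVX2; they are not part of the original question.

```c
#include <immintrin.h>
#include <stdint.h>

/* vptest: _mm256_testz_si256(v, v) returns 1 when (v AND v) == 0,
   i.e. when every bit of v is clear. */
static __attribute__((target("avx2")))
int all_zero_ptest(const uint32_t *p)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)p);
    return _mm256_testz_si256(v, v);
}

/* vpcmpeqd / vpmovmskb / compare: each lane equal to zero becomes all-ones,
   so all 32 mask bits are set (-1 as an int) only when every lane is zero. */
static __attribute__((target("avx2")))
int all_zero_movemask(const uint32_t *p)
{
    __m256i v  = _mm256_loadu_si256((const __m256i *)p);
    __m256i eq = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
    return _mm256_movemask_epi8(eq) == -1;
}

/* Hypothetical wrapper: picks one of the two sequences, or a plain
   scalar loop when the CPU does not report AVX2 support. */
int all_zero(const uint32_t p[8], int use_ptest)
{
    if (!__builtin_cpu_supports("avx2")) {
        uint32_t acc = 0;
        for (int i = 0; i < 8; i++)
            acc |= p[i];
        return acc == 0;
    }
    return use_ptest ? all_zero_ptest(p) : all_zero_movemask(p);
}
```

The ptest form needs no zero constant and no mask register, which is why it is the sequence one would hope the compiler finds on its own.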

Wouldn't vptest help? If you are looking at performance, you may be surprised by how well plain native code does. Here is some code that checks a vector against all ones twice: once with vanilla memcmp(), and once with the vptest instruction (via the corresponding intrinsic, _mm256_testz_si256). I have not timed either function.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <immintrin.h>

typedef uint32_t v8ui __attribute__ ((vector_size (32)));

v8ui*
foo1(v8ui *mem)
{   
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };

    if (memcmp(mem, &v, sizeof (v8ui)) == 0) {
            printf("Ones\n");
    } else {
            printf("NOT Ones\n");
    }

    return mem;
}

v8ui*
foo2(v8ui *mem)
{   
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
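    __m256i a, b, diff;

    a = _mm256_loadu_si256((__m256i *)(&v));
    b = _mm256_loadu_si256((__m256i *)mem);   /* load the data, not the pointer */

    /* testz(x, x) is 1 only when x is all zero, so XOR first to test equality */
    diff = _mm256_xor_si256(a, b);
    if (_mm256_testz_si256(diff, diff)) {
            printf("Ones\n");
    } else {
            printf("NOT Ones\n");
    }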

    return mem;
}

int
main()
{
    v8ui v = (v8ui){ 1, 1, 1, 1, 1, 1, 1, 1 };
    foo1(&v);
    foo2(&v);
}

Compile flags:

gcc -mavx2 foo.c

Doh! Only now did I see that you wanted to get GCC to generate the vptest instruction without using the intrinsics. I'll leave the code around anyway.

If the compiler isn't smart enough to produce an optimisation automatically, you have three options:

  • Get a new compiler.
  • Produce the optimisation manually (e.g. using intrinsics, as in your test and the other answer).
  • Modify the compiler to produce the optimisation automatically.

You've pretty much excluded the first option by using gcc extensions, though llvm/clang might support these extensions too.

You've excluded the second option quite blatantly.

The third option seems like your best option to me. gcc is open source, so you can make (and commit) your own changes to it. If you can modify gcc to produce this optimisation automatically (ideally from 100% standard C), then you'll not only achieve your goal of producing this optimisation without introducing crud into your program, but you'll also save countless manual optimisations (especially the non-standard ones that lock you into using a particular compiler) in the future.
