SSE vectorization of math 'pow' function gcc

Posted by 戏子无情 on 2019-12-01 16:43:47

Using __restrict or consuming inputs (assigning to local vars) before writing outputs should help.

As it is now, the compiler cannot vectorize because a might alias b, so doing 4 multiplies in parallel and writing back 4 values might not be correct.

(Note that __restrict does not guarantee that the compiler will vectorize the loop; but as the code stands right now, it certainly cannot.)
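To make the two suggestions concrete, here is a minimal sketch. The question's loop is not reproduced here, so the function names, signatures, and the squaring body are assumptions chosen for illustration. The first function uses the C99 restrict qualifier (GCC also accepts the __restrict spelling); the second reads a block of inputs into locals before writing any output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop in the question (names are assumptions).
   restrict promises the compiler that a and b never overlap, so loading,
   multiplying, and storing several elements at once cannot change the result. */
void square_restrict(const float *restrict a, float *restrict b, int n)
{
    assert(a != NULL && b != NULL);
    for (int i = 0; i < n; ++i)
        b[i] = a[i] * a[i];
}

/* Alternative without restrict: consume four inputs into locals first, then
   write four outputs. Within each block, no store can affect a later load. */
void square_locals(const float *a, float *b, int n)
{
    assert(a != NULL && b != NULL);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        float x0 = a[i], x1 = a[i + 1], x2 = a[i + 2], x3 = a[i + 3];
        b[i]     = x0 * x0;
        b[i + 1] = x1 * x1;
        b[i + 2] = x2 * x2;
        b[i + 3] = x3 * x3;
    }
    for (; i < n; ++i)          /* scalar tail for leftover elements */
        b[i] = a[i] * a[i];
}
```

Whether GCC actually emits SSE instructions for either version depends on the compiler version and flags (for example -O3, or -O2 with -ftree-vectorize); on recent GCC releases, -fopt-info-vec should report which loops were vectorized.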

This is not really an answer to your question, but rather a suggestion for how you might be able to avoid this issue entirely.

You mention that you're on OS X; there are already APIs on that platform that provide the operations you're looking at, without any need for auto-vectorization. Is there some reason that you aren't using them instead? Auto-vectorization is really cool, but it requires some work, and in general it doesn't produce results that are as good as using APIs that are already vectorized for you.

#include <string.h>
#include <Accelerate/Accelerate.h>

int main() {

    int n = 256;
    float a[256],
    b[256];

    // You can initialize the elements of a vector to a set value using memset_pattern:
    float threehalves = 1.5f;
    memset_pattern4(a, &threehalves, 4*n);

    // Since you have a fixed exponent for all of the base values, we will use
    // the vImage gamma functions.  If you wanted to have different exponents
    // for each input (i.e. from an array of exponents), you would use the vForce
    // vvpowf( ) function instead (also part of Accelerate).
    //
    // If you don't need full accuracy, replace kvImageGamma_UseGammaValue with
    // kvImageGamma_UseGammaValue_half_precision to get better performance.
    GammaFunction func = vImageCreateGammaFunction(2.3f, kvImageGamma_UseGammaValue, 0);
    vImage_Buffer src = { .data = a, .height = 1, .width = n, .rowBytes = 4*n };
    vImage_Buffer dst = { .data = b, .height = 1, .width = n, .rowBytes = 4*n };
    vImageGamma_PlanarF(&src, &dst, func, 0);
    vImageDestroyGammaFunction(func);

    // To simply square a instead, use the vDSP_vsq function.
    vDSP_vsq(a, 1, b, 1, n);

    return 0;
}

More generally, unless your algorithm is quite simple, auto-vectorization is unlikely to deliver great results. In my experience, the spectrum of vectorization techniques usually looks about like this:

better performance                                            worse performance
more effort                                                         less effort
+------+------+----------------------+----------------------------+-----------+
|      |      |                      |                            |           |
|      |  use vectorized APIs        |                   auto vectorization   |
|  skilled vector C                  |                              scalar code
hand written assembly       unskilled vector C
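For a sense of what the "vector C" points on this spectrum can look like, here is a rough sketch that squares an array using SSE intrinsics. Reading "vector C" as intrinsics code is my interpretation, and the function name and signature are assumptions. It targets x86 only and uses unaligned loads and stores so that it makes no alignment demands on the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

/* Hypothetical helper: b[i] = a[i] * a[i], four floats per iteration. */
void square_sse(const float *a, float *b, int n)
{
    assert(a != NULL && b != NULL);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(a + i);          /* load 4 floats (unaligned ok) */
        _mm_storeu_ps(b + i, _mm_mul_ps(x, x));  /* multiply and store 4 floats */
    }
    for (; i < n; ++i)                           /* scalar tail */
        b[i] = a[i] * a[i];
}
```

This is much closer to the "unskilled" end than the "skilled" end: it does nothing about alignment, unrolling, or prefetching, and a general pow with a non-integer exponent would need considerably more work than a single multiply.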