Is _mm256_store_ps() atomic when used alongside OpenMP?

最后都变了 - submitted on 2019-12-06 12:26:42

The documentation for _mm256_store_ps says:

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

You can use _mm256_storeu_ps instead for unaligned stores of float data (_mm256_storeu_si256 is the integer counterpart).


A better option is to align all your arrays on a 32-byte boundary (the width of a 256-bit AVX register) and use aligned loads and stores. Unaligned loads and stores that cross a cache-line boundary incur a performance penalty.
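
To make the alignment requirement concrete, here is a minimal sketch that checks whether an address is suitable for an aligned 256-bit access and whether a 32-byte access starting there would straddle two cache lines. The helper names is_aligned and crosses_cache_line are illustrative and not part of any library, and the 64-byte cache-line size is an assumption (typical for current x86 CPUs).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` (must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: true if an access of `size` bytes starting at p
// touches two different cache lines. Assumes a 64-byte line by default.
inline bool crosses_cache_line(const void* p, std::size_t size, std::size_t line = 64) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return size != 0 && (addr / line) != ((addr + size - 1) / line);
}
```

A 32-byte access that starts on a 32-byte boundary can never cross a 64-byte line, which is why aligning the arrays avoids the penalty altogether.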

Use std::aligned_alloc (or C11 aligned_alloc, memalign, posix_memalign, whatever you have available) instead of malloc(size), e.g.:

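#include <cstdlib>      // std::aligned_alloc, std::free (C++17)
#include <immintrin.h>  // __m256

float* allocate_aligned(size_t n) {
    constexpr size_t alignment = alignof(__m256); // 32 bytes
    // aligned_alloc expects the size to be a multiple of the alignment, so round up.
    size_t bytes = (sizeof(float) * n + alignment - 1) / alignment * alignment;
    return static_cast<float*>(std::aligned_alloc(alignment, bytes)); // release with std::free
}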
// ...
float* a = allocate_aligned(N);
float* b = allocate_aligned(N);
float* c = allocate_aligned(N);
float* d_intel_avx_omp = allocate_aligned(N);

In C++17, new can allocate with an explicit alignment:

float* allocate_aligned(size_t n) {
    constexpr auto alignment = std::align_val_t{alignof(__m256)};
    return new(alignment) float[n];
}
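
Memory obtained with an alignment argument has to be released through the matching aligned deallocation function. A plain delete[] on a float* would not pass the alignment along. One way to keep the pair together is sketched below. It calls the aligned ::operator new[] directly, a deliberate swap from the new-expression so that the allocate/free pair is unambiguous. The names kAvxAlignment, AlignedArrayDeleter, AlignedFloats and make_aligned_floats are hypothetical, and the alignment is hard-coded to 32 (the value of alignof(__m256)) so that the sketch needs no AVX headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Assumption: 32 bytes, i.e. the value of alignof(__m256).
constexpr std::size_t kAvxAlignment = 32;

// Hypothetical deleter: returns the block through the aligned operator delete[].
struct AlignedArrayDeleter {
    void operator()(float* p) const noexcept {
        ::operator delete[](p, std::align_val_t{kAvxAlignment});
    }
};

using AlignedFloats = std::unique_ptr<float[], AlignedArrayDeleter>;

// Allocates n floats on a 32-byte boundary; the contents are uninitialized.
inline AlignedFloats make_aligned_floats(std::size_t n) {
    assert(n > 0);
    void* raw = ::operator new[](n * sizeof(float), std::align_val_t{kAvxAlignment});
    return AlignedFloats(static_cast<float*>(raw));
}
```

With this wrapper, something like auto a = make_aligned_floats(N); gives a pointer that a.get() can hand to aligned load/store intrinsics, and the memory is freed correctly when a goes out of scope.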

Alternatively, use Vc (portable, zero-overhead C++ types for explicitly data-parallel programming), which aligns heap-allocated SIMD vectors for you:

#include <cstddef>  // size_t
#include <cstdio>   // std::printf
#include <cstdlib>  // std::rand
#include <memory>
#include <chrono>
#include <Vc/Vc>

Vc::float_v random_float_v() {
    alignas(Vc::VectorAlignment) float t[Vc::float_v::Size];
    for(unsigned i = 0; i < Vc::float_v::Size; ++i)
        t[i] = std::rand() % 10;
    return Vc::float_v(t, Vc::Aligned);
}

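// Checksum over the result buffer. __builtin_ia32_crc32si is a GCC/Clang builtin that needs SSE4.2 (-msse4.2).
unsigned reverse_crc32(void const* vbegin, void const* vend) {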
    unsigned const* begin = reinterpret_cast<unsigned const*>(vbegin);
    unsigned const* end = reinterpret_cast<unsigned const*>(vend);
    unsigned r = 0;
    while(begin != end)
        r = __builtin_ia32_crc32si(r, *--end);
    return r;
}

int main() {
    constexpr size_t N = 65536;
    constexpr size_t M = N / Vc::float_v::Size;

    std::unique_ptr<Vc::float_v[]> a(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> b(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> c(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> d_intel_avx_omp(new Vc::float_v[M]);

    for(unsigned i = 0; i < M; ++i) {
        a[i] = random_float_v();
        b[i] = random_float_v();
        c[i] = random_float_v();
    }

    auto t0 = std::chrono::high_resolution_clock::now();
    for(unsigned i = 0; i < M; ++i)
        d_intel_avx_omp[i] = a[i] * b[i] + c[i];
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
    unsigned crc = reverse_crc32(d_intel_avx_omp.get(), d_intel_avx_omp.get() + M); // Make sure d_intel_avx_omp isn't optimized out.
    std::printf("crc: %u, time: %.09f seconds\n", crc, seconds);
}

Parallel version (using Intel TBB):

#include <tbb/parallel_for.h>
// ...
    auto t0 = std::chrono::high_resolution_clock::now();
    tbb::parallel_for(size_t{0}, M, [&](size_t i) {
        d_intel_avx_omp[i] = a[i] * b[i] + c[i];
    });
    auto t1 = std::chrono::high_resolution_clock::now();

You must use aligned memory for these intrinsics. Change your malloc(...) to aligned_alloc(sizeof(float) * 8, ...) (C11), which requests 32-byte alignment.

This is completely unrelated to atomics. Each thread works on entirely separate pieces of data (even on different cache lines), so no synchronization is needed.
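
The point about separate data can be illustrated with a small sketch of an OpenMP loop in which each iteration owns one block of 8 floats, so that no two threads ever store to the same bytes. The function name multiply_add is hypothetical, and the sketch assumes that n is a multiple of 8 and that all four pointers are 32-byte aligned. It uses AVX intrinsics when the compiler defines __AVX__ (for example with -mavx on GCC/Clang) and falls back to scalar code otherwise. Without -fopenmp the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Computes d[i] = a[i] * b[i] + c[i] for n floats.
// Assumptions: n is a multiple of 8; a, b, c and d are 32-byte aligned.
inline void multiply_add(const float* a, const float* b, const float* c,
                         float* d, std::size_t n) {
    assert(n % 8 == 0);
    const long blocks = static_cast<long>(n / 8);
    // Iteration k owns elements [8*k, 8*k + 8) of d, so the stores of
    // different threads never overlap and need no atomics or locks.
    #pragma omp parallel for
    for (long k = 0; k < blocks; ++k) {
        const std::size_t i = static_cast<std::size_t>(k) * 8;
#if defined(__AVX__)
        const __m256 prod = _mm256_mul_ps(_mm256_load_ps(a + i), _mm256_load_ps(b + i));
        _mm256_store_ps(d + i, _mm256_add_ps(prod, _mm256_load_ps(c + i)));
#else
        for (std::size_t j = i; j < i + 8; ++j)
            d[j] = a[j] * b[j] + c[j]; // scalar fallback when AVX is not enabled
#endif
    }
}
```

If the pointers came from plain malloc, the aligned loads and stores in the AVX path could fault regardless of how many threads are running, which is an alignment problem rather than a race.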
