Is the _mm256_store_ps() function atomic when used alongside OpenMP?

Question

I am trying to write a simple program that uses Intel's AVX instructions to perform vector multiplication and addition, with OpenMP running the loop in parallel. However, it crashes with a segmentation fault at the call to _mm256_store_ps().

I tried OpenMP synchronization constructs such as atomic and critical, in case the function is not atomic and multiple cores executing it at the same time cause the crash, but that did not help.

#include<stdio.h>
#include<time.h>
#include<stdlib.h>
#include<immintrin.h>
#include<omp.h>
#define N 64

__m256 multiply_and_add_intel(__m256 a, __m256 b, __m256 c) {
  return _mm256_add_ps(_mm256_mul_ps(a, b),c);
}

void multiply_and_add_intel_total_omp(const float* a, const float* b, const float* c, float* d)
{
  __m256 a_intel, b_intel, c_intel, d_intel;
  #pragma omp parallel for private(a_intel,b_intel,c_intel,d_intel)
  for(long i=0; i<N; i=i+8) {
    a_intel = _mm256_loadu_ps(&a[i]);
    b_intel = _mm256_loadu_ps(&b[i]);
    c_intel = _mm256_loadu_ps(&c[i]);
    d_intel = multiply_and_add_intel(a_intel, b_intel, c_intel);
    _mm256_store_ps(&d[i],d_intel);
  }
}
int main()
{
    srand(time(NULL));
    float * a = (float *) malloc(sizeof(float) * N);
    float * b = (float *) malloc(sizeof(float) * N);
    float * c = (float *) malloc(sizeof(float) * N);
    float * d_intel_avx_omp = (float *)malloc(sizeof(float) * N);
    int i;
    for(i=0;i<N;i++)
    {
        a[i] = (float)(rand()%10);
        b[i] = (float)(rand()%10);
        c[i] = (float)(rand()%10);
    }
    double time_t = omp_get_wtime();
    multiply_and_add_intel_total_omp(a,b,c,d_intel_avx_omp);
    time_t = omp_get_wtime() - time_t;
    printf("\nTime taken to calculate with AVX2 and OMP : %0.5lf\n",time_t);

    free(a);
    free(b);
    free(c);
    free(d_intel_avx_omp);
    return 0;
}

I expect to get d = a * b + c, but the program crashes with a segmentation fault. The same task without OpenMP runs without errors. Please let me know whether there is a compatibility issue or whether I am missing something.

gcc version 7.3.0
Intel® Core™ i3-3110M Processor
OS Ubuntu 18.04
OpenMP 4.5 (running $ echo |cpp -fopenmp -dM |grep -i open printed #define _OPENMP 201511)
Compile command: gcc first_int.c -mavx -fopenmp

** UPDATE **

Following the discussion and suggestions, the allocation code is now:

 float * a = (float *) aligned_alloc(N, sizeof(float) * N);
 float * b = (float *) aligned_alloc(N, sizeof(float) * N);
 float * c = (float *) aligned_alloc(N, sizeof(float) * N);
 float * d_intel_avx_omp = (float *)aligned_alloc(N, sizeof(float) * N);

This works perfectly.

As a side note, I was comparing a plain scalar calculation, an AVX calculation, and an AVX + OpenMP calculation. These are the timings I got for N = 50000:

Time taken to calculate without AVX : 0.00037

Time taken to calculate with AVX : 0.00024

Time taken to calculate with AVX and OMP : 0.00019

Answer 1:

The documentation for _mm256_store_ps says:

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

You can use _mm256_storeu_ps instead for unaligned stores of float data (_mm256_storeu_si256 is the integer counterpart and takes a __m256i).

A better option is to align all your arrays on a 32-byte boundary (the width of a 256-bit AVX register) and use aligned loads and stores for maximum performance, because unaligned loads and stores that cross a cache-line boundary incur a performance penalty.
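To confirm that alignment is the culprit, the pointers can be checked before the store. The following is a minimal sketch; is_aligned is a hypothetical helper name, not part of any library, and it assumes that converting a pointer to uintptr_t yields the address, which holds on mainstream x86-64 toolchains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p lies on a multiple of `alignment` bytes, 0 otherwise.
   `alignment` must be a nonzero power of two (32 for __m256). */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

Calling is_aligned(d, 32) on the malloc-based arrays from the question may return 0. glibc's malloc typically guarantees only 16-byte alignment on x86-64, so whether a 32-byte aligned store faults depends on where the block happens to land.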

Use std::aligned_alloc (or C11 aligned_alloc, memalign, posix_memalign, whatever you have available) instead of malloc(size), e.g.:

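#include <cstdlib>      // std::aligned_alloc (C++17)
#include <immintrin.h>  // __m256

float* allocate_aligned(size_t n) {
    constexpr size_t alignment = alignof(__m256);
    // aligned_alloc expects the size to be a multiple of the alignment;
    // that holds here when n is a multiple of 8.
    return static_cast<float*>(std::aligned_alloc(alignment, sizeof(float) * n));
}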
// ...
float* a = allocate_aligned(N);
float* b = allocate_aligned(N);
float* c = allocate_aligned(N);
float* d_intel_avx_omp = allocate_aligned(N);
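For the plain C program in the question, a similar helper might look like the sketch below. The names alloc_floats_avx and AVX_ALIGNMENT are made up for illustration. The helper rounds the byte count up because C11 aligned_alloc expects the size to be a multiple of the alignment, which would otherwise fail for element counts that are not a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define AVX_ALIGNMENT 32  /* bytes: the size of one 256-bit register */

/* Allocates room for at least n floats on a 32-byte boundary.
   Returns NULL on allocation failure or size overflow.
   Release the block with free(). */
static float *alloc_floats_avx(size_t n)
{
    if (n > (SIZE_MAX - (AVX_ALIGNMENT - 1)) / sizeof(float))
        return NULL;                      /* n * sizeof(float) would overflow */

    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + AVX_ALIGNMENT - 1) / AVX_ALIGNMENT * AVX_ALIGNMENT;
    if (rounded == 0)
        rounded = AVX_ALIGNMENT;          /* avoid a zero-size request */

    float *p = aligned_alloc(AVX_ALIGNMENT, rounded);
    assert(p == NULL || ((uintptr_t)p % AVX_ALIGNMENT) == 0);
    return p;
}
```

Note that the first argument of aligned_alloc is the boundary in bytes (32 here), not the element count, so it stays the same when N changes.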

In C++17, new can allocate with alignment:

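#include <new>  // std::align_val_t

float* allocate_aligned(size_t n) {
    constexpr auto alignment = std::align_val_t{alignof(__m256)};
    // Release with the matching aligned form: ::operator delete[](p, alignment);
    return new(alignment) float[n];
}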

Alternatively, use Vc (portable, zero-overhead C++ types for explicitly data-parallel programming), which aligns heap-allocated SIMD vectors for you:

#include <cstdio>
#include <cstdlib>  // std::rand
#include <memory>
#include <chrono>
#include <Vc/Vc>

Vc::float_v random_float_v() {
    alignas(Vc::VectorAlignment) float t[Vc::float_v::Size];
    for(unsigned i = 0; i < Vc::float_v::Size; ++i)
        t[i] = std::rand() % 10;
    return Vc::float_v(t, Vc::Aligned);
}

unsigned reverse_crc32(void const* vbegin, void const* vend) {
    unsigned const* begin = reinterpret_cast<unsigned const*>(vbegin);
    unsigned const* end = reinterpret_cast<unsigned const*>(vend);
    unsigned r = 0;
    while(begin != end)
        r = __builtin_ia32_crc32si(r, *--end);
    return r;
}

int main() {
    constexpr size_t N = 65536;
    constexpr size_t M = N / Vc::float_v::Size;

    std::unique_ptr<Vc::float_v[]> a(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> b(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> c(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> d_intel_avx_omp(new Vc::float_v[M]);

    for(unsigned i = 0; i < M; ++i) {
        a[i] = random_float_v();
        b[i] = random_float_v();
        c[i] = random_float_v();
    }

    auto t0 = std::chrono::high_resolution_clock::now();
    for(unsigned i = 0; i < M; ++i)
        d_intel_avx_omp[i] = a[i] * b[i] + c[i];
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
    unsigned crc = reverse_crc32(d_intel_avx_omp.get(), d_intel_avx_omp.get() + M); // Make sure d_intel_avx_omp isn't optimized out.
    std::printf("crc: %u, time: %.09f seconds\n", crc, seconds);
}

Parallel version:

#include <tbb/parallel_for.h>
// ...
    auto t0 = std::chrono::high_resolution_clock::now();
    tbb::parallel_for(size_t{0}, M, [&](unsigned i) {
        d_intel_avx_omp[i] = a[i] * b[i] + c[i];
    });
    auto t1 = std::chrono::high_resolution_clock::now();

Answer 2:

You must use aligned memory for these intrinsics. Change your malloc(...) to aligned_alloc(sizeof(float) * 8, ...) (C11).

This is completely unrelated to atomics. You are working on entirely separate pieces of data (even on different cache lines), so there is no need for any protection.
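Putting this together, the loop from the question could be written as in the sketch below, with no atomic or critical construct at all. It assumes that n is a multiple of 8 and that d was allocated on a 32-byte boundary. The function name is illustrative, and the target attribute is a GCC/Clang extension used here so that the function builds even without -mavx; with the compile command from the question it is redundant.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Computes d[i] = a[i] * b[i] + c[i] for i in [0, n).
   Assumes n is a multiple of 8 and d is 32-byte aligned. */
static __attribute__((target("avx")))
void multiply_and_add_avx_omp(const float *a, const float *b,
                              const float *c, float *d, long n)
{
    assert(n % 8 == 0);
    assert(((uintptr_t)d % 32) == 0);

    /* Each iteration writes its own 8 floats, so the threads never touch
       the same element. Without -fopenmp the pragma is simply ignored. */
    #pragma omp parallel for
    for (long i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   /* unaligned loads are allowed */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_loadu_ps(&c[i]);
        __m256 vd = _mm256_add_ps(_mm256_mul_ps(va, vb), vc);
        _mm256_store_ps(&d[i], vd);           /* aligned store: needs 32-byte d */
    }
}
```

Compared with the version in the question, the vector temporaries are declared inside the loop body, which makes them private to each thread without a private clause.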

Source: https://stackoverflow.com/questions/55953452/is-mm256-store-ps-function-is-atomic-while-using-alongside-openmp

Tags: openmp, avx, avx2