Question
I wrote a matrix-vector product program using OpenMP and AVX2.
However, I got the wrong answer, and OpenMP is the cause. The correct result is that every element of array c becomes 100, but my output was a mix of 98, 99, and 100.
The actual code is below.
I compiled with Clang using -fopenmp, -mavx, and -mfma.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>
#include <x86intrin.h>

void mv(double *a, double *b, double *c, int m, int n, int l)
{
    int k;
    #pragma omp parallel
    {
        __m256d va, vb, vc;
        int i;
        #pragma omp for private(i, va, vb, vc) schedule(static)
        for (k = 0; k < l; k++) {
            vb = _mm256_broadcast_sd(&b[k]);
            for (i = 0; i < m; i += 4) {
                va = _mm256_loadu_pd(&a[m*k+i]);
                vc = _mm256_loadu_pd(&c[i]);
                vc = _mm256_fmadd_pd(va, vb, vc);
                _mm256_storeu_pd(&c[i], vc);
            }
        }
    }
}
int main(int argc, char *argv[])
{
    // set variables
    int m;
    double *a;
    double *b;
    double *c;
    int i;
    m = 100;
    // main program
    // set vector or matrix
    a = (double *)malloc(sizeof(double) * m*m);
    b = (double *)malloc(sizeof(double) * m);
    c = (double *)malloc(sizeof(double) * m);
    // preset
    for (i = 0; i < m; i++) {
        a[i] = 1;
        b[i] = 1;
        c[i] = 0.0;
    }
    for (i = m; i < m*m; i++) {
        a[i] = 1;
    }
    mv(a, b, c, m, 1, m);
    for (i = 0; i < m; i++) {
        printf("%e\n", c[i]);
    }
    free(a);
    free(b);
    free(c);
    return 0;
}
I know a critical section would fix it, but the critical section was too slow.
So, how can I solve the problem?
Answer 1:
The fundamental operation you want is
c[i] += a[i,k]*b[k]
If you use row-major order storage this becomes
c[i] += a[i*l + k]*b[k]
If you use column-major order storage this becomes
c[i] += a[k*m + i]*b[k]
For row-major order you can parallelize like this
#pragma omp parallel for
for (int i = 0; i < m; i++) {
    for (int k = 0; k < l; k++) {
        c[i] += a[i*l + k]*b[k];
    }
}
For column-major order you can parallelize like this
#pragma omp parallel
for (int k = 0; k < l; k++) {
    #pragma omp for
    for (int i = 0; i < m; i++) {
        c[i] += a[k*m + i]*b[k];
    }
}
Matrix-vector products are BLAS Level 2 operations, which are memory-bandwidth bound. Level 1 and Level 2 operations don't scale with the number of cores; only Level 3 operations (e.g. dense matrix multiplication) do: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_3.
Answer 2:
The issue is not with your AVX intrinsics; let's look at the code without the intrinsics for a minute:
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < l; k++) {
        double xb = b[k];
        for (int i = 0; i < m; i++) {
            double xa = a[m*k+i];
            double xc = c[i];
            xc = xc + xa * xb;
            c[i] = xc;
        }
    }
}
Note: your private declaration was technically correct but redundant, because the variables are already declared inside the parallel region. It is much easier to reason about the code if you declare variables as locally as possible.
The race condition in your code is on c[i], which multiple threads try to update. Even if you could protect that with, say, an atomic update, the performance would be horrible: not only because of the protection itself, but because the data of c[i] would constantly be shifted around between the caches of different cores.
One thing you can do about this is to use an array reduction on c. This makes a private copy of c for each thread, and the copies get merged at the end:
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    #pragma omp parallel for schedule(static) reduction(+:c[:m])
    for (int k = 0; k < l; k++) {
        for (int i = 0; i < m; i++) {
            c[i] += a[m*k+i] * b[k];
        }
    }
}
This should be reasonably efficient as long as two m-vectors fit in your cache, but you may still see a lot of thread-management overhead. Eventually you will be limited by memory bandwidth, because in a matrix-vector multiplication you only have one computation per element read from a.
Anyway, you can of course swap the i and k loops and save the reduction, but then your memory access pattern on a will be inefficient (strided), so you should block the loop to avoid that.
Now if you look at the output of any modern compiler, it will generate SIMD code on its own. Of course you can apply your own SIMD intrinsics if you want, but make sure you handle the edge cases correctly when m is not divisible by 4 (you did not in your original version).
At the end of the day, if you really want performance, use the functions from a BLAS library (e.g. MKL). If you want to play around with optimization, there are ample opportunities to go deep into the details.
Source: https://stackoverflow.com/questions/51120193/how-can-i-use-openmp-and-avx2-simultaneously-with-perfect-answer