OpenMP parallelization (Block Matrix Mult)

Submitted on 2020-01-15 11:11:07

Question


I'm attempting to implement block matrix multiplication and to parallelize it with OpenMP.

This is my code:

int i,j,jj,k,kk;
float sum;
int en = 4 * (2048/4);
#pragma omp parallel for collapse(2)
for(i=0;i<2048;i++) {
    for(j=0;j<2048;j++) {
        C[i][j]=0;
    }
}
for (kk=0;kk<en;kk+=4) {
    for(jj=0;jj<en;jj+=4) {
        for(i=0;i<2048;i++) {
            for(j=jj;j<jj+4;j++) {
                sum = C[i][j];
                for(k=kk;k<kk+4;k++) {
                    sum+=A[i][k]*B[k][j];
                }
                C[i][j] = sum;
            }
        }
    }
}

I've been playing around with OpenMP, but I still haven't figured out the best way to get this done in the least amount of time.


Answer 1:


Getting good performance from matrix multiplication is a big job. Since "The best code is the code I don't have to write", a much better use of your time would be to understand how to use a BLAS library.

If you are using x86 processors, the Intel Math Kernel Library (MKL) is available for free and includes optimized, parallelized matrix multiplication routines. https://software.intel.com/en-us/articles/free-mkl
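For reference, a minimal sketch of what the call could look like through the CBLAS interface for the 2048x2048 case in the question (this assumes A, B and C are contiguous row-major float arrays as above; the exact header and link flags depend on your MKL installation):

#include <mkl.h>   // cblas_sgemm; with another BLAS implementation use <cblas.h>

// C = 1.0f*A*B + 0.0f*C, all matrices 2048 x 2048, row-major
const int n = 2048;
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            n, n, n,
            1.0f, &A[0][0], n,
                  &B[0][0], n,
            0.0f, &C[0][0], n);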

(FWIW, I work for Intel, but not on MKL :-))




Answer 2:


I recently started looking into dense matrix multiplication (GEMM) again. It turns out the Clang compiler is really good at optimizing GEMM without needing any intrinsics (GCC still needs intrinsics). The following code gets 60% of the peak FLOPS of my four-core / eight-hardware-thread Skylake system. It uses block matrix multiplication.

Hyper-threading gives worse performance, so make sure you use only as many threads as you have physical cores, and bind the threads to prevent thread migration.

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=4

Then compile like this

clang -Ofast -march=native -fopenmp -Wall gemm_so.c
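
and run the resulting binary (a.out, since no -o is given) with the environment variables above exported:

./a.out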

The code

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>

#define SM 80 // block (tile) size in each dimension

// 64-byte-aligned, non-aliasing float pointer type used by the routines below.
typedef __attribute((aligned(64))) float * restrict fast_float;

// Copy one SM x SM block of a (row stride n) into the contiguous buffer b.
static void reorder2(fast_float a, fast_float b, int n) {
  for(int i=0; i<SM; i++) memcpy(&b[i*SM], &a[i*n], sizeof(float)*SM);
}

// Multiply an SM x SM block of a (row stride n) by the packed block b (row
// stride SM) and accumulate into the corresponding block of c (row stride n).
static void kernel(fast_float a, fast_float b, fast_float c, int n) {
  for(int i=0; i<SM; i++) {
    for(int k=0; k<SM; k++) {
      for(int j=0; j<SM; j++) {
        c[i*n + j] += a[i*n + k]*b[k*SM + j];
      }
    }
  }
}

// Blocked GEMM: c += a*b for n x n matrices, processed in SM x SM tiles.
void gemm(fast_float a, fast_float b, fast_float c, int n) {
  int bk = n/SM;  // number of blocks per dimension

  #pragma omp parallel
  {
    // Per-thread scratch buffer for one packed SM x SM block of b.
    float *b2 = _mm_malloc(sizeof(float)*SM*SM, 64);
    // Collapse only the block loops over i and j, so each output block of c is
    // updated by exactly one thread (collapsing k as well can split a block's
    // k iterations across threads and race on the += in kernel).
    #pragma omp for collapse(2)
    for(int i=0; i<bk; i++) {
      for(int j=0; j<bk; j++) {
        for(int k=0; k<bk; k++) {
          reorder2(&b[SM*(k*n + j)], b2, n);
          kernel(&a[SM*(i*n+k)], b2, &c[SM*(i*n+j)], n);
        }
      }
    }
    _mm_free(b2);
  }
}

static int doublecmp(const void *x, const void *y) { return *(double*)x < *(double*)y ? -1 : *(double*)x > *(double*)y; }

double median(double *x, int n) {
  qsort(x, n, sizeof(double), doublecmp);
  return 0.5f*(x[n/2] + x[(n-1)/2]);
}

int main(void) {
  int cores = 4;
  double frequency = 3.1; // i7-6700HQ turbo 4 cores
  double peak = 32*cores*frequency;

  int n = SM*10*2;  // 1600: matrix dimension, a multiple of the block size SM

  int mem = sizeof(float) * n * n;
  float *a = _mm_malloc(mem, 64);
  float *b = _mm_malloc(mem, 64);
  float *c = _mm_malloc(mem, 64);

  memset(a, 1, mem), memset(b, 1, mem);  // arbitrary byte fill; the numerical values (and uninitialized c) do not matter for timing

  printf("%dx%d matrix\n", n, n);
  printf("memory of matrices: %.2f MB\n", 3.0*mem*1E-6);
  printf("peak SP GFLOPS %.2f\n", peak);
  puts("");

  while(1) {
    int r = 10;
    double times[r];
    for(int j=0; j<r; j++) {
      times[j] = -omp_get_wtime();
      gemm(a, b, c, n);
      times[j] += omp_get_wtime();
    }

    double flop = 2.0*1E-9*n*n*n;  //GFLOP
    double time_mid = median(times, r);
    double flops_low  = flop/times[r-1], flops_mid  = flop/time_mid, flops_high = flop/times[0];
    printf("%.2f %.2f %.2f %.2f\n", 100*flops_low/peak, 100*flops_mid/peak, 100*flops_high/peak, flops_high);
  }
}

This does GEMM 10 times per iteration of an infinite loop and prints the low, median, and high ratio of measured FLOPS to peak FLOPS, followed by the highest GFLOPS value itself.

You will need to adjust the following lines

int cores = 4;  
double frequency = 3.1;  // i7-6700HQ turbo 4 cores 
double peak = 32*cores*frequency;

to the number of physical cores, the frequency of all cores (with turbo, if enabled), and the number of single-precision floating-point operations per cycle per core, which is 16 for Core2 through Ivy Bridge, 32 for Haswell through Kaby Lake, and 64 for Xeon Phi Knights Landing.
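
As an illustrative (not measured) example of the peak calculation, a hypothetical 6-core Haswell/Skylake-class desktop with a 3.5 GHz all-core turbo would use

// 32 SP FLOP/cycle/core on Haswell-Kaby Lake:
// 2 FMA units * 8 floats per 256-bit AVX register * 2 FLOP per FMA
int cores = 6;                      // physical cores, not hardware threads
double frequency = 3.5;             // GHz, sustained all-core turbo
double peak = 32*cores*frequency;   // 672 SP GFLOPS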

This code may be less efficient on NUMA systems. It also does not do nearly as well on Knights Landing (I have just started looking into this).



Source: https://stackoverflow.com/questions/43483324/openmp-parallelization-block-matrix-mult
