问题

I have a school task about paralel programming and I'm having a lot of problems with it. My task is to create a parallel version of given matrix multiplication code and test its performence (and yes, it has to be in KIJ order):

void multiply_matrices_KIJ()
{
    for (int k = 0; k < SIZE; k++)
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}

This is what I came up with so far:

void multiply_matrices_KIJ()
{
    for (int k = 0; k < SIZE; k++)
#pragma omp parallel
    {
#pragma omp for schedule(static, 16)
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}

And that's where i found something confusing to me. This parallel version of the code is running around 50% slower than non-parallel one. The difference in speed varies only a little bit based on the matrix size (tested SIZE = 128, 256, 512, 1024, 2048, and various schedule versions - dynamic, static, w/o it at all etc. so far).

Can someone help me understand what am I doing wrong? Is it maybe because I'm using the KIJ order and it won't get any faster using openMP?

EDIT:

I'm working on a Windows 7 PC, using Visual Studio 2015 Community edition, compiling in Release x86 mode (x64 doesn't help either). My CPU is: Intel Core i5-2520M CPU @ 2,50GHZ (yes, yes it's a laptop, but I'm getting same results on my home I7 PC)

I'm using global arrays:

float matrix_a[SIZE][SIZE];    
float matrix_b[SIZE][SIZE];    
float matrix_r[SIZE][SIZE];

I'm assigning random (float) values to matrix a and b, matrix r is filled with 0s.

I've tested the code with various matrix sizes so far (128, 256, 512, 1024, 2048 etc.). For some of them it is intended NOT to fit in cache. My current version of code looks like this:

void multiply_matrices_KIJ()
{
#pragma omp parallel 
    {
    for (int k = 0; k < SIZE; k++) {
#pragma omp for schedule(dynamic, 16) nowait
        for (int i = 0; i < SIZE; i++) {
            for (int j = 0; j < SIZE; j++) {
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
            }
        }
    }
    }
}

And just to be clear, I know that with different ordering of loops I could get better results but that is the thing - I HAVE TO use KIJ order. My task is to do the KIJ for loops in parallel and check the performence increase. My problem is that I expect(ed) at least a little faster execution (than the one im getting now which it between 5-10% faster at most) even though it's the I loop that is in parallel (can't do that with K loop because I will get incorrect result since it's matrix_r[i][j]).

These are the results I'm getting when using the code shown above (I'm doing calculations hundreds of times and getting the average time):

SIZE = 128

Serial version : 0,000608s
Parallel I, schedule(dynamic, 16): 0,000683s
Parallel I, schedule(static, 16): 0,000647s
Parallel J, no schedule: 0,001978s (this is where I exected way slower execution)

SIZE = 256

Serial version: 0,005787s
Parallel I, schedule(dynamic, 16): 0,005125s
Parallel I, schedule(static, 16): 0,004938s
Parallel J, no schedule: 0,013916s

SIZE = 1024

Serial version: 0,930250s
Parallel I, schedule(dynamic, 16): 0,865750s
Parallel I, schedule(static, 16): 0,823750s
Parallel J, no schedule: 1,137000s

回答1:

Note: This answer is not about how to get the best performance out of your loop order or how to parallelize it because I consider it to be suboptimal due to several reasons. I'll try to give some advice on how to improve the order (and parallelize it) instead.

Loop order

OpenMP is usually used to distribute work over several CPUs. Therefore, you want to maximize the workload of each thread while minimizing the amount of required data and information transfer.

You want to execute the outermost loop in parallel instead of the second one. Therefore, you'll want to have one of the r_matrix indices as outer loop index in order to avoid race conditions when writing to the result matrix.
The next thing is that you want to traverse the matrices in memory storage order (having the faster changing indices as the second not the first subscript index).

You can achieve both with the following loop/index order:

for i = 0 to a_rows
  for k = 0 to a_cols
    for j = 0 to b_cols
      r[i][j] = a[i][k]*b[k][j]

Where

j changes faster than i or k and k changes faster than i.
i is a result matrix subscript and the i loop can run parallel

Rearranging your multiply_matrices_KIJ in that way gives quite a bit of a performance boost already.

I did some short tests and the code I used to compare the timings is:

template<class T>
void mm_kij(T const * const matrix_a, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const matrix_b, std::size_t const b_rows, 
  std::size_t const b_cols, T * const matrix_r)
{
  for (std::size_t k = 0; k < a_cols; k++)
  {
    for (std::size_t i = 0; i < a_rows; i++)
    {
      for (std::size_t j = 0; j < b_cols; j++)
      {
        matrix_r[i*b_cols + j] += 
          matrix_a[i*a_cols + k] * matrix_b[k*b_cols + j];
      }
    }
  }
}

mimicing your multiply_matrices_KIJ() function versus

template<class T>
void mm_opt(T const * const a_matrix, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows, 
  std::size_t const b_cols, T * const r_matrix)
{
  for (std::size_t i = 0; i < a_rows; ++i)
  { 
    T * const r_row_p = r_matrix + i*b_cols;
    for (std::size_t k = 0; k < a_cols; ++k)
    { 
      auto const a_val = a_matrix[i*a_cols + k];
      T const * const b_row_p = b_matrix + k * b_cols;
      for (std::size_t j = 0; j < b_cols; ++j)
      { 
        r_row_p[j] += a_val * b_row_p[j];
      }
    }
  }
}

implementing the above mentioned order.

Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k

mm_kij(): 6.16706s.

mm_opt(): 2.6567s.

The given order also allows for outer loop parallelization without introducing any race conditions when writing to the result matrix:

template<class T>
void mm_opt_par(T const * const a_matrix, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows, 
  std::size_t const b_cols, T * const r_matrix)
{
#if defined(_OPENMP)
  #pragma omp parallel
  {
    auto ar = static_cast<std::ptrdiff_t>(a_rows);
    #pragma omp for schedule(static) nowait
    for (std::ptrdiff_t i = 0; i < ar; ++i)
#else
    for (std::size_t i = 0; i < a_rows; ++i)
#endif
    {
      T * const r_row_p = r_matrix + i*b_cols;
      for (std::size_t k = 0; k < b_rows; ++k)
      {
        auto const a_val = a_matrix[i*a_cols + k];
        T const * const b_row_p = b_matrix + k * b_cols;
        for (std::size_t j = 0; j < b_cols; ++j)
        {
          r_row_p[j] += a_val * b_row_p[j];
        }
      }
    }
#if defined(_OPENMP)
  }
#endif
}

Where each thread writes to an individual result row

Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k (4 OMP threads)

mm_kij(): 6.16706s.

mm_opt(): 2.6567s.

mm_opt_par(): 0.968325s.

Not perfect scaling but as a start faster than the serial code.

回答2:

OpenMP implementations creates a thread pool (although a thread pool is not mandated by the OpenMP standard every implementation of OpenMP I have seen does this) so that threads don't have to be created and destroyed each time a parallel region is entered. Nevertheless, there is a barrier between each parallel region so that all threads have to sync. There is probably some additional overhead in the fork join model between parallel regions. So even though the threads don't have to be recreated they still have to be initialized between parallel regions. More details can be found here.

In order to avoid the overhead between entering parallel regions I suggest creating the parallel region on the outermost loop but doing the work sharing on the inner loop over i like this:

void multiply_matrices_KIJ() {
    #pragma omp parallel
    for (int k = 0; k < SIZE; k++)
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}

There is an implicit barrier when using #pragma omp for. The nowait clause removes the barrier.

Also make sure you compile with optimizing. There is little point in comparing performance without optimization enabled. I would use -O3.

回答3:

Always keep in mind that for caching purposes, the most optimal ordering of your loops will be slowest -> fastest. In your case, that means I,K,L order. I would be quite surprised if your serial code is not automatically reordered from KIJ->IKL ordering by your compiler (assuming you have "-O3"). However, the compiler cannot do this with your parallel loop because that would break the logic you are declaring within your parallel region.

If you really truly cannot reorder your loops, then your best bet would probably be to rewrite the parallel region to encompass the largest possible loop. If you have OpenMP 4.0, you could also consider utilizing SIMD vectorization across your fastest dimension as well. However, I am still doubtful you will be able to beat your serial code by much because of the aforementioned caching issues inherent in your KIJ ordering...

void multiply_matrices_KIJ()
{
    #pragma omp parallel for
    for (int k = 0; k < SIZE; k++)
    {
        for (int i = 0; i < SIZE; i++)
            #pragma omp simd
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}

来源：https://stackoverflow.com/questions/35631216/matrix-multiplication-kij-order-parallel-version-slower-than-non-parallel

标签

c++

matrix

openmp

Matrix multiplication, KIJ order, Parallel version slower than non-parallel

问题