问题
I have a school task about paralel programming and I'm having a lot of problems with it. My task is to create a parallel version of given matrix multiplication code and test its performence (and yes, it has to be in KIJ order):
void multiply_matrices_KIJ()
{
for (int k = 0; k < SIZE; k++)
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
This is what I came up with so far:
void multiply_matrices_KIJ()
{
for (int k = 0; k < SIZE; k++)
#pragma omp parallel
{
#pragma omp for schedule(static, 16)
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
And that's where i found something confusing to me. This parallel version of the code is running around 50% slower than non-parallel one. The difference in speed varies only a little bit based on the matrix size (tested SIZE = 128, 256, 512, 1024, 2048, and various schedule versions - dynamic, static, w/o it at all etc. so far).
Can someone help me understand what am I doing wrong? Is it maybe because I'm using the KIJ order and it won't get any faster using openMP?
EDIT:
I'm working on a Windows 7 PC, using Visual Studio 2015 Community edition, compiling in Release x86 mode (x64 doesn't help either). My CPU is: Intel Core i5-2520M CPU @ 2,50GHZ (yes, yes it's a laptop, but I'm getting same results on my home I7 PC)
I'm using global arrays:
float matrix_a[SIZE][SIZE];
float matrix_b[SIZE][SIZE];
float matrix_r[SIZE][SIZE];
I'm assigning random (float) values to matrix a and b, matrix r is filled with 0s.
I've tested the code with various matrix sizes so far (128, 256, 512, 1024, 2048 etc.). For some of them it is intended NOT to fit in cache. My current version of code looks like this:
void multiply_matrices_KIJ()
{
#pragma omp parallel
{
for (int k = 0; k < SIZE; k++) {
#pragma omp for schedule(dynamic, 16) nowait
for (int i = 0; i < SIZE; i++) {
for (int j = 0; j < SIZE; j++) {
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
}
}
}
And just to be clear, I know that with different ordering of loops I could get better results but that is the thing - I HAVE TO use KIJ order. My task is to do the KIJ for loops in parallel and check the performence increase. My problem is that I expect(ed) at least a little faster execution (than the one im getting now which it between 5-10% faster at most) even though it's the I loop that is in parallel (can't do that with K loop because I will get incorrect result since it's matrix_r[i][j]).
These are the results I'm getting when using the code shown above (I'm doing calculations hundreds of times and getting the average time):
SIZE = 128
- Serial version : 0,000608s
- Parallel I, schedule(dynamic, 16): 0,000683s
- Parallel I, schedule(static, 16): 0,000647s
- Parallel J, no schedule: 0,001978s (this is where I exected way slower execution)
SIZE = 256
- Serial version: 0,005787s
- Parallel I, schedule(dynamic, 16): 0,005125s
- Parallel I, schedule(static, 16): 0,004938s
- Parallel J, no schedule: 0,013916s
SIZE = 1024
- Serial version: 0,930250s
- Parallel I, schedule(dynamic, 16): 0,865750s
- Parallel I, schedule(static, 16): 0,823750s
- Parallel J, no schedule: 1,137000s
回答1:
Note: This answer is not about how to get the best performance out of your loop order or how to parallelize it because I consider it to be suboptimal due to several reasons. I'll try to give some advice on how to improve the order (and parallelize it) instead.
Loop order
OpenMP is usually used to distribute work over several CPUs. Therefore, you want to maximize the workload of each thread while minimizing the amount of required data and information transfer.
You want to execute the outermost loop in parallel instead of the second one. Therefore, you'll want to have one of the
r_matrix
indices as outer loop index in order to avoid race conditions when writing to the result matrix.The next thing is that you want to traverse the matrices in memory storage order (having the faster changing indices as the second not the first subscript index).
You can achieve both with the following loop/index order:
for i = 0 to a_rows
for k = 0 to a_cols
for j = 0 to b_cols
r[i][j] = a[i][k]*b[k][j]
Where
j
changes faster thani
ork
andk
changes faster thani
.i
is a result matrix subscript and thei
loop can run parallel
Rearranging your multiply_matrices_KIJ
in that way gives quite a bit of a performance boost already.
I did some short tests and the code I used to compare the timings is:
template<class T>
void mm_kij(T const * const matrix_a, std::size_t const a_rows,
std::size_t const a_cols, T const * const matrix_b, std::size_t const b_rows,
std::size_t const b_cols, T * const matrix_r)
{
for (std::size_t k = 0; k < a_cols; k++)
{
for (std::size_t i = 0; i < a_rows; i++)
{
for (std::size_t j = 0; j < b_cols; j++)
{
matrix_r[i*b_cols + j] +=
matrix_a[i*a_cols + k] * matrix_b[k*b_cols + j];
}
}
}
}
mimicing your multiply_matrices_KIJ()
function versus
template<class T>
void mm_opt(T const * const a_matrix, std::size_t const a_rows,
std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows,
std::size_t const b_cols, T * const r_matrix)
{
for (std::size_t i = 0; i < a_rows; ++i)
{
T * const r_row_p = r_matrix + i*b_cols;
for (std::size_t k = 0; k < a_cols; ++k)
{
auto const a_val = a_matrix[i*a_cols + k];
T const * const b_row_p = b_matrix + k * b_cols;
for (std::size_t j = 0; j < b_cols; ++j)
{
r_row_p[j] += a_val * b_row_p[j];
}
}
}
}
implementing the above mentioned order.
Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k
mm_kij()
: 6.16706s.
mm_opt()
: 2.6567s.
The given order also allows for outer loop parallelization without introducing any race conditions when writing to the result matrix:
template<class T>
void mm_opt_par(T const * const a_matrix, std::size_t const a_rows,
std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows,
std::size_t const b_cols, T * const r_matrix)
{
#if defined(_OPENMP)
#pragma omp parallel
{
auto ar = static_cast<std::ptrdiff_t>(a_rows);
#pragma omp for schedule(static) nowait
for (std::ptrdiff_t i = 0; i < ar; ++i)
#else
for (std::size_t i = 0; i < a_rows; ++i)
#endif
{
T * const r_row_p = r_matrix + i*b_cols;
for (std::size_t k = 0; k < b_rows; ++k)
{
auto const a_val = a_matrix[i*a_cols + k];
T const * const b_row_p = b_matrix + k * b_cols;
for (std::size_t j = 0; j < b_cols; ++j)
{
r_row_p[j] += a_val * b_row_p[j];
}
}
}
#if defined(_OPENMP)
}
#endif
}
Where each thread writes to an individual result row
Time consumption for multiplication of two 2048x2048 matrices on Intel i5-2500k (4 OMP threads)
mm_kij()
: 6.16706s.
mm_opt()
: 2.6567s.
mm_opt_par()
: 0.968325s.
Not perfect scaling but as a start faster than the serial code.
回答2:
OpenMP implementations creates a thread pool (although a thread pool is not mandated by the OpenMP standard every implementation of OpenMP I have seen does this) so that threads don't have to be created and destroyed each time a parallel region is entered. Nevertheless, there is a barrier between each parallel region so that all threads have to sync. There is probably some additional overhead in the fork join model between parallel regions. So even though the threads don't have to be recreated they still have to be initialized between parallel regions. More details can be found here.
In order to avoid the overhead between entering parallel regions I suggest creating the parallel region on the outermost loop but doing the work sharing on the inner loop over i
like this:
void multiply_matrices_KIJ() {
#pragma omp parallel
for (int k = 0; k < SIZE; k++)
#pragma omp for schedule(static) nowait
for (int i = 0; i < SIZE; i++)
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
There is an implicit barrier when using #pragma omp for
. The nowait
clause removes the barrier.
Also make sure you compile with optimizing. There is little point in comparing performance without optimization enabled. I would use -O3
.
回答3:
Always keep in mind that for caching purposes, the most optimal ordering of your loops will be slowest -> fastest. In your case, that means I,K,L order. I would be quite surprised if your serial code is not automatically reordered from KIJ->IKL ordering by your compiler (assuming you have "-O3
"). However, the compiler cannot do this with your parallel loop because that would break the logic you are declaring within your parallel region.
If you really truly cannot reorder your loops, then your best bet would probably be to rewrite the parallel region to encompass the largest possible loop. If you have OpenMP 4.0, you could also consider utilizing SIMD vectorization across your fastest dimension as well. However, I am still doubtful you will be able to beat your serial code by much because of the aforementioned caching issues inherent in your KIJ ordering...
void multiply_matrices_KIJ()
{
#pragma omp parallel for
for (int k = 0; k < SIZE; k++)
{
for (int i = 0; i < SIZE; i++)
#pragma omp simd
for (int j = 0; j < SIZE; j++)
matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}
}
来源:https://stackoverflow.com/questions/35631216/matrix-multiplication-kij-order-parallel-version-slower-than-non-parallel