parallelizing matrix multiplication through threading and SIMD

Here are the times I get building on your algorithm on my four-core Ivy Bridge i7 processor:

sequential:         3.42 s
4 threads:          0.97 s
4 threads + SSE:    0.86 s

Here are the times on a two-core P9600 @ 2.53 GHz, which is similar to the OP's E2200 @ 2.2 GHz:

sequential:         6.52 s
2 threads:          3.66 s
2 threads + SSE:    3.75 s

I used OpenMP because it makes this easy. Each OpenMP thread effectively runs over the range

lowerBound = N*part/CORES;
upperBound = N*(part + 1)/CORES;

(Note that this is slightly different from your definition. Because your definition divides by CORES first, integer rounding can leave the last rows unprocessed for some values of N.)
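
To make the rounding issue concrete, here is a small standalone check (N = 10 and CORES = 4 are just illustrative values, not taken from the question):

#include <stdio.h>

int main(void) {
    int N = 10, CORES = 4;
    for(int part = 0; part < CORES; part++) {
        int lo_good = N*part/CORES,   hi_good = N*(part + 1)/CORES;
        int lo_bad  = (N/CORES)*part, hi_bad  = (N/CORES)*(part + 1);
        printf("part %d: divide last [%d,%d)   divide first [%d,%d)\n",
               part, lo_good, hi_good, lo_bad, hi_bad);
    }
    //divide last covers all rows:  [0,2) [2,5) [5,7) [7,10)
    //divide first stops at 8:      [0,2) [2,4) [4,6) [6,8)  -> rows 8 and 9 are never touched
}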

As for the SIMD version: it's not much faster, probably because the computation is memory-bandwidth bound and because GCC already vectorizes the inner loop anyway.
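
If you want to verify the auto-vectorization claim, a reasonably recent GCC can report it (the extra flag below is standard GCC, not part of the original build line):

    c99 -O3 -fopenmp -fopt-info-vec -Wall foo.c

GCC then prints a "loop vectorized" note for each loop it auto-vectorized, so you can check whether the inner loop of the plain gemm was already handled.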

The optimal solution is much more complicated. You need to use loop tiling and to reorder the elements within tiles to get the best performance. I don't have time to do that today.
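
For illustration only, a minimal untuned sketch of just the tiling step might look like the following (gemm_tiled and the tile size TS are placeholders of mine, and the reordering of elements within tiles is still missing):

void gemm_tiled(float * restrict a, float * restrict b, float * restrict c, int n) {
    const int TS = 64; //tile size: a guess, tune for your cache sizes
    for(int ii=0; ii<n; ii+=TS) {
        for(int kk=0; kk<n; kk+=TS) {
            for(int jj=0; jj<n; jj+=TS) {
                //work on one TS x TS block so the a, b and c tiles stay in cache
                for(int i=ii; i<ii+TS && i<n; i++) {
                    for(int k=kk; k<kk+TS && k<n; k++) {
                        for(int j=jj; j<jj+TS && j<n; j++) {
                            c[i*n+j] += a[i*n+k]*b[k*n+j];
                        }
                    }
                }
            }
        }
    }
}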

Here is the code I used:

//c99 -O3 -fopenmp -Wall foo.c
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>
#include <omp.h>

//sequential reference: i-k-j loop order, computes c += a*b
void gemm(float * restrict a, float * restrict b, float * restrict c, int n) {
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            for(int j=0; j<n; j++) {
                c[i*n+j] += a[i*n+k]*b[k*n+j];
            }
        }
    }
}

//threaded version: the outer i loop is split across cores by OpenMP
void gemm_tlp(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            for(int j=0; j<n; j++) {
                c[i*n+j] += a[i*n+k]*b[k*n+j];
            }
        }
    }
}   

//threads + SSE: broadcast a[i*n+k] and process 4 floats of b and c per iteration (assumes n is a multiple of 4)
void gemm_tlp_simd(float * restrict a, float * restrict b, float * restrict c, int n) {
    #pragma omp parallel for
    for(int i=0; i<n; i++) {
        for(int k=0; k<n; k++) {
            __m128 a4 = _mm_set1_ps(a[i*n+k]);
            for(int j=0; j<n; j+=4) {
                __m128 c4 = _mm_load_ps(&c[i*n+j]);
                __m128 b4 = _mm_load_ps(&b[k*n+j]);
                c4 = _mm_add_ps(_mm_mul_ps(a4,b4),c4);
                _mm_store_ps(&c[i*n+j], c4);
            }
        }
    }
}

int main(void) {
    int n = 2048;
    float *a = _mm_malloc(n*n * sizeof *a, 64);
    float *b = _mm_malloc(n*n * sizeof *b, 64);
    float *c1 = _mm_malloc(n*n * sizeof *c1, 64);
    float *c2 = _mm_malloc(n*n * sizeof *c2, 64);
    float *c3 = _mm_malloc(n*n * sizeof *c3, 64);
    for(int i=0; i<n*n; i++) a[i] = 1.0*i;
    for(int i=0; i<n*n; i++) b[i] = 1.0*i;
    memset(c1, 0, n*n * sizeof *c1);
    memset(c2, 0, n*n * sizeof *c2);
    memset(c3, 0, n*n * sizeof *c3);
    double dtime;

    dtime = -omp_get_wtime();
    gemm(a,b,c1,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    gemm_tlp(a,b,c2,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    gemm_tlp_simd(a,b,c3,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);
    printf("error %d\n", memcmp(c1,c2, n*n*sizeof *c1));
    printf("error %d\n", memcmp(c1,c3, n*n*sizeof *c1));
}

It looks to me like the threads are sharing the __m128 mmx* variables; you probably defined them as global or static. You must be getting wrong results in your X array too. Define the __m128 mmx* variables inside the scope of threadedSIMDMatMul, so that each thread gets its own copies on its stack, and it will run much faster.

void threadedSIMDMatMul(void* params)
{
    __m128 mmx0, mmx1, mmx2, mmx3, mmx4;
    // rest of the code here
}