Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

Backend · Open · 5 answers · 1373 views
我寻月下人不归 2020-12-25 09:17

I am taking a look at large matrix multiplication and ran the following experiment to form a baseline test:

  1. Randomly generate two 4096x4096 matrices X, Y from
5 Answers
  •  北海茫月
    2020-12-25 09:47

    This is quite a complex topic, and it is well answered by Eric in the post above. I just want to point to a useful reference in this direction, page 84:

    http://www.rrze.fau.de/dienste/arbeiten-rechnen/hpc/HPC4SE/

    which suggests applying "loop unroll and jam" on top of blocking.
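    As an illustrative sketch of that idea (not the reference's code, and not tuned for any particular CPU): a cache-blocked kernel where the `i` loop is unrolled by 2 and the loads of `X` are "jammed" across the two rows, so each element of `Y` fetched from cache is reused twice. `N` and the block size `BS` are small assumed values chosen so the demo runs quickly; `BS` is assumed to divide `N` evenly.

    ```cpp
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    constexpr int N  = 256;  // small size so the demo runs quickly (assumption)
    constexpr int BS = 64;   // cache block size; must divide N (assumption, tune per CPU)

    // Plain triple loop, for reference.
    void matmul_naive(const double* X, const double* Y, double* Z) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                double s = 0.0;
                for (int k = 0; k < N; ++k) s += X[i * N + k] * Y[k * N + j];
                Z[i * N + j] = s;
            }
    }

    // Blocking plus a 2x unroll-and-jam of the i loop: the two X values are
    // hoisted out of the j loop and each Y element is reused for both rows.
    void matmul_block_unroll(const double* X, const double* Y, double* Z) {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i += 2)       // unroll i by 2
                        for (int k = kk; k < kk + BS; ++k) {
                            double x0 = X[i * N + k];           // jam: hoist two
                            double x1 = X[(i + 1) * N + k];     // rows of X
                            for (int j = jj; j < jj + BS; ++j) {
                                double y = Y[k * N + j];        // loaded once, used twice
                                Z[i * N + j]       += x0 * y;
                                Z[(i + 1) * N + j] += x1 * y;
                            }
                        }
    }

    int main() {
        std::vector<double> X(N * N), Y(N * N), Z1(N * N, 0.0), Z2(N * N, 0.0);
        for (int i = 0; i < N * N; ++i) { X[i] = (i % 7) * 0.5; Y[i] = (i % 5) * 0.25; }

        matmul_naive(X.data(), Y.data(), Z1.data());
        matmul_block_unroll(X.data(), Y.data(), Z2.data());

        double maxdiff = 0.0;
        for (int i = 0; i < N * N; ++i)
            maxdiff = std::max(maxdiff, std::abs(Z1[i] - Z2[i]));
        std::printf("max diff = %g\n", maxdiff);  // both kernels sum k in the same order
    }
    ```

    Compile with optimizations (e.g. `g++ -O3`) to see the effect; the unroll factor and `BS` are the main knobs to tune against your cache sizes.
    
    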

    Regarding "Can anyone explain this difference?":

    A general explanation is that the ratio of the number of operations to the amount of data is O(N^3)/O(N^2). Thus matrix-matrix multiplication is a cache-bound algorithm, which means that, for large matrix sizes, you do not suffer from the usual memory-bandwidth bottleneck. You can get up to 90% of your CPU's peak performance if the code is well optimized. So the optimization potential, elaborated by Eric, is tremendous, as you observed. Actually, it would be very interesting to see the best-performing code and to compile your final program with another compiler (Intel usually claims to be the best).
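    A quick back-of-the-envelope check of that ratio for the questioner's N = 4096 (assuming double precision, one multiply plus one add per inner-loop step, and that each of the three matrices is streamed from memory roughly once, which is approximately what good blocking achieves):

    ```cpp
    #include <cstdint>
    #include <cstdio>

    int main() {
        const std::int64_t n = 4096;                        // the questioner's matrix size
        const double flops = 2.0 * n * n * n;               // one multiply + one add per (i,j,k)
        const double bytes = 3.0 * n * n * sizeof(double);  // X, Y, Z each streamed once (idealized)
        std::printf("flops: %.3e\n", flops);
        std::printf("bytes: %.3e\n", bytes);
        std::printf("flops per byte: %.1f\n", flops / bytes);  // grows linearly with n
    }
    ```

    Hundreds of floating-point operations per byte of mandatory traffic is far above what any memory system requires you to wait for, which is why a well-blocked kernel can run close to the CPU's arithmetic peak while the naïve loop, which re-reads the same data over and over, cannot.
    
    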
