Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

Backend · Open · 5 answers · 1373 views
我寻月下人不归 2020-12-25 09:17

I am taking a look at large matrix multiplication and ran the following experiment to form a baseline test:

  1. Randomly generate two 4096x4096 matrices X, Y from
5 Answers
  •  北海茫月
    2020-12-25 09:47

    This is quite a complex topic, and it is well answered by Eric in the post above. I just want to point to a useful reference in this direction, page 84:

    http://www.rrze.fau.de/dienste/arbeiten-rechnen/hpc/HPC4SE/

    which suggests applying "loop unroll and jam" on top of blocking.
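    As an illustrative sketch of that idea (not the reference's code, and not tuned for any particular CPU): a cache-blocked kernel where the `i` loop is unrolled by 2 and the loads of `X` are "jammed" across the two rows, so each element of `Y` fetched from cache is reused twice. `N` and the block size `BS` are small assumed values chosen so the demo runs quickly; `BS` is assumed to divide `N` evenly.

    ```cpp
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    constexpr int N  = 256;  // small size so the demo runs quickly (assumption)
    constexpr int BS = 64;   // cache block size; must divide N (assumption, tune per CPU)

    // Plain triple loop, for reference.
    void matmul_naive(const double* X, const double* Y, double* Z) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                double s = 0.0;
                for (int k = 0; k < N; ++k) s += X[i * N + k] * Y[k * N + j];
                Z[i * N + j] = s;
            }
    }

    // Blocking plus a 2x unroll-and-jam of the i loop: the two X values are
    // hoisted out of the j loop and each Y element is reused for both rows.
    void matmul_block_unroll(const double* X, const double* Y, double* Z) {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i += 2)       // unroll i by 2
                        for (int k = kk; k < kk + BS; ++k) {
                            double x0 = X[i * N + k];           // jam: hoist two
                            double x1 = X[(i + 1) * N + k];     // rows of X
                            for (int j = jj; j < jj + BS; ++j) {
                                double y = Y[k * N + j];        // loaded once, used twice
                                Z[i * N + j]       += x0 * y;
                                Z[(i + 1) * N + j] += x1 * y;
                            }
                        }
    }

    int main() {
        std::vector<double> X(N * N), Y(N * N), Z1(N * N, 0.0), Z2(N * N, 0.0);
        for (int i = 0; i < N * N; ++i) { X[i] = (i % 7) * 0.5; Y[i] = (i % 5) * 0.25; }

        matmul_naive(X.data(), Y.data(), Z1.data());
        matmul_block_unroll(X.data(), Y.data(), Z2.data());

        double maxdiff = 0.0;
        for (int i = 0; i < N * N; ++i)
            maxdiff = std::max(maxdiff, std::abs(Z1[i] - Z2[i]));
        std::printf("max diff = %g\n", maxdiff);  // both kernels sum k in the same order
    }
    ```

    Compile with optimizations (e.g. `g++ -O3`) to see the effect; the unroll factor and `BS` are the main knobs to tune against your cache sizes.
    
    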

    Regarding "Can anyone explain this difference?":

    A general explanation is that the ratio of the number of operations to the amount of data is O(N^3)/O(N^2). Thus matrix-matrix multiplication is a cache-bound algorithm, which means that, for large matrix sizes, you do not suffer from the usual memory-bandwidth bottleneck. You can get up to 90% of your CPU's peak performance if the code is well optimized. So the optimization potential, elaborated by Eric, is tremendous, as you observed. Actually, it would be very interesting to see the best-performing code and to compile your final program with another compiler (Intel usually claims to be the best).
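    A quick back-of-the-envelope check of that ratio for the questioner's N = 4096 (assuming double precision, one multiply plus one add per inner-loop step, and that each of the three matrices is streamed from memory roughly once, which is approximately what good blocking achieves):

    ```cpp
    #include <cstdint>
    #include <cstdio>

    int main() {
        const std::int64_t n = 4096;                        // the questioner's matrix size
        const double flops = 2.0 * n * n * n;               // one multiply + one add per (i,j,k)
        const double bytes = 3.0 * n * n * sizeof(double);  // X, Y, Z each streamed once (idealized)
        std::printf("flops: %.3e\n", flops);
        std::printf("bytes: %.3e\n", bytes);
        std::printf("flops per byte: %.1f\n", flops / bytes);  // grows linearly with n
    }
    ```

    Hundreds of floating-point operations per byte of mandatory traffic is far above what any memory system requires you to wait for, which is why a well-blocked kernel can run close to the CPU's arithmetic peak while the naïve loop, which re-reads the same data over and over, cannot.
    
    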
