Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

Eric Postpischil

Here are three factors responsible for the performance difference between your code and BLAS (plus a note on Strassen’s algorithm).

In your inner loop, on k, you have y[k*dim + col]. Because of the way memory cache is arranged, consecutive values of k with the same dim and col map to the same cache set. The way cache is structured, each memory address has one cache set where its contents must be held while it is in cache. Each cache set has several lines (four is a typical number), and each of those lines can hold any of the memory addresses that map to that particular cache set.

Because your inner loop iterates through y in this way, each time it uses an element from y, it must load the memory for that element into the same set as the previous iteration did. This forces one of the previous cache lines in the set to be evicted. Then, in the next iteration of the col loop, all of the elements of y have been evicted from cache, so they must be reloaded again.

Thus, every time your loop loads an element of y, it must be loaded from memory, which takes many CPU cycles.

High-performance code avoids this in two ways. One, it divides the work into smaller blocks. The rows and the columns are partitioned into smaller sizes and processed with shorter loops that are able to use all the elements in a cache line and to use each element several times before going on to the next block. Thus, most of the references to elements of x and elements of y come from cache, often in a single processor cycle. Two, in some situations, the code will copy data out of a column of a matrix (which thrashes cache due to the geometry) into a row of a temporary buffer (which avoids thrashing). This again allows most of the memory references to be served from cache instead of from memory.
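As a rough illustration of the blocking idea (not what BLAS actually does, and assuming your kernel is the usual row-major z[row*dim + col] += x[row*dim + k] * y[k*dim + col]), a blocked version might look like this:

// Blocking sketch: z = x * y, all matrices dim x dim, row-major floats.
// BS is a hypothetical tile size; pick it so a few BS x BS tiles fit in cache.
constexpr int BS = 64;

void multiply_blocked(const float* x, const float* y, float* z, int dim)
{
    for (int i = 0; i < dim * dim; ++i)
        z[i] = 0.0f;

    for (int i0 = 0; i0 < dim; i0 += BS)
        for (int k0 = 0; k0 < dim; k0 += BS)
            for (int j0 = 0; j0 < dim; j0 += BS)
                // Multiply the (i0,k0) tile of x by the (k0,j0) tile of y,
                // accumulating into the (i0,j0) tile of z. Every element of
                // these tiles is reused many times while it is still in cache.
                for (int i = i0; i < i0 + BS && i < dim; ++i)
                    for (int k = k0; k < k0 + BS && k < dim; ++k) {
                        float xik = x[i*dim + k];
                        for (int j = j0; j < j0 + BS && j < dim; ++j)
                            z[i*dim + j] += xik * y[k*dim + j];
                    }
}

Note the innermost loop now runs along rows of both y and z, so it streams through cache lines instead of jumping by dim elements on every iteration.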

Another factor is the use of Single Instruction Multiple Data (SIMD) features. Many modern processors have instructions that load multiple elements (four float elements is typical, but some now do eight) in one instruction, store multiple elements, add multiple elements (e.g., for each of these four, add it to the corresponding one of those four), multiply multiple elements, and so on. Simply using such instructions immediately makes your code four times faster, provided you are able to arrange your work to use those instructions.

These instructions are not directly accessible in standard C. Some optimizers now try to use such instructions when they can, but this optimization is difficult, and it is not common to gain much benefit from it. Many compilers provide extensions to the language that give access to these instructions. Personally, I usually prefer to write in assembly language to use SIMD.
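For example, a minimal sketch using the x86 AVX intrinsics from <immintrin.h> (one of those compiler extensions; it assumes row-major floats, dim a multiple of 8, z zeroed by the caller, and an appropriate compiler flag such as -mavx):

#include <immintrin.h>

// SIMD sketch: the innermost loop processes 8 floats of y and z at a time.
// Uses the i-k-j loop order so the vector loads and stores are contiguous.
void multiply_avx(const float* x, const float* y, float* z, int dim)
{
    for (int i = 0; i < dim; ++i)
        for (int k = 0; k < dim; ++k) {
            __m256 xik = _mm256_set1_ps(x[i*dim + k]);       // broadcast one x element
            for (int j = 0; j < dim; j += 8) {
                __m256 yk = _mm256_loadu_ps(&y[k*dim + j]);  // 8 floats from a row of y
                __m256 zi = _mm256_loadu_ps(&z[i*dim + j]);  // 8 running sums
                zi = _mm256_add_ps(zi, _mm256_mul_ps(xik, yk));
                _mm256_storeu_ps(&z[i*dim + j], zi);
            }
        }
}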

Another factor is using instruction-level parallel execution features on a processor. Observe that in your inner loop, acc is updated. The next iteration cannot add to acc until the previous iteration has finished updating acc. High-performance code will instead keep multiple sums running in parallel (even multiple SIMD sums). The result of this will be that while the addition for one sum is executing, the addition for another sum will be started. It is common on today’s processors to support four or more floating-point operations in progress at a time. As written, your code cannot do this at all. Some compilers will try to optimize the code by rearranging loops, but this requires the compiler to be able to see that iterations of a particular loop are independent from each other or can be commuted with another loop, et cetera.
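A sketch of the idea, assuming the two operands of the inner product are already contiguous (for example after copying a column of y into a temporary buffer, as described above) and the length is a multiple of 4:

// ILP sketch: four independent partial sums. The four additions have no
// dependency on one another, so the CPU can keep several floating-point
// additions in flight at the same time.
float dot4(const float* a, const float* b, int n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (int k = 0; k < n; k += 4) {
        acc0 += a[k    ] * b[k    ];
        acc1 += a[k + 1] * b[k + 1];
        acc2 += a[k + 2] * b[k + 2];
        acc3 += a[k + 3] * b[k + 3];
    }
    return (acc0 + acc1) + (acc2 + acc3);   // combine only once, at the end
}

Note this changes the order of the floating-point additions, so the result can differ slightly from the strictly sequential sum; that reordering is exactly what a compiler is not allowed to do on its own without permission such as -ffast-math.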

It is quite feasible that using cache effectively provides a factor of ten performance improvement, SIMD provides another four, and instruction-level parallelism provides another four, giving 160 altogether.

Here is a very crude estimate of the effect of Strassen’s algorithm, based on the Wikipedia page for it. The Wikipedia page says Strassen is slightly better than direct multiplication around n = 100. This suggests the ratio of the constant factors of the execution times is 100^3 / 100^2.807 ≈ 2.4. Obviously, this will vary tremendously depending on processor model, matrix sizes interacting with cache effects, and so on. However, simple extrapolation shows that Strassen is about twice as good as direct multiplication at n = 4096 ((4096/100)^(3-2.807) ≈ 2.05). Again, that is just a ballpark estimate.
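To check that arithmetic, it is just the two power expressions above:

#include <cmath>
#include <cstdio>

int main()
{
    // Ratio of constant factors implied by a crossover near n = 100,
    // and the extrapolated advantage of Strassen at n = 4096.
    double c    = std::pow(100.0, 3.0) / std::pow(100.0, 2.807);  // about 2.4
    double gain = std::pow(4096.0 / 100.0, 3.0 - 2.807);          // about 2.05
    std::printf("%.2f %.2f\n", c, gain);
}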

As for the later optimizations, consider this code in the inner loop:

bufz[trow][tcol] += B * bufy[tk][tcol];

One potential issue with this is that bufz could, in general, overlap bufy. Since you use global definitions for bufz and bufy, the compiler likely knows they do not overlap in this case. However, if you move this code into a subroutine that is passed bufz and bufy as parameters, and especially if you compile that subroutine in a separate source file, then the compiler is less likely to know that bufz and bufy do not overlap. In that case, the compiler cannot vectorize or otherwise reorder the code, because the bufz[trow][tcol] in this iteration might be the same as bufy[tk][tcol] in another iteration.

Even if the compiler can see that the subroutine is called with different bufz and bufy in the current source module, if the routine has extern linkage (the default), then the compiler has to allow for the routine to be called from an external module, so it must generate code that works correctly if bufz and bufy overlap. (One way the compiler can deal with this is to generate two versions of the routine, one to be called from external modules and one to be called from the module currently being compiled. Whether it does that depends on your compiler, the optimization switches, et cetera.) If you declare the routine as static, then the compiler knows it cannot be called from an external module (unless you take its address and there is a possibility the address is passed outside of the current module).
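A sketch of both remedies; the routine name and signature are hypothetical, __restrict is the common GCC/Clang/MSVC spelling of C99’s restrict, and static gives the routine internal linkage:

// Aliasing sketch: __restrict promises the compiler that zrow and yrow do
// not overlap, and static tells it the routine cannot be called from another
// translation unit, so vectorizing this loop is easy to prove legal.
static void row_update(float* __restrict zrow, const float* __restrict yrow,
                       float b, int n)
{
    for (int tcol = 0; tcol < n; ++tcol)
        zrow[tcol] += b * yrow[tcol];    // bufz[trow][tcol] += B * bufy[tk][tcol]
}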

Another potential issue is that, even if the compiler vectorizes this code, it does not necessarily generate the best code for the processor you execute on. Looking at the generated assembly code, it appears the compiler is using only %ymm1. Over and over again, it multiplies a value from memory into %ymm1, adds a value from memory to %ymm1, and stores a value from %ymm1 to memory. There are two problems with this.

One, you do not want these partial sums stored to memory frequently. You want many additions accumulated into a register, and the register will be written to memory only infrequently. Convincing the compiler to do this likely requires rewriting the code to be explicit about keeping partial sums in temporary objects and writing them to memory after a loop has completed.
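A sketch of that rewrite (a hypothetical helper; the point is only that the running total lives in a local object, so memory is touched twice rather than once per iteration):

// Keep the partial sum in a local; most compilers will then hold it in a
// register and write it back exactly once, after the loop.
static inline void accumulate_once(float* zcell, const float* xrow,
                                   const float* ycol, int stride, int len)
{
    float sum = *zcell;                   // load the running total once
    for (int k = 0; k < len; ++k)
        sum += xrow[k] * ycol[k * stride];
    *zcell = sum;                         // store it once
}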

Two, these instructions are nominally serially dependent. The add cannot start until the multiply completes, and the store cannot write to memory until the add completes. The Core i7 has great capabilities for out-of-order execution. So, while it has that add waiting to start execution, it looks at the multiply later in the instruction stream and starts it. (Even though that multiply also uses %ymm1, the processor remaps the registers on the fly, so that it uses a different internal register to do this multiply.) Even though your code is filled with consecutive dependencies, the processor tries to execute several instructions at once. However, a number of things can interfere with this. You can run out of the internal registers the processor uses for renaming. The memory addresses you use might run into false conflicts. (The processor looks at a dozen or so of the low bits of memory addresses to see if the address might be the same as another one that it is trying to load or store from an earlier instruction. If the bits are equal, the processor has to delay the current load or store until it can verify the entire address is different. This delay can bollux up more than just the current load or store.) So, it is better to have instructions that are overtly independent.

That is one more reason I prefer to write high-performance code in assembly. To do it in C, you have to convince the compiler to give you instructions like this, by doing things such as writing some of your own SIMD code (using the language extensions for them) and manually unrolling loops (writing out multiple iterations).
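For illustration only (not what BLAS does), here is a sketch that combines the two: AVX intrinsics plus a manually unrolled loop with four independent vector accumulators, assuming contiguous operands and a length that is a multiple of 32:

#include <immintrin.h>

// Four independent vector accumulators: the multiply-adds feeding s0..s3 do
// not depend on one another, so several of them can be in flight at once,
// and nothing is stored to memory until after the loop.
static float dot_avx_unrolled(const float* a, const float* b, int n)
{
    __m256 s0 = _mm256_setzero_ps();
    __m256 s1 = _mm256_setzero_ps();
    __m256 s2 = _mm256_setzero_ps();
    __m256 s3 = _mm256_setzero_ps();
    for (int k = 0; k < n; k += 32) {
        s0 = _mm256_add_ps(s0, _mm256_mul_ps(_mm256_loadu_ps(a + k),
                                             _mm256_loadu_ps(b + k)));
        s1 = _mm256_add_ps(s1, _mm256_mul_ps(_mm256_loadu_ps(a + k + 8),
                                             _mm256_loadu_ps(b + k + 8)));
        s2 = _mm256_add_ps(s2, _mm256_mul_ps(_mm256_loadu_ps(a + k + 16),
                                             _mm256_loadu_ps(b + k + 16)));
        s3 = _mm256_add_ps(s3, _mm256_mul_ps(_mm256_loadu_ps(a + k + 24),
                                             _mm256_loadu_ps(b + k + 24)));
    }
    // Reduce the four accumulators to a single scalar at the very end.
    __m256 s = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3));
    float tmp[8];
    _mm256_storeu_ps(tmp, s);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
}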

When copying into and out of buffers, there might be similar issues. However, you report 90% of the time is spent in calc_block, so I have not looked at this closely.

Also, does Strassen’s algorithm account for any of the remaining difference?

Strassen's algorithm has two advantages over the naïve algorithm:

  1. Better time complexity in terms of number of operations, as other answers correctly point out;
  2. It is a cache-oblivious algorithm. The difference in the number of cache misses is on the order of B*sqrt(M), where B is the cache line size and M is the cache size.

I think that the second point accounts for a lot of the slowdown you are experiencing. If you are running your application under Linux, I suggest you run it with the perf tool, which tells you how many cache misses the program is experiencing.

I don't know how reliable the information is, but Wikipedia says that BLAS uses Strassen's algorithm for big matrices, and yours are big indeed. Strassen's algorithm runs in about O(n^2.807), which is better than your O(n^3) naïve algorithm.

This is quite a complex topic, and it is well answered by Eric in the post above. I just want to point to a useful reference in this direction, page 84:

http://www.rrze.fau.de/dienste/arbeiten-rechnen/hpc/HPC4SE/

which suggests applying "loop unroll and jam" on top of blocking.

Can anyone explain this difference?

A general explanation is that the ratio of the number of operations to the amount of data is O(N^3)/O(N^2) = O(N). Thus matrix-matrix multiplication is a cache-bound algorithm, which means that, for large matrix sizes, it does not suffer from the common memory-bandwidth bottleneck. You can get up to 90% of the peak performance of your CPU if the code is well optimized. So the optimization potential, elaborated by Eric, is tremendous, as you observed. Actually, it would be very interesting to see the best-performing code, and to compile your final program with another compiler (Intel usually brags about being the best).

About half of the difference is accounted for by algorithmic improvement. (4096*4096)^3 is the complexity of your algorithm, about 4.7x10^21, while (4096*4096)^2.807 is about 1x10^20, a difference of about 47x.

The other 2x will be accounted for by more intelligent use of the cache, SSE instructions, and other such low-level optimizations.

Edit: I was wrong; n is the width, not width^2. The algorithmic factor would only actually account for about 4x, so there's still about another 22x to go. Threads, cache, and SSE instructions may well account for that.
