Matrix-Multiplication: Why non-blocked outperforms blocked?

Question

I'm trying to speed up a matrix multiplication algorithm by blocking the loops to improve cache performance, yet the non-blocked version remains significantly faster regardless of matrix size, block size (I've tried lots of values between 2 and 200, powers of 2 and others) and optimization level.

Non-blocked version:

  for(size_t i = 0; i < n; ++i)
  {
    for(size_t k = 0; k < n; ++k)
    {
      int r = a[i][k];
      for(size_t j = 0; j < n; ++j)
      {
        c[i][j] += r * b[k][j];
      }
    }
  }

Blocked version:

  // Note: assumes n is a multiple of BLOCK; otherwise k and j run past the matrix bounds.
  for(size_t kk = 0; kk < n; kk += BLOCK)
  {
    for(size_t jj = 0; jj < n; jj += BLOCK)
    {
      for(size_t i = 0; i < n; ++i)
      {
        for(size_t k = kk; k < kk + BLOCK; ++k)
        {
          int r = a[i][k];
          for(size_t j = jj; j < jj + BLOCK; ++j)
          {
            c[i][j] += r * b[k][j];
          }
        }
      }
    }
  }

I also have a bijk version and a 6-loop bikj version, but they all get outperformed by the non-blocked version and I don't understand why. Every paper and tutorial that I've come across seems to indicate that the blocked version should be significantly faster. I'm running this on a Core i5, if that matters.

Answer 1:

Try blocking in one dimension only, not in both dimensions.

Matrix multiplication exhaustively processes elements from both matrices. Each row vector of the left matrix is processed repeatedly, multiplied against successive columns of the right matrix.

If the matrices do not both fit into the cache, some data will invariably end up loaded multiple times.

What we can do is break up the operation so that we work with about a cache-sized amount of data at one time. We want the row vector from the left operand to be cached, since it is repeatedly applied against multiple columns. But we should only take enough columns (at a time) to stay within the limit of the cache. For instance, if we can only take 25% of the columns, it means we will have to pass over the row vectors four times. We end up loading the left matrix from memory four times, and the right matrix only once.
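As a rough sketch of that idea (not a tuned implementation), the loop below blocks over columns only. The function name matmul_col_strips, the strip parameter, and the flat row-major arrays are assumptions made for illustration; the code in the question uses a[i][k]-style indexing and a BLOCK constant instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: c += a * b for n x n row-major int matrices.
   Only the column dimension is blocked: b and c are processed in
   vertical strips of at most `strip` columns. */
void matmul_col_strips(size_t n, size_t strip,
                       const int *a, const int *b, int *c)
{
    assert(strip > 0);
    for (size_t jj = 0; jj < n; jj += strip) {
        /* Clamp the strip so n need not be a multiple of strip. */
        size_t j_end = (jj + strip < n) ? jj + strip : n;
        for (size_t i = 0; i < n; ++i) {
            for (size_t k = 0; k < n; ++k) {
                int r = a[i * n + k];
                /* Walk one row segment of the current strip of b and c. */
                for (size_t j = jj; j < j_end; ++j)
                    c[i * n + j] += r * b[k * n + j];
            }
        }
    }
}
```

With strip chosen so that an n-by-strip slice of b stays cached, each strip of b is reused across every row i, while a is streamed once per strip. A strip of n/4 columns corresponds to the "pass over the row vectors four times" case described above.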

(If anything is to be loaded more than once, it should be the row vectors on the left, because they are flat in memory, which benefits from burst loading. Many cache architectures can perform a burst load from memory into adjacent cache lines faster than random access loads. If the right matrix were stored in column-major order, that would be even better: then we are doing dot products between flat arrays, which prefetch into the cache nicely.)

Let's also not forget the output matrix. The output matrix occupies space in the cache also.

I suspect one flaw in the 2D blocked approach is that each element of the output matrix depends on two inputs: its entire row in the left matrix, and its entire column in the right matrix. If the matrices are visited in blocks, each target element is visited multiple times to accumulate the partial results.

If we do a complete row-column dot product, we don't have to visit the c[i][j] more than once; once we take column j into row i, we are done with that c[i][j].
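A minimal sketch of that row-column form, assuming the right matrix is first copied into a transposed buffer so that each of its columns becomes a flat array. The names transpose, matmul_dot and bt are made up for this example and do not come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bt becomes the transpose of the n x n
   row-major matrix b. Out of place, so b and bt must not alias. */
void transpose(size_t n, const int *b, int *bt)
{
    assert(b != bt);
    for (size_t k = 0; k < n; ++k)
        for (size_t j = 0; j < n; ++j)
            bt[j * n + k] = b[k * n + j];
}

/* Hypothetical helper: c += a * b, where bt holds b transposed.
   Each c element is finished by a single complete dot product. */
void matmul_dot(size_t n, const int *a, const int *bt, int *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            int sum = 0;
            /* Row i of a and row j of bt are both contiguous. */
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] += sum;  /* touched exactly once */
        }
    }
}
```

The partial sum lives in a local variable, so c[i][j] is read and written once instead of once per block. Whether this beats the ikj loop from the question depends on the compiler and the machine, so it is worth measuring rather than assuming.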

Source: https://stackoverflow.com/questions/38190006/matrix-multiplication-why-non-blocked-outperforms-blocked

Tags

caching

matrix-multiplication

cpu-architecture