Automatic vectorization of matrix multiplication

Question

I'm fairly new to SIMD and wanted to see whether I could get GCC to vectorise a simple operation for me.

So I looked at this post and wanted to do more or less the same thing (but with GCC 5.4.0 on 64-bit Linux, for a Kaby Lake processor).

I essentially have this function:

/* m1 = N x M matrix, m2 = M x P matrix, m3 = N x P matrix & output */
void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
{
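    int i, j, k;

    for (i = 0; i < N; i++)
        for (j = 0; j < P; j++)
        {
            /* accumulate the dot product of row i of m1 and column j of m2 */
            double tmp = 0.0;

            for (k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];

            /* store the result in the output matrix */
            m3[i][j] = tmp;
        }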
}

I compile this with -O2 -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5, but I don't see any message saying that vectorization was done.

If anyone could help me out, that would be very much appreciated.

Answer 1:

Your command does not produce any vectorization report. You can use -fopt-info-vec to turn the report on, but do not rely on it alone. The compiler sometimes lies: it may vectorize a loop and report it, yet the vectorized version is not the one that runs. You can check for an actual improvement by measuring the speedup. First, disable vectorization and measure the time t1. Then enable it and measure the time t2. The speedup is t1/t2: if it is greater than 1, the compiler improved the code; if it is 1, there is no improvement; if it is less than 1, the auto-vectorizer made things worse. Alternatively, add -S to your command and inspect the generated assembly in the separate .s file.
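As a rough illustration of the t1/t2 measurement described above, here is a minimal sketch. The helpers make_matrix and time_mmul are hypothetical names introduced only for this example, and the kernel is the question's mmul with the store into m3 written the intended way. The idea is to build the same file twice, once with -O2 -fno-tree-vectorize (giving t1) and once with -O2 -ftree-vectorize -fopt-info-vec (giving t2), and compare the reported times.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Kernel from the question: m3 (N x P) = m1 (N x M) * m2 (M x P). */
void mmul(double **m1, double **m2, double **m3, int N, int M, int P)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < P; j++) {
            double tmp = 0.0;
            for (int k = 0; k < M; k++)
                tmp += m1[i][k] * m2[k][j];
            m3[i][j] = tmp;
        }
}

/* Hypothetical helper: allocate a rows x cols matrix filled with `value`. */
double **make_matrix(int rows, int cols, double value)
{
    double **m = malloc((size_t)rows * sizeof *m);
    if (m == NULL)
        exit(EXIT_FAILURE);
    for (int i = 0; i < rows; i++) {
        m[i] = malloc((size_t)cols * sizeof **m);
        if (m[i] == NULL)
            exit(EXIT_FAILURE);
        for (int j = 0; j < cols; j++)
            m[i][j] = value;
    }
    return m;
}

/* Hypothetical helper: average CPU seconds per mmul call over `reps` calls. */
double time_mmul(double **m1, double **m2, double **m3,
                 int N, int M, int P, int reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        mmul(m1, m2, m3, N, M, P);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

To use it, call time_mmul from a small main with matrices large enough that one call takes a measurable amount of time, and print both the time and an element of m3 so the work cannot be discarded. Dividing the time from the non-vectorized build by the time from the vectorized build gives the speedup.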

NOTE: if you want to see the full power of auto-vectorization, add -march=native and remove -msse2.

UPDATE: When you use variables such as N, M, etc. as the loop bounds, you might not see vectorization, so you should try constants instead. In my experience, matrix-matrix multiplication is vectorizable with GCC 4.8, 5.4 and 6.2. Other compilers such as Clang/LLVM, ICC and MSVC vectorize it as well. As mentioned in the comments, if you use double or float data types you might need -ffast-math, which is enabled at the -Ofast optimization level, to tell the compiler that you don't need a high-accuracy result (that is OK most of the time). This is because compilers are more careful about floating-point operations: vectorizing the sum changes the order of the additions, and floating-point addition is not associative, so the result can differ slightly.
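The suggestion to use constant loop bounds could look like the sketch below. The sizes ROWS, INNER and COLS and the function name mmul_const are hypothetical. Two things here are my own assumptions rather than part of the answer: the switch from double ** to contiguous 2-D arrays, and the restrict qualifiers, which promise the compiler that the three matrices do not overlap. One way to try it is gcc -Ofast -march=native -fopt-info-vec -c file.c. Whether the loops actually get vectorized depends on the compiler version, so check the report and the timing rather than assuming.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time sizes, used instead of run-time N, M, P so the
   compiler knows every trip count. */
enum { ROWS = 64, INNER = 64, COLS = 64 };

/* m3 (ROWS x COLS) = m1 (ROWS x INNER) * m2 (INNER x COLS).
   restrict is an added assumption: the matrices must not alias each other. */
void mmul_const(double (*restrict m1)[INNER],
                double (*restrict m2)[COLS],
                double (*restrict m3)[COLS])
{
    assert(m1 != NULL && m2 != NULL && m3 != NULL);

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            double tmp = 0.0;

            /* Floating-point reduction: the compiler may only reorder these
               additions when -ffast-math (or -Ofast) is in effect. */
            for (int k = 0; k < INNER; k++)
                tmp += m1[i][k] * m2[k][j];

            m3[i][j] = tmp;
        }
}
```

A caller would declare the operands as ordinary arrays, for example double a[ROWS][INNER], and pass them directly as mmul_const(a, b, c).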

Source: https://stackoverflow.com/questions/43243244/automatic-vectorization-of-matrix-multiplication

Tags

gcc

vectorization

sse

simd