Question
I need to run a matrix-vector multiplication 240000 times per second. The matrix is 5x5 and is always the same, whereas the vector changes at each iteration. The data type is float. I was thinking of using some SSE (or similar) instructions.
1) I am concerned that the number of arithmetic operations is too small compared to the number of memory operations involved. Do you think I can get some tangible (e.g. > 20%) improvement?
2) Do I need the Intel compiler to do it?
3) Can you point out some references?
Thanks everybody!
Answer 1:
The Eigen C++ template library for vectors, matrices, etc. has both
- optimised code for small fixed-size matrices (as well as dynamically sized ones)
- optimised code that uses SSE optimisations
so you should give it a try.
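For illustration, a minimal sketch of what the fixed-size Eigen usage might look like (the values here are placeholders, not from the question):

```cpp
#include <Eigen/Dense>

int main() {
    // Fixed-size types let Eigen unroll the product and use SSE
    // where available; no heap allocation is involved.
    Eigen::Matrix<float, 5, 5> M = Eigen::Matrix<float, 5, 5>::Random();
    Eigen::Matrix<float, 5, 1> x = Eigen::Matrix<float, 5, 1>::Random();
    Eigen::Matrix<float, 5, 1> y = M * x;  // optimised fixed-size product
    return y(0) > 0.0f ? 0 : 1;  // use the result so it isn't elided
}
```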
Answer 2:
If you're using GCC, note that the -O3 option will enable auto-vectorization, which will automatically generate SSE or AVX instructions in many cases. In general, if you just write it as a simple for-loop, GCC will vectorize it. See http://gcc.gnu.org/projects/tree-ssa/vectorization.html for more information.
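For example, a plain loop like the following (the function name is illustrative) is the kind of code the auto-vectorizer targets; compile with -O3, and optionally -fopt-info-vec to see what GCC did. Whether a trip count of 5 actually gets vectorized depends on the GCC version and target:

```cpp
// y = M*x for a fixed 5x5 float matrix, written as a simple loop
// so that GCC's auto-vectorizer can handle it.
void mat5_vec5(const float M[5][5], const float x[5], float y[5]) {
    for (int i = 0; i < 5; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < 5; ++j)
            sum += M[i][j] * x[j];
        y[i] = sum;
    }
}
```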
Answer 3:
In principle the speedup could be 4 times with SSE (8 times with AVX). Let me explain.
Let's call your fixed 5x5 matrix M, and define the components of a 5D vector as (x,y,z,w,t). Now form a 5x4 matrix U from the first four vectors:
U =
xxxx
yyyy
zzzz
wwww
tttt
Next, do the matrix product MU = V. The matrix V contains the product of M and the first four vectors. The only problem is that for SSE we need to read in the rows of U, but in memory U is stored as xyzwtxyzwtxyzwtxyzwt, so we have to transpose it to xxxxyyyyzzzzwwwwtttt. This can be done with shuffles/blends in SSE (sketched below). Once we have this format, the matrix product is very efficient.
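A sketch of that transpose step, assuming four consecutive vectors stored interleaved as xyzwtxyzwtxyzwtxyzwt (the function name and layout details are illustrative):

```cpp
#include <xmmintrin.h>  // SSE

// Convert four 5-float vectors from interleaved xyzwt layout to the
// xxxx/yyyy/zzzz/wwww/tttt layout the product needs. The 4x4 core
// uses the shuffle-based _MM_TRANSPOSE4_PS macro; the fifth
// component is gathered separately.
void aos_to_soa(const float *v,   // 20 floats: xyzwt xyzwt xyzwt xyzwt
                __m128 out[5]) {  // out[0] = xxxx, ..., out[4] = tttt
    out[0] = _mm_loadu_ps(v + 0);   // x0 y0 z0 w0
    out[1] = _mm_loadu_ps(v + 5);   // x1 y1 z1 w1
    out[2] = _mm_loadu_ps(v + 10);  // x2 y2 z2 w2
    out[3] = _mm_loadu_ps(v + 15);  // x3 y3 z3 w3
    _MM_TRANSPOSE4_PS(out[0], out[1], out[2], out[3]);
    out[4] = _mm_set_ps(v[19], v[14], v[9], v[4]);  // t3 t2 t1 t0
}
```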
Instead of taking O(5x5x4) operations with scalar code, it only takes O(5x5) operations, i.e. a 4x speedup. With AVX the matrix U will be 5x8, so instead of taking O(5x5x8) operations it only takes O(5x5), i.e. an 8x speedup.
The matrix V, however, will be in xxxxyyyyzzzzwwwwtttt format so depending on the application it might have to be transposed to xyzwtxyzwtxyzwtxyzwt format.
Repeat this for the next four vectors (8 for AVX) and so forth until done.
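Putting the steps together, a minimal sketch of the MU product for four vectors at once, assuming the inputs are already in the transposed layout produced above (the function name is illustrative):

```cpp
#include <xmmintrin.h>  // SSE

// Multiply the fixed 5x5 matrix M by four vectors at once.
// in[j] holds the j-th component of all four vectors (the
// xxxx/yyyy/... layout); out is produced in the same layout.
void m5x5_by_4vecs(const float M[5][5],
                   const __m128 in[5], __m128 out[5]) {
    for (int i = 0; i < 5; ++i) {
        // out_i = M[i][0]*x + M[i][1]*y + M[i][2]*z + M[i][3]*w + M[i][4]*t
        __m128 acc = _mm_mul_ps(_mm_set1_ps(M[i][0]), in[0]);
        for (int j = 1; j < 5; ++j)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(M[i][j]), in[j]));
        out[i] = acc;
    }
}
```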
If you have control over the vectors, for example if your application generates the vectors on the fly, then you can generate them in xxxxyyyyzzzzwwwwtttt format and avoid the transpose of the array. In that case you should get a 4x speedup with SSE and an 8x speedup with AVX. If you combine this with threading, e.g. OpenMP, your speedup should be close to 16x (assuming four physical cores) with SSE. I think that's the best you can do with SSE.
Edit: Due to instruction-level parallelism (ILP) you can get another factor of 2 in speedup, so the speedup for SSE could be 32x with four cores (64x with AVX), and again another factor of 2 with Haswell due to FMA3.
Answer 4:
I would suggest using Intel IPP, which abstracts you away from the dependence on specific instruction-set techniques.
Answer 5:
This should be easy, especially if you're on Core 2 or later: you need five _mm_dp_ps, one _mm_mul_ps, two _mm_add_ps, one ordinary multiplication, plus some shuffles, loads and stores (and if the matrix is fixed, you can keep most of it in SSE registers, if you don't need them for anything else).
As for memory bandwidth: we're talking about roughly 4.8 megabytes of vector data per second (240000 vectors × 5 floats × 4 bytes), while memory bandwidths are in single-digit gigabytes per second.
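One way the dot-product approach might be arranged (assumes SSE4.1 for _mm_dp_ps; this particular split handles the first four columns with a dot product and the fifth with a scalar multiply-add, which differs slightly from the instruction count above):

```cpp
#include <smmintrin.h>  // SSE4.1 for _mm_dp_ps

// y = M*x for a 5x5 float matrix, one vector at a time.
void m5x5_dp(const float M[5][5], const float x[5], float y[5]) {
    __m128 x4 = _mm_loadu_ps(x);              // x[0..3]
    for (int i = 0; i < 5; ++i) {
        __m128 row = _mm_loadu_ps(M[i]);      // M[i][0..3]
        // 0xF1: multiply all four lanes, store the sum in lane 0
        __m128 d = _mm_dp_ps(row, x4, 0xF1);
        y[i] = _mm_cvtss_f32(d) + M[i][4] * x[4];
    }
}
```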
Answer 6:
What is known about the vector? Since the matrix is fixed, and if the vector can only take a limited set of values, then I'd suggest pre-computing the results and accessing them with a table look-up.
The classic optimization technique of trading memory for cycles...
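A sketch of the idea, assuming each incoming vector is one of K known candidates identified by an index (all names here are illustrative; the question doesn't say whether such a candidate set exists):

```cpp
#include <array>
#include <vector>

using Vec5 = std::array<float, 5>;

// Precompute M * c for every candidate vector c once; afterwards
// each "multiplication" is just table[vector_id].
std::vector<Vec5> precompute(const float M[5][5],
                             const std::vector<Vec5>& candidates) {
    std::vector<Vec5> table(candidates.size());
    for (size_t k = 0; k < candidates.size(); ++k)
        for (int i = 0; i < 5; ++i) {
            float sum = 0.0f;
            for (int j = 0; j < 5; ++j)
                sum += M[i][j] * candidates[k][j];
            table[k][i] = sum;
        }
    return table;
}
```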
Answer 7:
I would recommend having a look at an optimised BLAS library, such as the Intel MKL or the AMD ACML. Based on your description I would assume that you'd be after the SGEMV level-2 matrix-vector routine, to do y = A*x style operations.
If you really want to implement something yourself, using the (available) SSE..SSE4 and AVX instruction sets can offer significant performance improvements in some cases, although this is exactly what a good BLAS library will be doing. You also need to think a lot about cache-friendly data access patterns.
I don't know if this is applicable in your case, but can you operate on "chunks" of vectors at a time? So rather than repeatedly doing a y = A*x style operation, can you operate on blocks of [y1 y2 ... yn] = A * [x1 x2 ... xn]? If so, this means that you could use an optimised matrix-matrix routine, such as SGEMM (sketched below). Due to the data access patterns this may be significantly more efficient than repeated calls to SGEMV. If it were me, I would try to go down this path...
Hope this helps.
Answer 8:
If you know the vectors in advance (e.g., doing all 240k at once), you'd get a better speedup by parallelising the loop than by going to SSE. If you've already taken that step, or you don't know them all at once, SSE could be a big benefit.
If the memory is contiguous, then don't worry too much about the memory operations. If you've got a linked list or something then you're in trouble, but memory should be able to keep up without too much of a problem.
5x5 is a funny size, but you can do at least 4 flops in one SSE instruction and try to cut your arithmetic overheads. You don't need the Intel compiler, but it might be better; I've heard legends about how much better it is with arithmetic code. Visual Studio has intrinsics for dealing with SSE2, and I think up to SSE4 depending on what you need. Of course, you'd have to roll it yourself. Grabbing a library might be the smart move here.
Source: https://stackoverflow.com/questions/6617688/speed-up-matrix-multiplication-by-sse-c