iOS 4 Accelerate Cblas with 4x4 matrices

I’ve been looking into the Accelerate framework that was made available in iOS 4. Specifically, I made some attempts to use the Cblas routines in my linear algebra library in C. Now I can’t get the use of these functions to give me any performance gain over very basic routines. Specifically, the case of 4x4 matrix multiplication. Wherever I couldn’t make use of affine or homogeneous properties of the matrices, I’ve been using this routine (abridged):

float *mat4SetMat4Mult(const float *m0, const float *m1, float *target) {
    target[0] = m0[0] * m1[0] + m0[4] * m1[1] + m0[8] * m1[2] + m0[12] * m1[3];
    target[1] = ...etc...
    ...
    target[15] = m0[3] * m1[12] + m0[7] * m1[13] + m0[11] * m1[14] + m0[15] * m1[15];
    return target;
}

The equivalent function call for Cblas is:

cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
   4, 4, 4, 1.f, m0, 4, m1, 4, 0.f, target, 4);

Comparing the two, by making them run through a large number of pre-computed matrices filled with random numbers (each function gets the exact same input every time), the Cblas routine performs about 4x slower, when timed with the C clock() function.

This doesn’t seem right to me, and I’m left with the feeling that I’m doing something wrong somewhere. Do I have to to enable the device’s NEON unit and SIMD functionality somehow? Or shouldn’t I hope for better performance with such small matrices?

Very much appreciated,

Bastiaan

The BLAS and LAPACK libraries are designed for use with what I would consider "medium to large matrices" (from tens to tens of thousands on a side). They will deliver correct results for smaller matrices, but the performance will not be as good as it could be.

There are several reasons for this:

In order to deliver top performance, 3x3 and 4x4 matrix operations must be inlined, not in a library; the overhead of making a function call is simply too large to overcome when there is so little work to be done.
An entirely different set of interfaces is necessary to deliver top performance. The BLAS interface for matrix multiply takes variables to specify the sizes and leading dimensions of the matrices involved in the computation, not to mention whether or not to transpose the matrices and the storage layout. All those parameters make the library powerful, and don't hurt performance for large matrices. However, by the time it has finished determining that you are doing a 4x4 computation, a function dedicated to doing 4x4 matrix operations and nothing else is already finished.

What this means for you: if you would like to have dedicated small matrix operations provided, please go to bugreport.apple.com and file a bug requesting this feature.

Shervin Emami

The Apple WWDC2010 presentations say that Accelerate should still give a speedup for even a 3x3 matrix operation, so I would have assumed you should see a slight improvement for 4x4. But something you need to consider is that Accelerate & NEON are designed to greatly speed up integer operations but not necessarily floating-point operations. You didn't mention your CPU processor, and it seems that Accelerate will use either NEON or VFP for floating-point operations depending on your CPU. If it uses NEON instructions for 32-bit float operations then it should run fast, but if it uses VFP for 32-bit float or 64-bit double operations, then it will run very slowly (since VFP is not actually SIMD). So you should make sure that you are using 32-bit float operations with Accelerate, and make sure it will use NEON instead of VFP.

And another issue is that even if it does use NEON, there is no guarantee that your C compiler will generate faster NEON code than your simple C function does without NEON instructions, because C compilers such as GCC often generate terrible SIMD code, potentially running slower than standard code. Thats why its always important to test the speed of the generated code, and possibly to manually look at the generated assembly code to see if your compiler generated bad code.

来源：https://stackoverflow.com/questions/3950383/ios-4-accelerate-cblas-with-4x4-matrices

标签

iphone

ios

blas