iOS 4 Accelerate Cblas with 4x4 matrices

时光总嘲笑我的痴心妄想 提交于 2019-12-05 07:03:43

The BLAS and LAPACK libraries are designed for use with what I would consider "medium to large matrices" (from tens to tens of thousands on a side). They will deliver correct results for smaller matrices, but the performance will not be as good as it could be.

There are several reasons for this:

  • In order to deliver top performance, 3x3 and 4x4 matrix operations must be inlined, not in a library; the overhead of making a function call is simply too large to overcome when there is so little work to be done.
  • An entirely different set of interfaces is necessary to deliver top performance. The BLAS interface for matrix multiply takes variables to specify the sizes and leading dimensions of the matrices involved in the computation, not to mention whether or not to transpose the matrices and the storage layout. All those parameters make the library powerful, and don't hurt performance for large matrices. However, by the time it has finished determining that you are doing a 4x4 computation, a function dedicated to doing 4x4 matrix operations and nothing else is already finished.

What this means for you: if you would like to have dedicated small matrix operations provided, please go to bugreport.apple.com and file a bug requesting this feature.

Shervin Emami

The Apple WWDC2010 presentations say that Accelerate should still give a speedup for even a 3x3 matrix operation, so I would have assumed you should see a slight improvement for 4x4. But something you need to consider is that Accelerate & NEON are designed to greatly speed up integer operations but not necessarily floating-point operations. You didn't mention your CPU processor, and it seems that Accelerate will use either NEON or VFP for floating-point operations depending on your CPU. If it uses NEON instructions for 32-bit float operations then it should run fast, but if it uses VFP for 32-bit float or 64-bit double operations, then it will run very slowly (since VFP is not actually SIMD). So you should make sure that you are using 32-bit float operations with Accelerate, and make sure it will use NEON instead of VFP.

And another issue is that even if it does use NEON, there is no guarantee that your C compiler will generate faster NEON code than your simple C function does without NEON instructions, because C compilers such as GCC often generate terrible SIMD code, potentially running slower than standard code. Thats why its always important to test the speed of the generated code, and possibly to manually look at the generated assembly code to see if your compiler generated bad code.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!