Vectorized Trig functions in C?

Submitted by 筅森魡賤 on 2020-01-12 07:30:27

Question


I'm looking to calculate trig functions over large, independent batches (blocks of about 1024 values), and I'd like to take advantage of at least some of the parallelism that modern architectures offer.

When I compile a loop like this:

for(int i=0; i<SIZE; i++) {
   arr[i]=sin((float)i/1024);
}

GCC won't vectorize it and reports:

not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);

That makes sense to me. However, I'm wondering whether there's a library for parallel trig computations.

With just a simple Taylor series up to 11th order, GCC vectorizes all the loops, and I get more than a 2x speedup over a naive sin loop with bit-exact answers. With a 9th-order series, only the last two of 1600 values are off, each by a single bit, and the speedup is more than 3x. I'm sure someone has run into this problem before, but when I search I can't find any mention of libraries or the like.

A. Does something like this already exist?
B. If not, advice for optimizing parallel trig functions?

EDIT: I found a library called SLEEF (http://shibatch.sourceforge.net/), which is described in an accompanying paper and uses SIMD instructions to calculate several elementary functions. It uses SSE- and AVX-specific code, but I don't think it will be hard to turn it into standard C loops.


Answer 1:


Since you said you're using GCC, there are a few options:

  • http://gruntthepeon.free.fr/ssemath/
    • This implements the functions with SSE and SSE2 instructions.
  • http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
    • This has an alternate implementation. Some of the comments are pretty good.

That said, I'd probably look into GPGPU for a solution, perhaps writing it in CUDA or OpenCL (if I remember correctly, CUDA supports the sine function). Here are some libraries that look like they might make that easier:

  • https://code.google.com/p/slmath/
  • https://code.google.com/p/thrust/



Answer 2:


Since you're calculating harmonics here, I have some code that addresses a similar problem. It is already vectorized and faster than anything else I have found. As a side benefit, you get the cosine for free.




Answer 3:


What platform are you using? Many libraries of this sort already exist:

  • Intel provides the Vector Math Library (VML) with icc.
  • Apple provides the vForce library as part of the Accelerate framework.
  • HP provides its own Vector Math Library for Itanium (and possibly other architectures, too).
  • Sun provided libmvec with their compiler tools.
  • ...



Answer 4:


Instead of a Taylor series, I would look at the algorithms fdlibm uses. They should give you the same precision in fewer steps.




Answer 5:


My answer was to create my own library for exactly this, called vectrig: https://github.com/jeremysalwen/vectrig



Source: https://stackoverflow.com/questions/5109864/vectorized-trig-functions-in-c
