Question
I'm looking to calculate highly parallelized trig functions (in blocks of about 1024 values), and I'd like to take advantage of at least some of the parallelism that modern architectures offer.
When I compile a block like this:
for(int i=0; i<SIZE; i++) {
arr[i]=sin((float)i/1024);
}
GCC won't vectorize it, and says
not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);
Which makes sense to me. However, I'm wondering if there's a library to do parallel trig computations.
With just a simple Taylor series up to the 11th order, GCC will vectorize all the loops, and I'm getting speeds over twice as fast as a naive sin loop (with bit-exact answers; with a 9th-order series, only the last two of 1600 values are off by a single bit, for a >3x speedup). I'm sure someone has encountered a problem like this before, but when I search on Google, I find no mention of any libraries or the like.
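For reference, a minimal sketch of the kind of loop described above, not the exact code that was benchmarked. The names `taylor_sin11` and `fill_sin_table` are made up for illustration, and the polynomial is only meaningful for small arguments (roughly |x| <= pi/2, which covers i/1024 for i < 1600) because it does no range reduction. The loop body contains only multiplies and adds, so there is no `__builtin_sinf` call left to block the vectorizer.

```c
/* Hypothetical helper: 11th-order Taylor polynomial for sin(x),
 * evaluated in Horner form in x*x. No range reduction is done,
 * so it is only accurate for small |x| (roughly |x| <= pi/2). */
static inline float taylor_sin11(float x)
{
    const float x2 = x * x;
    float p = -1.0f / 39916800.0f;      /* -1/11! */
    p = p * x2 + 1.0f / 362880.0f;      /* +1/9!  */
    p = p * x2 - 1.0f / 5040.0f;        /* -1/7!  */
    p = p * x2 + 1.0f / 120.0f;         /* +1/5!  */
    p = p * x2 - 1.0f / 6.0f;           /* -1/3!  */
    p = p * x2 + 1.0f;                  /* leading term */
    return p * x;
}

/* Hypothetical wrapper: arr[i] = sin(i * scale) for i in [0, n).
 * With scale = 1.0f / 1024 this matches the loop in the question. */
void fill_sin_table(float *arr, int n, float scale)
{
    for (int i = 0; i < n; i++) {
        arr[i] = taylor_sin11((float)i * scale);
    }
}
```

Building with `-O3` enables GCC's loop vectorizer, and on a recent GCC `-fopt-info-vec` should report whether this loop was vectorized; it is worth checking that output rather than assuming it.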
A. Is there something existing already?
B. If not, advice for optimizing parallel trig functions?
EDIT: I found the following library called "SLEEF": http://shibatch.sourceforge.net/ which is described in this paper and uses SIMD instructions to calculate several elementary functions. It uses SSE and AVX specific code, but I don't think it will be hard to turn it into standard C loops.
Answer 1:
Since you said you were using GCC, it looks like there are some options:
- http://gruntthepeon.free.fr/ssemath/
- This implements the functions with SSE and SSE2 instructions.
- http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
- This has an alternate implementation. Some of the comments are pretty good.
That said, I'd probably look into GPGPU for a solution, for example by writing it in CUDA or OpenCL (if I remember correctly, CUDA supports the sine function). Here are some libraries that look like they might make it easier.
- https://code.google.com/p/slmath/
- https://code.google.com/p/thrust/
Answer 2:
Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.
Answer 3:
What platform are you using? Many libraries of this sort already exist:
- Intel provides the Vector Math Library (VML) with icc.
- Apple provides the vForce library as part of the Accelerate framework.
- HP provides their own Vector Math Library for Itanium (and maybe other architectures, too).
- Sun provided libmvec with their compiler tools.
- ...
Answer 4:
Instead of the Taylor series, I would look at the algorithms fdlibm uses. They should get you as much precision with fewer steps.
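To illustrate the idea: fdlibm first reduces the argument to roughly [-pi/4, pi/4] and then evaluates short fitted polynomials for sine and cosine on that small interval. The sketch below copies only that structure. It is not fdlibm's code: the names `reduced_sin` and `sin_block` are invented, the coefficients are plain Taylor coefficients used as placeholders (fdlibm's fitted values are not reproduced here), and the single-constant reduction loses accuracy for large |x|, where fdlibm does considerably more work.

```c
/* Hypothetical sketch: sine via quadrant reduction plus short polynomials.
 * Assumes moderate |x| (a few multiples of pi) and two's-complement ints. */
static inline float reduced_sin(float x)
{
    const float two_over_pi = 0.63661977f;
    const float pi_over_2   = 1.57079633f;

    /* q = nearest integer to x / (pi/2); r = x - q*(pi/2) is then
     * in about [-pi/4, pi/4]. */
    const int   q  = (int)(x * two_over_pi + (x < 0.0f ? -0.5f : 0.5f));
    const float r  = x - (float)q * pi_over_2;
    const float r2 = r * r;

    /* Placeholder Taylor coefficients: degree 7 for sin, degree 8 for cos.
     * On the reduced interval these already land near float precision. */
    const float s = r * (1.0f + r2 * (-1.0f / 6.0f
                       + r2 * (1.0f / 120.0f
                       + r2 * (-1.0f / 5040.0f))));
    const float c = 1.0f + r2 * (-0.5f
                       + r2 * (1.0f / 24.0f
                       + r2 * (-1.0f / 720.0f
                       + r2 * (1.0f / 40320.0f))));

    /* Quadrant selection: q mod 4 = 0 -> sin r, 1 -> cos r,
     * 2 -> -sin r, 3 -> -cos r. */
    const float v = (q & 1) ? c : s;
    return (q & 2) ? -v : v;
}

/* Hypothetical wrapper applying the sketch to a block of inputs. */
void sin_block(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++) {
        out[i] = reduced_sin(in[i]);
    }
}
```

Because the polynomial only has to cover [-pi/4, pi/4], a degree-7 sine polynomial does the job that needed 11th order over the wider range in the question. The loop body has no library calls, but whether a particular compiler vectorizes the selects and the float-to-int conversion is something to verify on the target.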
Answer 5:
My answer was to create my own library to do exactly this called vectrig: https://github.com/jeremysalwen/vectrig
Source: https://stackoverflow.com/questions/5109864/vectorized-trig-functions-in-c