How many FLOPs does tanh need?

星月不相逢 2021-02-14 18:15

I would like to compute how many FLOPs each layer of LeNet-5 (paper) needs. Some papers give the total FLOPs for other architectures (1, 2, 3). However, those papers don't give d

2 Answers
  •  半阙折子戏
     2021-02-14 18:39

    If we look at the glibc implementation of tanh(x), we see:

    1. For x values greater than 22.0 (in double precision), tanh(x) can be safely assumed to be 1.0, so there is almost no cost.
    2. For very small x (let's say x < 2^(-55)), another cheap approximation is possible: tanh(x) = x*(1+x), so only two floating point operations are needed.
    3. For the values in between, one can rewrite tanh(x) = (1-exp(-2x))/(1+exp(-2x)). However, one must be careful, because 1-exp(t) is very problematic for small t-values due to loss of significance, so one uses expm1(x) = exp(x)-1 and calculates tanh(x) = -expm1(-2x)/(expm1(-2x)+2).

    So basically, the worst case is about 2 times the number of operations needed for expm1, which is a pretty complicated function. The best way is probably just to measure the time needed to calculate tanh(x) compared with the time needed for a simple multiplication of two doubles.

    My (sloppy) experiments on an Intel processor yielded the following result, which gives a rough idea:

    [figure omitted: measured cost of tanh(x), in multiples of a double multiplication, plotted against x]

    So for very small numbers and for numbers > 22 there is almost no cost, for numbers up to 0.1 we pay 6 FLOPs, then the cost rises to about 20 FLOPs per tanh calculation.

    The key takeaway: the cost of calculating tanh(x) depends on the argument x, and the maximal cost is somewhere between 10 and 100 FLOPs.


    There is an Intel instruction called F2XM1 which computes 2^x - 1 for -1.0 <= x <= 1.0, which could be used for computing tanh, at least for some range. However, if Agner Fog's instruction tables are to be believed, this operation costs about 60 FLOPs.


    Another problem is vectorization: the normal glibc implementation is not vectorized, as far as I can see. So if your program uses vectorization and has to fall back to an unvectorized tanh implementation, it will slow the program down even more. For this, the Intel compiler has the MKL library, which vectorizes tanh among other functions.

    As you can see in those tables, the maximal cost is about 10 clocks per operation (while a float operation costs about 1 clock).


    I guess there is a chance you could win some FLOPs by using the -ffast-math compiler option, which results in a faster but less precise program (that is an option for CUDA or C/C++; I'm not sure whether this can be done for Python/NumPy).


    The C++ code which produced the data for the figure (compiled with g++ -std=c++11 -O2). Its intent is not to give an exact number, but a first impression of the costs:

    #include <cmath>
    #include <chrono>
    #include <iostream>
    #include <vector>

    int main(){
       const std::vector<double> starts={1e-30, 1e-18, 1e-16, 1e-10, 1e-5, 1e-2, 1e-1, 0.5, 0.7, 0.9, 1.0, 2.0, 10, 20, 23, 100, 1e3, 1e4};
       const double FACTOR=1.0+1e-11;
       const size_t ITER=100000000;

       //warm-up:
       double res=1.0;
       for(size_t i=0;i<4*ITER;i++){
          res*=FACTOR;
       }
       //overhead: cost of the loop plus one multiplication per iteration
       auto begin = std::chrono::high_resolution_clock::now();
       for(size_t i=0;i<ITER;i++){
          res*=FACTOR;
       }
       auto end = std::chrono::high_resolution_clock::now();
       auto overhead=std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count();
       //std::cout<<"overhead: "<<overhead<<"\n";

       //measure tanh for the different argument ranges:
       for(double start : starts){
          double x=start;
          begin = std::chrono::high_resolution_clock::now();
          for(size_t i=0;i<ITER;i++){
             x=start+std::tanh(x)*1e-13; //keep x near start, prevent optimizing tanh away
          }
          end = std::chrono::high_resolution_clock::now();
          auto time=std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count();
          std::cout<<start<<" "<<double(time-overhead)/ITER<<" ns per tanh\n";
       }
       std::cerr<<"overhead check: "<<res<<"\n"; //prevent dead-code elimination
    }
