I am interested in obtaining the number of nanoseconds it would take to execute one double-precision FLOP on a GeForce GTX 550 Ti.
To do that, I am following this explanation:
Compute capability 2.1 devices have a double-precision throughput of 4 operations per cycle per SM (8 if doing DFMA). This assumes all 32 threads are active in the dispatched warp.
4 ops/cycle/SM * 4 SMs * 1800 MHz * 2 ops/DFMA = 57.6 GFLOPS double
The calculation assumes all threads in a warp are active.
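As a quick sanity check, the product above can be evaluated in a few lines of Python. This is only a sketch: the inputs (4 DFMA per cycle per SM, 4 SMs, 1800 MHz, 2 FLOPs per DFMA) are the figures quoted above, and the function names peak_dp_gflops and ns_per_flop are made up for illustration. The result is an amortized throughput figure with every thread of every warp active, not the latency of a single operation.

```python
# Theoretical peak double-precision throughput from the quoted figures.
# Function names are hypothetical; the inputs are taken from the text above.

def peak_dp_gflops(dfma_per_cycle_per_sm, num_sms, clock_mhz, flops_per_dfma=2):
    """Peak GFLOPS, assuming all 32 threads of every dispatched warp are active."""
    flops_per_second = dfma_per_cycle_per_sm * num_sms * clock_mhz * 1e6 * flops_per_dfma
    return flops_per_second / 1e9


def ns_per_flop(gflops):
    """Amortized nanoseconds per FLOP (1 GFLOPS equals 1 FLOP per nanosecond)."""
    return 1.0 / gflops


gflops = peak_dp_gflops(dfma_per_cycle_per_sm=4, num_sms=4, clock_mhz=1800)
print(f"peak throughput: {gflops:.1f} GFLOPS")
print(f"amortized time per FLOP: {ns_per_flop(gflops):.4f} ns")
```

With these inputs the peak comes out at 57.6 GFLOPS, or roughly 0.017 ns per FLOP when the pipeline is fully occupied. A single dependent operation takes much longer than that, since its latency is not hidden by other warps.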
The code in your question contains two dependent operations that could be fused into a DFMA. Use cuobjdump -sass to examine the assembly. If you launch multiple warps on the same SM, the test turns into a measure of dependent-instruction throughput, not latency.