When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086), I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took.
Using a description largely based on the Intel Pentium architecture, and to cut a very, very long story short:
Since the timing of an instruction depends on the surrounding instructions, in practice it's usually best to time a representative piece of code rather than worry about individual instructions. However:
So, for example, if floating point add and multiply instructions each have a throughput of 2 and a latency of 5 (for multiply the latency is actually a bit greater, I think), that means that adding a register to itself or multiplying it by itself will likely take two clock cycles (since there are no other dependent values), whereas adding it to the result of a previous multiplication will take something like, or a bit less than, 2+5 clock cycles, depending on where you start/finish timing and on all sorts of other things. (During some of those clock cycles, another add/multiply operation could be taking place, so it's arguable how many cycles you actually attribute to the individual add/multiply instructions anyway...) The sketch below shows one way to observe the difference.
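As a rough illustration (a minimal sketch, not a rigorous benchmark), the following contrasts a loop whose additions are independent of each other with a loop where each addition feeds the next. The class name, accumulator names and iteration count are just illustrative choices of mine. On a typical out-of-order CPU the dependent version tends to run noticeably slower per addition, because it is bound by the add latency rather than the add throughput.

public class LatencyVsThroughput {
    static final int N = 100000000;   // iteration count chosen arbitrarily

    static double independentAdds(double x) {
        double a = 1.0, b = 1.0, c = 1.0, d = 1.0;
        for (int i = 0; i < N; i++) {
            // Four independent add chains: the CPU can keep several adds
            // in flight at once, so the cost per add approaches the throughput.
            a += x; b += x; c += x; d += x;
        }
        return a + b + c + d;
    }

    static double dependentAdds(double x) {
        double a = 1.0;
        for (int i = 0; i < N; i++) {
            // Each add needs the previous result, so the cost per add
            // approaches the full add latency.
            a += x; a += x; a += x; a += x;
        }
        return a;
    }

    public static void main(String[] args) {
        double x = 1e-9;
        long t0 = System.nanoTime();
        double r1 = independentAdds(x);
        long t1 = System.nanoTime();
        double r2 = dependentAdds(x);
        long t2 = System.nanoTime();
        System.out.println("independent adds: " + (t1 - t0) / 1000000 + " ms (" + r1 + ")");
        System.out.println("dependent adds:   " + (t2 - t1) / 1000000 + " ms (" + r2 + ")");
    }
}

Both loops perform the same number of additions; only the dependency pattern differs, which is the point being made above.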
Oh, and just as a concrete example: for the following Java code
public void runTest(double[] data, double randomVal) {
    for (int i = data.length - 1; i >= 0; i--) {
        data[i] = data[i] + randomVal;
    }
}
Hotspot 1.6.12 JIT-compiles the inner loop sequence to the following Intel code, consisting of a load-add-store for each position in the array (with 'randomVal' being held in XMM0a in this case):
0b3 MOVSD XMM1a,[EBP + #16]
0b8 ADDSD XMM1a,XMM0a
0bc MOVSD [EBP + #16],XMM1a
0c1 MOVSD XMM1a,[EBP + #8]
0c6 ADDSD XMM1a,XMM0a
0ca MOVSD [EBP + #8],XMM1a
...
Each group of load-add-store appears to take 5 clock cycles.
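For completeness, here's a minimal sketch of how one might time that representative loop (rather than individual instructions), as suggested above. The warm-up count, array size and repetition count are arbitrary choices of mine, not measured values, and a proper harness would do much more to control for JIT compilation and caching effects.

public class RunTestTiming {

    public void runTest(double[] data, double randomVal) {
        for (int i = data.length - 1; i >= 0; i--) {
            data[i] = data[i] + randomVal;
        }
    }

    public static void main(String[] args) {
        RunTestTiming t = new RunTestTiming();
        double[] data = new double[100000];   // size chosen arbitrarily, small enough to sit in cache

        // Warm up so Hotspot has JIT-compiled runTest before timing starts.
        for (int i = 0; i < 1000; i++) {
            t.runTest(data, 0.5);
        }

        int reps = 10000;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            t.runTest(data, 0.5);
        }
        long elapsed = System.nanoTime() - start;

        // Nanoseconds per array element, i.e. per load-add-store group;
        // multiplying by the clock rate in GHz gives a rough cycle count.
        System.out.println("ns per element: "
                + (double) elapsed / ((double) reps * data.length));
    }
}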