Again, the answer to all micro-performance questions is: measure, measure in the context of use, and don't extrapolate to other contexts.
Estimating execution time by counting instructions hasn't been possible without extraordinary sophistication for quite a long time.
The mismatch between processor and memory speed, and the introduction of caches to hide part of the latency (but not the bandwidth), make the execution time of a group of instructions very sensitive to the memory access pattern. That is something you can still optimize for with fairly high-level thinking. But it also means that code which looks worse when the memory access pattern is ignored can turn out to be better once it is taken into account.
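To make the memory access point concrete, here is a minimal sketch, assuming a matrix stored row-major in one contiguous `std::vector<int>`. The function names `sum_row_order` and `sum_column_order` are made up for this illustration. Both functions perform exactly the same additions and return the same result; only the order in which memory is visited differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption for this sketch: `m` holds a rows x cols matrix in row-major
// order, i.e. element (r, c) lives at index r * cols + c.

// Visits elements in the order they are laid out in memory, so consecutive
// accesses touch adjacent addresses.
long long sum_row_order(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same number of additions, but consecutive accesses are `cols` elements
// apart, so for a large matrix most accesses land on a different cache line.
long long sum_column_order(const std::vector<int>& m,
                           std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

An instruction count cannot tell these two apart. On typical cached hardware the column-order walk is often slower once the matrix no longer fits in cache, but how much (and whether the compiler rearranges the loops for you) is something to time with your own sizes, for instance with `std::chrono::steady_clock`, rather than assume.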
Then superscalar execution (the processor can do several things at once) and out-of-order execution (the processor can execute an instruction before an earlier one in the instruction stream) make basic counting meaningless even if you ignore memory access. You have to know which instructions need to be executed (so ignoring part of the structure isn't wise) and how the processor can group instructions if you want a good a priori estimate.
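As an illustration of why grouping matters more than counting, here is a sketch with two hypothetical functions, `sum_single_chain` and `sum_four_chains`. The second one executes more instructions, yet it gives the processor four independent chains of additions that it may be able to overlap, whereas the first forces every addition to wait for the previous one.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: each addition needs the result of the previous one,
// so the additions form a single serial dependency chain.
double sum_single_chain(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four accumulators: four independent dependency chains, plus a tail loop
// and a final combine step. More instructions on paper. Note that the
// floating-point rounding can differ from the single-chain version because
// the additions happen in a different order.
double sum_four_chains(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

Whether the second version is actually faster depends on the processor, the compiler and its flags, and where the data sits in the memory hierarchy, which is exactly why the only reliable answer is to measure it in the context where it will be used.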