My guess is that the __no_operation() intrinsic (ARM) instruction should take 1/(168 MHz) to execute, provided that each NOP executes in o
Because pipelining affects perceived execution time, a single instruction will measure differently than a sequence of the same instruction.
You could measure the timing of the scenario you care about using the built-in cycle-counting register, as discussed in your other post here.
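As a rough sketch of that cycle-counter approach: the code below enables the DWT cycle counter (CYCCNT) and converts a cycle delta to nanoseconds. The register addresses are the standard ARMv7-M ones, but check them against your part's reference manual. The struct and function names (cycle_counter_t, cycle_counter_enable, and so on) are made up for illustration, not from any vendor library. The registers are passed in through pointers only so that the same logic can be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper: pointers to the three registers involved. */
typedef struct {
    volatile uint32_t *demcr;      /* Debug Exception and Monitor Control */
    volatile uint32_t *dwt_ctrl;   /* DWT control register */
    volatile uint32_t *dwt_cyccnt; /* DWT cycle count register */
} cycle_counter_t;

/* Standard ARMv7-M addresses (assumed to apply to your Cortex-M4 part). */
cycle_counter_t cycle_counter_hw(void)
{
    cycle_counter_t cc = {
        (volatile uint32_t *)(uintptr_t)0xE000EDFCu, /* DEMCR */
        (volatile uint32_t *)(uintptr_t)0xE0001000u, /* DWT_CTRL */
        (volatile uint32_t *)(uintptr_t)0xE0001004u  /* DWT_CYCCNT */
    };
    return cc;
}

/* Turn on trace (TRCENA, bit 24), zero the counter, start it (CYCCNTENA, bit 0). */
void cycle_counter_enable(const cycle_counter_t *cc)
{
    *cc->demcr |= (1u << 24);
    *cc->dwt_cyccnt = 0u;
    *cc->dwt_ctrl |= 1u;
}

uint32_t cycle_counter_read(const cycle_counter_t *cc)
{
    return *cc->dwt_cyccnt;
}

/* Unsigned subtraction stays correct across a single 32-bit wraparound. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert cycles to nanoseconds for a given core clock, e.g. 168e6 Hz. */
double cycles_to_ns(uint32_t cycles, double core_hz)
{
    assert(core_hz > 0.0);
    return (double)cycles * 1e9 / core_hz;
}
```

On the target you would call cycle_counter_hw() once, enable the counter, read it immediately before and after the sequence of instructions you care about, and pass the two readings to cycles_elapsed(). Reading the counter costs some cycles itself, so it is worth timing an empty region first and subtracting that overhead.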
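Also, you might try using "and reg, reg" instead of nop, since the Cortex-M4 may not behave as you expect with nop instructions: the processor is allowed to remove a NOP from the pipeline before it reaches the execution stage, so it is not guaranteed to consume any time.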