My guess is that the __no_operation() intrinsic (ARM) instruction should take 1/(168 MHz) to execute, provided that each NOP executes in o
If you configure all clocks carefully in the Reset and Clock Control (RCC) block and know every clock frequency, you can calculate the exact execution time for most instructions and at least a worst-case bound for all of them. For example, I'm using an STM32F439ZI, a Cortex-M4 part that is compatible with the STM32F407. The clock tree in the reference manual shows the PLL and all bus prescalers. In my case I have an 8 MHz external quartz crystal with the PLL configured to provide an 84 MHz system clock (SYSCLK). That means one processor cycle is 1.0/84e6 ≈ 11.9 ns, or roughly 12 ns.
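To make the clock arithmetic concrete, here is a minimal host-side sketch of how SYSCLK and the cycle time follow from the PLL dividers. The formula (VCO = input × PLLN / PLLM, SYSCLK = VCO / PLLP) is the one given in the STM32F4 reference manual. The divider values PLLM = 8, PLLN = 336, PLLP = 4 are only my assumption of one setting that turns 8 MHz into 84 MHz, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Main PLL output used as SYSCLK:
 *   f_VCO  = f_in * PLLN / PLLM
 *   SYSCLK = f_VCO / PLLP
 * The divider values passed in are assumptions, not read from hardware. */
uint32_t pll_sysclk_hz(uint32_t hse_hz, uint32_t pllm, uint32_t plln, uint32_t pllp)
{
    uint64_t vco_hz = (uint64_t)hse_hz * plln / pllm;
    return (uint32_t)(vco_hz / pllp);
}

/* Duration of one SYSCLK cycle in picoseconds (truncated).
 * Picoseconds keep the arithmetic in integers: 84 MHz gives 11904 ps. */
uint32_t cycle_time_ps(uint32_t sysclk_hz)
{
    return (uint32_t)(1000000000000ULL / sysclk_hz);
}
```

Calling `pll_sysclk_hz(8000000u, 8u, 336u, 4u)` gives 84000000, and `cycle_time_ps(84000000u)` gives 11904, i.e. about 11.9 ns per cycle. Check the allowed divider ranges and VCO limits for your exact part before using any such setting.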
To find out how many SYSCLK cycles an instruction takes, use the ARM® Cortex®‑M4 Processor Technical Reference Manual. For example, MOV takes one cycle in most cases, and so does ADD, which means that after about 12 ns the result of the addition is stored in the register and ready for use by another operation.
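As a rough illustration of turning cycle counts into time, the sketch below sums worst-case cycles for a short straight-line sequence and converts the total to nanoseconds, rounding up. The type and function names are hypothetical. The MOV and ADD counts are the ones mentioned above, while the 2 to 12 cycle range for UDIV is what I recall from the TRM's instruction timing table, so verify it there.

```c
#include <stddef.h>
#include <stdint.h>

/* Cycle counts for one instruction; values must come from the TRM. */
typedef struct {
    const char *mnemonic;
    uint32_t best_cycles;
    uint32_t worst_cycles;
} insn_timing_t;

/* Worst-case cycle total for a straight-line sequence.
 * Pipeline effects, wait states and interrupts are ignored. */
uint32_t sum_worst_cycles(const insn_timing_t *seq, size_t n)
{
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += seq[i].worst_cycles;
    }
    return total;
}

/* Convert cycles to nanoseconds, rounding up so the bound stays safe. */
uint32_t cycles_to_ns_ceil(uint32_t cycles, uint32_t sysclk_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000u + sysclk_hz - 1u) / sysclk_hz);
}
```

For a sequence of MOV (1), ADD (1) and UDIV (worst case 12), the total is 14 cycles, which at 84 MHz rounds up to 167 ns. A single one-cycle instruction comes out as 12 ns.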
You can use that information to schedule processor resources in many cases, periodic interrupts for instance. Electrical engineers and low-level embedded software developers do exactly that when it comes to strict real-time and safety-critical systems. During design, engineers normally work with the worst-case execution time and ignore the pipeline to get a quick, rough insight into the processor load. During implementation, you use tools for precise timing analysis and refine the software.
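A quick load estimate of that kind can be sketched as follows: take the worst-case cycle count of an interrupt handler, multiply by how often it fires, and compare with the cycles available per second. The function name and the example numbers (840 cycles at 10 kHz) are invented for illustration, not measurements.

```c
#include <stdint.h>

/* CPU load of one periodic interrupt in permille (parts per thousand),
 * rounded up. wcet_cycles is the handler's worst-case execution time in
 * SYSCLK cycles, rate_hz is how often it fires per second.
 * Assumes wcet_cycles * rate_hz * 1000 fits in 64 bits. */
uint32_t isr_load_permille(uint32_t wcet_cycles, uint32_t rate_hz, uint32_t sysclk_hz)
{
    uint64_t busy_cycles_per_s = (uint64_t)wcet_cycles * rate_hz;
    return (uint32_t)((busy_cycles_per_s * 1000u + sysclk_hz - 1u) / sysclk_hz);
}
```

With an 84 MHz SYSCLK, a handler that needs at most 840 cycles and fires at 10 kHz costs 100 permille, i.e. 10 % of the processor. Adding up the results for all periodic interrupts gives a first rough load figure that later timing analysis can refine.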
Over the course of design and implementation, the non-deterministic effects are reduced to a negligible level.