I have just noticed that the execution time of a script of mine nearly halves just by changing a multiplication to a division.
To investigate this, I have written a small benchmark.
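A minimal sketch of the kind of timing comparison being described, assuming NumPy arrays (the array size and the constants 0.5 and 2.0 are illustrative, not taken from the original script):

```python
# Minimal sketch of a multiply-vs-divide timing comparison, assuming NumPy.
# The array size and the constants 0.5 and 2.0 are illustrative only.
import timeit

import numpy as np

a = np.random.rand(1_000_000)

mul_time = timeit.timeit(lambda: a * 0.5, number=1_000)
div_time = timeit.timeit(lambda: a / 2.0, number=1_000)

print(f"multiply by 0.5: {mul_time:.3f} s")
print(f"divide by 2.0:   {div_time:.3f} s")
```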
This answer only looks at the vectorised operations; why the other operations are slow has already been answered by ead.
A lot of "optimisations" are based on old hardware. The assumptions behind them held true on older hardware, but they no longer hold true on newer hardware.
Division is slow: a division instruction is carried out by several hardware units, each performing one step of the calculation, one after another. That long chain of dependent steps is what gives a single division its high latency.
However, in the floating-point unit (FPU) found on most modern CPUs, those units are arranged in a pipeline for the division instruction. Once a unit has finished its step it isn't needed again for that operation, so if you have several divisions to perform, the units that would otherwise sit idle can start on the next division straight away. Each individual division is still slow, but the FPU can achieve a high throughput of division operations. Pipelining isn't the same as vectorisation, but the result is much the same: higher throughput when you have lots of the same operation to do.
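To make the latency/throughput distinction concrete, here is a hedged sketch (assuming NumPy; the array size and repeat count are arbitrary) comparing independent divisions, which can be overlapped in the pipeline (and by SIMD), with a dependent chain of divisions, where each step has to wait for the previous result:

```python
# Sketch contrasting independent divisions (the divide pipeline can be
# kept full) with a dependent chain (each division waits on the previous
# result). Array size and repeat count are illustrative assumptions; SIMD
# and NumPy loop overhead also contribute to the gap, not pipelining alone.
import timeit

import numpy as np

x = np.random.rand(1_000_000) + 0.5  # keep values well away from zero

# Independent: every element divided by a constant; many divisions can be
# in flight at once.
independent = timeit.timeit(lambda: x / 3.0, number=100)

# Dependent: a running quotient ((x[0] / x[1]) / x[2]) / ... where each
# division cannot start until the previous one has finished. (The running
# quotient may overflow to inf; on typical hardware that does not change
# the timing.)
dependent = timeit.timeit(lambda: np.divide.reduce(x), number=100)

print(f"independent divisions: {independent:.3f} s")
print(f"dependent chain:       {dependent:.3f} s")
```

If pipelining (plus SIMD) is doing its job, the dependent chain should come out noticeably slower, even though both runs perform the same number of divisions.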
Think of pipelining like traffic: compare three lanes moving at 30 mph with one lane moving at 90 mph. Each individual car in the slower traffic takes longer to arrive, but the three-lane road still has the same throughput.