I am getting really weird timings for the following code:
import numpy as np
s = 0
for i in range(10000000):
s += np.float64(1) # replace with np.float32
Summary
If an arithmetic expression contains both numpy and built-in numbers, Python arithmetics works slower. Avoiding this conversion removes almost all of the performance degradation I reported.
Details
Note that in my original code:
s = np.float64(1)
for i in range(10000000):
s = (s + 8) * s % 2399232
the types float and numpy.float64 are mixed up in one expression. Perhaps Python had to convert them all to one type?
s = np.float64(1)
for i in range(10000000):
s = (s + np.float64(8)) * s % np.float64(2399232)
If the runtime is unchanged (rather than increased), it would suggest that's what Python indeed was doing under the hood, explaining the performance drag.
Actually, the runtime fell by 1.5 times! How is it possible? Isn't the worst thing that Python could possibly have to do was these two conversions?
I don't really know. Perhaps Python had to dynamically check what needs to be converted into what, which takes time, and being told what precise conversions to perform makes it faster. Perhaps, some entirely different mechanism is used for arithmetics (which doesn't involve conversions at all), and it happens to be super-slow on mismatched types. Reading numpy source code might help, but it's beyond my skill.
Anyway, now we can obviously speed things up more by moving the conversions out of the loop:
q = np.float64(8)
r = np.float64(2399232)
for i in range(10000000):
s = (s + q) * s % r
As expected, the runtime is reduced substantially: by another 2.3 times.
To be fair, we now need to change the float version slightly, by moving the literal constants out of the loop. This results in a tiny (10%) slowdown.
Accounting for all these changes, the np.float64 version of the code is now only 30% slower than the equivalent float version; the ridiculous 5-fold performance hit is largely gone.
Why do we still see the 30% delay? numpy.float64 numbers take the same amount of space as float, so that won't be the reason. Perhaps the resolution of the arithmetic operators takes longer for user-defined types. Certainly not a major concern.