I have some analysis code that does heavy numerical operations using numpy. Just out of curiosity, I tried compiling it with Cython with few changes, and then I rewrote it
The difference may be due to versions 1 and 2 making a Python-level call to np.sum() for each row, while version 3 likely compiles to a tight, pure C loop.
Comparing the Cython-generated C source for versions 2 and 3 should be enlightening.
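
To make the two loop shapes concrete, here is a minimal sketch in plain Python. The original versions are not shown here, so the function names `row_sums_np_call` and `row_sums_explicit` and the sample array are hypothetical stand-ins, not the actual code being compared.

```python
import numpy as np


def row_sums_np_call(a):
    """One Python-level np.sum() call per row (the pattern attributed to versions 1 and 2)."""
    out = np.empty(a.shape[0], dtype=np.float64)
    for i in range(a.shape[0]):
        # Each iteration pays the overhead of a Python function call,
        # even when the surrounding code is compiled with Cython.
        out[i] = np.sum(a[i])
    return out


def row_sums_explicit(a):
    """Explicit inner loop (the pattern attributed to version 3)."""
    n_rows, n_cols = a.shape
    out = np.zeros(n_rows, dtype=np.float64)
    for i in range(n_rows):
        total = 0.0
        for j in range(n_cols):
            # No library call here: only indexing and addition.
            total += a[i, j]
        out[i] = total
    return out


# Hypothetical sample data, only to check that both shapes agree.
a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(row_sums_np_call(a))
print(row_sums_explicit(a))
```

In plain Python the explicit loop is the slower of the two. It only becomes a C loop once Cython knows the types involved, for example when `i`, `j`, and `total` are declared with `cdef` and `a` is a typed memoryview. Running `cython -a` on the .pyx file produces an annotated HTML report that highlights lines still interacting with the Python interpreter, which is a quick way to see where the two versions differ.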