In order to speed up the functions like np.std, np.sum etc along an axis of an n dimensional huge numpy array, it is recommended to apply along the last axis.
When I
Transpose just changes the strides, it doesn't touch the actual array. I think the reason why sum
etc. along the final axis is recommended (I'd like to see the source for that, btw.) is that when an array is C-ordered, walking along the final axis preserves locality of reference. That won't be the case after you transpose, since the transposed array will be Fortran-ordered.