Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba

Submitted by 冷眼眸甩不掉的悲伤 on 2019-11-29 05:21:37

Well, you are not really timing the same things here (or rather, you are timing different aspects).

E.g.

In [6]:    x = Series(np.random.randn(nobs))

In [7]:    y = Series(np.random.randn(nobs))

In [8]:  %timeit x + y
10000 loops, best of 3: 131 µs per loop

In [9]:  %timeit Series(np.random.randn(nobs)) + Series(np.random.randn(nobs))
1000 loops, best of 3: 1.33 ms per loop

So [8] times only the actual operation, while [9] includes the overhead of creating the Series (and generating the random numbers) PLUS the actual operation.

Another example is proc_ser vs proc_df. proc_df includes the overhead of assigning a particular column in the DataFrame (which is actually different for the initial creation and subsequent reassignment).
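The two assignment paths mentioned above can be illustrated with a minimal sketch (`proc_df` itself is defined in the question; the DataFrame and column names here are assumptions for illustration):

```python
import numpy as np
import pandas as pd

nobs = 10000
df = pd.DataFrame({'x': np.random.randn(nobs), 'y': np.random.randn(nobs)})

df['z'] = df['x'] + df['y']  # first assignment: the column 'z' is created
df['z'] = df['x'] - df['y']  # reassignment: the existing column is replaced

print(df['z'].equals(df['x'] - df['y']))  # True
```

Timing a function that performs this assignment therefore measures DataFrame bookkeeping on top of the arithmetic itself.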

So create the structures first (you can time that too, but that is a separate issue), then perform the exact same operation on each and time that.
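A minimal sketch of that separation using the stdlib `timeit` module: the inputs are built in the `setup` string (not timed), so only the operation itself is measured, and the difference from timing creation-plus-operation becomes obvious. The `nobs` value and variable names follow the snippets above.

```python
import timeit

setup = (
    "import numpy as np; import pandas as pd; "
    "nobs = 10000; "
    "x = pd.Series(np.random.randn(nobs)); "
    "y = pd.Series(np.random.randn(nobs))"
)

# Times only the addition; Series creation happens once, in setup.
op_only = timeit.timeit("x + y", setup=setup, number=500)

# Times Series creation + random generation + the addition, every loop.
with_setup = timeit.timeit(
    "pd.Series(np.random.randn(10000)) + pd.Series(np.random.randn(10000))",
    setup="import numpy as np; import pandas as pd",
    number=500,
)

print(op_only, with_setup)  # with_setup is substantially larger
```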

Further, you say that you don't need alignment. pandas gives you this by default (and there is no really easy way to turn it off, though it's just a simple check when the Series are already aligned). In numba, by contrast, you need to align them 'manually'.
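A small sketch of what that alignment means in practice: pandas matches index labels before the arithmetic, while the underlying ndarrays (via `.values`) are added purely positionally, with no alignment and no check.

```python
import numpy as np
import pandas as pd

# Two Series whose indexes are in different orders
a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

aligned = a + b          # pandas matches labels: 0->1+30, 1->2+20, 2->3+10
print(aligned.tolist())  # [31.0, 22.0, 13.0]

raw = a.values + b.values  # raw ndarrays add positionally
print(raw.tolist())        # [11.0, 22.0, 33.0]
```

When the indexes already line up, the alignment step degenerates to a check, which is the small per-operation cost referred to above.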

Following up on @Jeff's answer, the code can be further optimized by operating on the underlying NumPy arrays:

import numpy as np
import pandas as pd

nobs = 10000
x = pd.Series(np.random.randn(nobs))
y = pd.Series(np.random.randn(nobs))

%timeit proc_ser()
%timeit x + y
%timeit x.values + y.values

100 loops, best of 3: 11.8 ms per loop
10000 loops, best of 3: 107 µs per loop
100000 loops, best of 3: 12.3 µs per loop
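The speedup from `.values` does not change the result: when the indexes already line up, the ndarray addition produces the same numbers and merely skips the index machinery. A quick check (variable names follow the snippet above):

```python
import numpy as np
import pandas as pd

nobs = 10000
x = pd.Series(np.random.randn(nobs))
y = pd.Series(np.random.randn(nobs))

fast = x.values + y.values             # plain ndarray, no index handling
slow = x + y                           # Series, with the alignment check
print(np.allclose(fast, slow.values))  # True
```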