问题
- I tried comparing the performance of Pandas and the traditional loop. I realized that with the same input and output, Pandas performed terribly fast calculations compared to the traditional loop.
My code:
#df_1h has been imported before
import time
n = 14
pd.options.display.max_columns = 8
display("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
close = df_1h['close']
start = time.time()
df_1h['sma_14_pandas'] = close.rolling(14).mean()
end = time.time()
display('pandas: {}'.format(end - start))
start = time.time()
df_1h['sma_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
df_1h['sma_14_loop'][i] = close[i-n+1:i+1].mean()
end = time.time()
display('loop: {}'.format(end - start))
display(df_1h.tail())
Output:
"df_1h's Shape 16598 rows x 15 columns"
'pandas: 0.0030088424682617188'
'loop: 7.2529966831207275'
open_time open high low ... ignore rsi_14 sma_14_pandas sma_14_loop
16593 1.562980e+12 11707.39 11739.90 11606.04 ... 0.0 51.813151 11646.625714 11646.625714
16594 1.562983e+12 11664.32 11712.61 11625.00 ... 0.0 49.952679 11646.834286 11646.834286
16595 1.562987e+12 11632.64 11686.47 11510.00 ... 0.0 47.583619 11643.321429 11643.321429
16596 1.562990e+12 11582.06 11624.04 11500.00 ... 0.0 48.725262 11644.912857 11644.912857
16597 1.562994e+12 11604.96 11660.00 11588.16 ... 0.0 50.797087 11656.723571 11656.723571
5 rows × 15 columns
- Pandas almost faster than 2.5k times!!!
My Questions:
- Is my code wrong?
- If my code is correct, why is Pandas so fast?
- How to define custom functions that run so fast for Pandas?
回答1:
As to your three questions:
- Your code is correct in the sense that it produces the correct result. Explicitely iterating over the rows of a dataframe is as a rule however not so good an idea in terms of performance. Most often the same result can be achieved far more efficiently by pandas methods (as you demonstrated yourself).
- Pandas is so fast because it uses numpy under the hood. Numpy implements highly efficient array operations. Also, the original creator of pandas, Wes McKinney, is kinda obsessed with efficiency and speed.
- Use numpy or other optimized libraries. I recommend reading the Enhancing performance section of the pandas docs. If you can't use built-in pandas methods, if often makes sense to retrieve a numpy respresentation of the dataframe or series (using the
valueattribute orto_numpy()method), do all the calculations on the numpy array and only then store the result back to the dataframe or series.
Why is the loop algorithm so slow?
In your loop algorithm, mean is calculated over 16500 times, each time adding up 14 elements to find the mean. Pandas' rolling method uses a more sophisticated approach, heavily reducing the number of arithmetic operations.
You can achieve similar (and in fact about 3 times better) performance than pandas if you do the calculations in numpy. This is illustrated in the following example:
import pandas as pd
import numpy as np
import time
data = np.random.uniform(10000,15000,16598)
df_1h = pd.DataFrame(data, columns=['Close'])
close = df_1h['Close']
n = 14
print("df_1h's Shape {} rows x {} columns".format(df_1h.shape[0], df_1h.shape[1]))
start = time.time()
df_1h['SMA_14_pandas'] = close.rolling(14).mean()
print('pandas: {}'.format(time.time() - start))
start = time.time()
df_1h['SMA_14_loop'] = np.nan
for i in range(n-1, df_1h.shape[0]):
df_1h['SMA_14_loop'][i] = close[i-n+1:i+1].mean()
print('loop: {}'.format(time.time() - start))
def np_sma(a, n=14) :
ret = np.cumsum(a)
ret[n:] = ret[n:] - ret[:-n]
return np.append([np.nan]*(n-1), ret[n-1:] / n)
start = time.time()
df_1h['SMA_14_np'] = np_sma(close.values)
print('np: {}'.format(time.time() - start))
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_pandas.values, equal_nan=True)
assert np.allclose(df_1h.SMA_14_loop.values, df_1h.SMA_14_np.values, equal_nan=True)
Output:
df_1h's Shape 16598 rows x 1 columns
pandas: 0.0031278133392333984
loop: 7.605962753295898
np: 0.0010571479797363281
来源:https://stackoverflow.com/questions/57018511/why-is-pandas-so-madly-fast-how-to-define-such-functions