Question
I have identified one pandas command
timeseries.loc[z, x] = y
to be responsible for most of the time spent in an iteration, and now I am looking for better approaches to accelerate it. The loop covers not even 50k elements (and the production goal is ~250k or more), but already needs a sad 20 seconds.
Here is my code (ignore the top half; it is just the timing helper):

import time   # needed by the timing calls below

def populateTimeseriesTable(df, observable, timeseries):
    """
    Go through all rows of df and
    put the observable into the timeseries
    at correct row (symbol), column (tsMean).
    """
    print "len(df.index)=", len(df.index)    # show number of rows
    global bf, t
    bf = time.time()                         # set 'before' to now
    t = dict([(i, 0) for i in range(5)])     # fill category timings with zeros

    def T(i):
        """
        Timing helper: add elapsed time to category 'i', then set 'before' to now.
        """
        global bf, t
        t[i] = t[i] + (time.time() - bf)
        bf = time.time()

    for i in df.index:                       # this is the slow loop
        bf = time.time()
        sym = df["symbol"][i]
        T(0)
        tsMean = df["tsMean"][i]
        T(1)
        tsMean = tsFormatter(tsMean)
        T(2)
        o = df[observable][i]
        T(3)
        timeseries.loc[sym, tsMean] = o      # <-- the expensive step (see timings below)
        T(4)

    from pprint import pprint
    print "times needed (total = %.1f seconds) for each command:" % sum(t.values())
    pprint(t)
    return timeseries
With (not important, not slow):

def tsFormatter(ts):
    "as human readable string, only up to whole seconds"
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))
--> The to-be-optimized code is in the for-loop.
(T and t are just a helper function and a dict for the timing.)
I have timed every step. The vast majority of time:
len(df.index)= 47160
times needed (total = 20.2 seconds) for each command:
{0: 1.102,
1: 0.741,
2: 0.243,
3: 0.792,
4: 17.371}
is spent in the last step
timeseries.loc[sym, tsMean] = o
I have already downloaded and installed PyPy - but sadly, it doesn't support pandas yet.
Any ideas how to speed up populating a 2D array?
Thanks!
Edit: Sorry, I hadn't mentioned it - 'timeseries' is a DataFrame too:
timeseries = pd.DataFrame({"name": titles}, index=index)
Answer 1:
UPDATE: as of Pandas 0.20.1 the .ix indexer is deprecated, in favor of the stricter .iloc and .loc indexers.
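For reference, a minimal sketch of those replacements for the two .ix access styles benchmarked below (the toy DataFrame here is mine, just for illustration):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# label-based scalar access (replaces df.ix[2, 'b']):
print(df.loc[2, 'b'])   # general label indexer
print(df.at[2, 'b'])    # fast scalar-only label indexer

# position-based scalar access (replaces df.ix[2, 1]):
print(df.iloc[2, 1])    # general positional indexer
print(df.iat[2, 1])     # fast scalar-only positional indexer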
=====================================================================
@jezrael has provided an interesting comparison and I decided to repeat it with more indexing methods and against a 10M-row DF (actually the size doesn't matter in this particular case):
setup:
In [15]: df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))
In [16]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 5 columns):
a float64
b float64
c float64
d float64
e float64
dtypes: float64(5)
memory usage: 381.5 MB
In [17]: df.shape
Out[17]: (10000000, 5)
Timing:
In [37]: %timeit df.loc[random.randint(0, 10**7), 'b']
1000 loops, best of 3: 502 µs per loop
In [38]: %timeit df.iloc[random.randint(0, 10**7), 1]
1000 loops, best of 3: 394 µs per loop
In [39]: %timeit df.at[random.randint(0, 10**7), 'b']
10000 loops, best of 3: 66.8 µs per loop
In [41]: %timeit df.iat[random.randint(0, 10**7), 1]
10000 loops, best of 3: 32.9 µs per loop
In [42]: %timeit df.ix[random.randint(0, 10**7), 'b']
10000 loops, best of 3: 64.8 µs per loop
In [43]: %timeit df.ix[random.randint(0, 10**7), 1]
1000 loops, best of 3: 503 µs per loop
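The same comparison can be reproduced outside IPython with the stdlib timeit module; a minimal sketch (the repeat count is my choice, and the upper bound is 10**7 - 1 because random.randint is inclusive on both ends):

import random
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))

# one random single-cell lookup per candidate indexer
candidates = {
    'loc':  lambda: df.loc[random.randint(0, 10**7 - 1), 'b'],
    'iloc': lambda: df.iloc[random.randint(0, 10**7 - 1), 1],
    'at':   lambda: df.at[random.randint(0, 10**7 - 1), 'b'],
    'iat':  lambda: df.iat[random.randint(0, 10**7 - 1), 1],
}

for name, fn in candidates.items():
    per_call = timeit.timeit(fn, number=10000) / 10000
    print('%-4s %8.1f microseconds per lookup' % (name, per_call * 1e6))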
Results as a bar plot (image in the original post; plotting code below):
Timing data as DF:
In [88]: r
Out[88]:
method timing
0 loc 502.0
1 iloc 394.0
2 at 66.8
3 iat 32.9
4 ix_label 64.8
5 ix_integer 503.0
In [89]: r.to_dict()
Out[89]:
{'method': {0: 'loc',
1: 'iloc',
2: 'at',
3: 'iat',
4: 'ix_label',
5: 'ix_integer'},
'timing': {0: 502.0,
1: 394.0,
2: 66.799999999999997,
3: 32.899999999999999,
4: 64.799999999999997,
5: 503.0}}
Plotting
import seaborn as sns   # the plot was drawn with seaborn

ax = sns.barplot(data=r, x='method', y='timing')
ax.tick_params(labelsize=16)
[ax.annotate(str(round(p.get_height(), 2)), (p.get_x() + 0.2, p.get_height() + 5)) for p in ax.patches]
ax.set_xlabel('indexing method', size=20)
ax.set_ylabel('timing (microseconds)', size=20)
Answer 2:
I always thought at was the fastest, but it is not - ix is faster:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
print (df.at[2, 'B'])
6
print (df.ix[2, 'B'])
6
print (df.loc[2, 'B'])
6
In [77]: %timeit df.at[2, 'B']
10000 loops, best of 3: 44.6 µs per loop
In [78]: %timeit df.ix[2, 'B']
10000 loops, best of 3: 40.7 µs per loop
In [79]: %timeit df.loc[2, 'B']
1000 loops, best of 3: 681 µs per loop
EDIT:
I tried MaxU's df, and the differences are caused by the random.randint function:
df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('ABCDE'))
In [4]: %timeit (df.ix[2, 'B'])
The slowest run took 25.80 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 20.7 µs per loop
In [5]: %timeit (df.ix[random.randint(0, 10**7), 'B'])
The slowest run took 9.42 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 28 µs per loop
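To see how much of each figure is just the index generation, one can time random.randint on its own; a minimal sketch (numbers will vary by machine):

import random
import timeit

# cost of generating the random index alone, in microseconds per call
n = 100000
per_call = timeit.timeit(lambda: random.randint(0, 10**7), number=n) / n
print('random.randint alone: %.1f microseconds per call' % (per_call * 1e6))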
Answer 3:
If you are adding rows inside a loop, consider these performance issues: for roughly the first 1000 to 2000 records "my_df.loc" performs well, but it gradually becomes slower as the number of records in the loop grows.
If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of "iloc" and "append": fill a temp DataFrame with iloc until its size reaches around 1000, then append it to the original DataFrame and empty the temp DataFrame. This can boost your performance around 10 times (a sketch of this batching idea follows below).
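A minimal sketch of that batching idea (the function name, chunk size, and columns are mine; pd.concat stands in for the DataFrame.append mentioned above, since append was removed in pandas 2.0):

import numpy as np
import pandas as pd

def build_in_chunks(rows, columns, chunk_size=1000):
    """Fill a small temp DataFrame with iloc and flush it into the
    result every chunk_size rows, as described above."""
    result = pd.DataFrame(columns=columns)
    temp = pd.DataFrame(index=range(chunk_size), columns=columns)
    filled = 0
    for row in rows:
        temp.iloc[filled] = row              # cheap positional write
        filled += 1
        if filled == chunk_size:             # temp is full: flush it
            result = pd.concat([result, temp], ignore_index=True)
            filled = 0
    if filled:                               # flush the final partial chunk
        result = pd.concat([result, temp.iloc[:filled]], ignore_index=True)
    return result

# usage: 10k rows of 3 random values each, built in chunks of 1000
rows = (np.random.rand(3) for _ in range(10000))
df = build_in_chunks(rows, columns=list('abc'))
print(df.shape)  # (10000, 3)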
Source: https://stackoverflow.com/questions/37757844/pandas-df-locz-x-y-how-to-improve-speed