Question
I am looking to use the replace function efficiently in Python 3. My code does the job, but it is much too slow because I am working with a large dataset, so I will take efficiency over elegance whenever there is a tradeoff. Here is a toy example of what I would like to do:
import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])
   1st  2nd
0    1    2
1    3    4
2    5    6
idxDict= dict()
idxDict[1] = 'a'
idxDict[3] = 'b'
idxDict[5] = 'c'
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
Which gives
  1st  2nd
0   a    2
1   b    4
2   c    6
as I desire, but it takes way too long. What would be the fastest way?
Edit: this is a more focused and clean question than this one, for which the solution is similar.
Answer 1:
Use map to perform a lookup:
In [46]:
df['1st'] = df['1st'].map(idxDict)
df
Out[46]:
  1st  2nd
0   a    2
1   b    4
2   c    6
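Note that any value with no matching key in the dict comes back as NaN. Passing na_action='ignore' does not change that; it only leaves NaN values already present in the column untouched instead of passing them to the mapping.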
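To make the missing-key behaviour concrete, here is a minimal sketch. The Series s and the value 7 are made up for illustration and are not part of the question's data; the point is only to compare what map and replace do with a value that has no entry in the dict.

import pandas as pd

# Hypothetical data: 7 has no entry in the mapping.
s = pd.Series([1, 3, 7])
idxDict = {1: 'a', 3: 'b', 5: 'c'}

# map looks every value up in the dict; values without a key become NaN.
mapped = s.map(idxDict)

# na_action='ignore' only skips NaN values already in the input,
# so the unmapped 7 still becomes NaN here.
mapped_ignore = s.map(idxDict, na_action='ignore')

# replace only touches values that appear as keys and leaves 7 unchanged.
replaced = s.replace(idxDict)

print(mapped.tolist())
print(mapped_ignore.tolist())
print(replaced.tolist())

So if the dict may be incomplete, replace is the safer drop-in, while map needs an explicit fallback for the unmapped values.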
You can also use df['1st'].replace(idxDict), but to answer your question about efficiency:
Timings:
In [69]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)
1000 loops, best of 3: 1.57 ms per loop
1000 loops, best of 3: 1.08 ms per loop
In [70]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
100 loops, best of 3: 3.25 ms per loop
So using map is about 3x faster than the explicit loop here (and roughly 1.5x faster than a single replace call).
On a larger dataset:
In [3]:
df = pd.concat([df]*10000, ignore_index=True)
df.shape
Out[3]:
(30000, 2)
In [4]:
%timeit df['1st'].replace(idxDict)
%timeit df['1st'].map(idxDict)
100 loops, best of 3: 18 ms per loop
100 loops, best of 3: 4.31 ms per loop
In [5]:
%%timeit
for k, v in idxDict.items():
    df['1st'] = df['1st'].replace(k, v)
100 loops, best of 3: 18.2 ms per loop
For the 30K-row df, map is ~4x faster than either replace or the loop, so it scales better than both.
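The %timeit lines above only work inside IPython. As a rough sketch, the same comparison can be run as a plain script with the standard timeit module. The helper names with_map, with_replace and with_loop are introduced here for illustration, and the absolute numbers will depend on your machine and pandas version.

import timeit
import pandas as pd

idxDict = {1: 'a', 3: 'b', 5: 'c'}
base = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['1st', '2nd'])
df = pd.concat([base] * 10000, ignore_index=True)  # 30000 rows, as above

def with_map(frame):
    # Single dict lookup over the whole column.
    return frame['1st'].map(idxDict)

def with_replace(frame):
    # Single replace call with the whole dict.
    return frame['1st'].replace(idxDict)

def with_loop(frame):
    # One replace call per key, as in the question
    # (works on a local copy of the column, not on the frame).
    col = frame['1st']
    for k, v in idxDict.items():
        col = col.replace(k, v)
    return col

funcs = (with_map, with_replace, with_loop)
results = {f.__name__: f(df) for f in funcs}

for f in funcs:
    seconds = timeit.timeit(lambda: f(df), number=10)
    print(f"{f.__name__}: {seconds / 10 * 1000:.2f} ms per call")

None of the helpers modifies df, so every timed call sees the same integer column.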
Answer 2:
While map is indeed faster, replace was updated in pandas 0.19.2 (details here) to improve its speed, making the difference significantly smaller:
In [1]:
import pandas as pd
df = pd.DataFrame([[1,2],[3,4],[5,6]], columns = ['1st', '2nd'])
df = pd.concat([df]*10000, ignore_index=True)
df.shape
Out [1]:
(30000, 2)
In [2]:
idxDict = {1:'a', 3:"b", 5:"c"}
%timeit df['1st'].replace(idxDict, inplace=True)
%timeit df['1st'].update(df['1st'].map(idxDict))
Out [2]:
100 loops, best of 3: 12.8 ms per loop
100 loops, best of 3: 7.95 ms per loop
Additionally, I modified EdChum's map code to go through update, which, while slower, prevents values missing from an incomplete mapping from being changed to NaN.
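As an alternative to update that is not taken from either answer, the same "keep unmapped values" effect can be sketched with map followed by fillna. The incomplete dict and the three-row Series below are made up for illustration; the cast to object is an assumption made so that strings and integers can share one column.

import pandas as pd

# Hypothetical, deliberately incomplete mapping: no entry for 5.
idxDict = {1: 'a', 3: 'b'}
s = pd.Series([1, 3, 5], name='1st')

# map yields NaN for the unmapped 5.
mapped = s.map(idxDict)

# Cast to object so strings and integers can coexist,
# then fall back to the original value wherever map found no key.
result = mapped.astype(object).fillna(s)

print(result.tolist())

Unlike update, this returns a new Series rather than modifying the column in place, so assign it back to df['1st'] if that is what you want.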
Source: https://stackoverflow.com/questions/42012339/using-replace-efficiently-in-pandas