Question
I am converting some of our older code from R to Python. In the process, I have found pandas to be a bit slower than R. I'm interested in knowing whether there is anything wrong I am doing.
R code (taking around 2 ms on my system):

library(dplyr)

df = data.frame(col_a = sample(letters[1:3], 20, T),
                col_b = sample(1:2, 20, T),
                col_c = sample(letters[1:2], 20, T),
                col_d = sample(c(4, 2), 20, T))

microbenchmark::microbenchmark(
  a = df %>%
    group_by(col_a, col_b) %>%
    summarise(
      a = sum(col_c == 'a'),
      b = sum(col_c == 'b'),
      c = a / b
    ) %>%
    ungroup()
)
pandas code (taking around 10 ms on my system):

import numpy as np
import pandas as pd

N = 20
df = pd.DataFrame({
    'col_a': np.random.choice(['a', 'b', 'c'], N),
    'col_b': np.random.choice([1, 2], N),
    'col_c': np.random.choice(['a', 'b'], N),
    'col_d': np.random.choice(['4', '2'], N),
})

%%timeit
df1 = df.groupby(['col_a', 'col_b']).agg({
    'col_c': [
        ('a', lambda x: (x == 'a').sum()),
        ('b', lambda x: (x == 'b').sum())
    ]}).reset_index()
df1['rat'] = df1.col_c.a / df1.col_c.b
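As a side note (not part of the original question), the same counts can be written with pandas named aggregation, which produces flat output column names and avoids the MultiIndex entirely; the fixed seed below is an illustrative assumption for reproducibility:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, illustrative only
N = 20
df = pd.DataFrame({
    'col_a': rng.choice(['a', 'b', 'c'], N),
    'col_b': rng.choice([1, 2], N),
    'col_c': rng.choice(['a', 'b'], N),
})

# Named aggregation: each keyword argument becomes a flat output
# column, so no MultiIndex handling is needed before the ratio.
out = (df.groupby(['col_a', 'col_b'])
         .agg(a=('col_c', lambda x: (x == 'a').sum()),
              b=('col_c', lambda x: (x == 'b').sum()))
         .reset_index())
out['rat'] = out['a'] / out['b']
```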
Answer 1:
This isn't a technical answer, but it's worth noting that there are a lot of different ways to accomplish this operation in Pandas, and some are faster than others. For example, the Pandas code below gets the values you're looking for (albeit with some ugly MultiIndex columns) in about 5ms:
df.groupby(['col_a', 'col_b', 'col_c'])\
  .count()\
  .unstack()\
  .assign(rat = lambda x: x.col_d.a / x.col_d.b)
4.96 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Aside from any under-the-hood speed-ups, I think the main speed advantage of tidyverse syntax over pandas here is that summarise() makes each new variable immediately available within the same call, which avoids having to spread the counts and then compute rat afterwards.
If there's an analog to that in pandas, I don't know it. The closest thing is either pipe() or the use of a lambda within assign(). Each new function call in the chain takes time to execute, so pandas ends up being slower.
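To make that comparison concrete, here is a sketch of the closest pandas analog, chaining size(), unstack(), and assign() with a lambda; using size() instead of count() (so no extra column like col_d is needed) and the fixed seed are assumptions relative to the snippets above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, illustrative only
N = 20
df = pd.DataFrame({
    'col_a': rng.choice(['a', 'b', 'c'], N),
    'col_b': rng.choice([1, 2], N),
    'col_c': rng.choice(['a', 'b'], N),
})

# size() counts rows per (col_a, col_b, col_c) group; unstack() spreads
# col_c into columns 'a' and 'b'; assign() then refers to those new
# columns via a lambda, much like summarise() referring to a and b.
res = (df.groupby(['col_a', 'col_b', 'col_c'])
         .size()
         .unstack(fill_value=0)
         .assign(rat=lambda x: x['a'] / x['b'])
         .reset_index())
```

Each step in the chain is a separate method call with its own overhead, which is exactly the cost the answer attributes to pandas here.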
Source: https://stackoverflow.com/questions/56419400/why-is-group-by-filter-summarise-faster-in-r-than-pandas