Why is group_by -> filter -> summarise faster in R than pandas?

五迷三道 posted on 2019-12-11 05:13:21

Question


I am converting some of our older code from R to Python. In the process, I have found pandas to be a bit slower than R. I would like to know whether I am doing anything wrong.

R code (takes around 2 ms on my system):

library(dplyr)  # provides %>%, group_by(), summarise(), ungroup()

df = data.frame(col_a = sample(letters[1:3], 20, TRUE),
                col_b = sample(1:2, 20, TRUE),
                col_c = sample(letters[1:2], 20, TRUE),
                col_d = sample(c(4, 2), 20, TRUE)
                )

microbenchmark::microbenchmark(
a = df %>% 
  group_by(col_a, col_b) %>% 
  summarise(
    a = sum(col_c == 'a'),
    b = sum(col_c == 'b'),
    c = a/b
  ) %>% 
  ungroup()
)

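pandas (takes around 10 ms on my system):

import numpy as np
import pandas as pd

N = 20  # not defined in the original post; 20 matches the R example

df = pd.DataFrame({
    'col_a': np.random.choice(['a','b','c'],N),
    'col_b': np.random.choice([1,2],N),
    'col_c': np.random.choice(['a', 'b'],N),
    'col_d': np.random.choice(['4', '2'],N),
})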
%%timeit 
df1 = df.groupby(['col_a', 'col_b']).agg({
    'col_c':[
        ('a',lambda x: (x=='a').sum()),
        ('b',lambda x: (x=='b').sum())
    ]}).reset_index()
df1['rat'] = df1.col_c.a/df1.col_c.b

Answer 1:


This isn't a technical answer, but it's worth noting that there are many ways to accomplish this operation in pandas, and some are faster than others. For example, the pandas code below gets the values you're looking for (albeit with some ugly MultiIndex columns) in about 5 ms:

df.groupby(['col_a', 'col_b', 'col_c'])\
  .count()\
  .unstack()\
  .assign(rat = lambda x: x.col_d.a/x.col_d.b)

4.96 ms ± 169 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
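
Outside a notebook, the same comparison can be sketched with the standard-library timeit module. This is only a rough sketch under a few assumptions: N = 20 rows (as in the R example), a fixed random seed, and two hypothetical helper names, lambda_agg and count_unstack, that are not part of the original post. The lambda version is written with named aggregation, which should be equivalent to the question's code but is not copied from it. Absolute timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 20  # assumed row count, matching the R example
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col_a": rng.choice(["a", "b", "c"], N),
    "col_b": rng.choice([1, 2], N),
    "col_c": rng.choice(["a", "b"], N),
    "col_d": rng.choice(["4", "2"], N),
})


def lambda_agg(frame):
    # Per-group Python lambdas, as in the question (named-aggregation form).
    out = (
        frame.groupby(["col_a", "col_b"])["col_c"]
        .agg(a=lambda x: (x == "a").sum(), b=lambda x: (x == "b").sum())
        .reset_index()
    )
    out["rat"] = out["a"] / out["b"]
    return out


def count_unstack(frame):
    # The answer's approach: count rows per (col_a, col_b, col_c), then spread col_c.
    return (
        frame.groupby(["col_a", "col_b", "col_c"])
        .count()
        .unstack()
        .assign(rat=lambda x: x.col_d.a / x.col_d.b)
    )


for fn in (lambda_agg, count_unstack):
    # Take the best of 3 repeats of 20 calls each and report the per-call time.
    best = min(timeit.repeat(lambda: fn(df), number=20, repeat=3)) / 20
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Note that the two helpers differ on empty combinations: the lambda version reports a count of 0, while unstack() leaves NaN.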

Aside from any under-the-hood speedups, I think the main speed advantage of tidyverse syntax over pandas here is that summarise() makes each new variable immediately available within the same call, which avoids having to spread the counts and then compute rat.

If there's an analog to that in pandas, I don't know of it. The closest things are pipe() and a lambda inside assign(). Each additional function call in the chain takes time to execute, so pandas ends up being slower.
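
To illustrate the lambda-within-assign() pattern mentioned above, here is a minimal sketch of one possible chain. It assumes N = 20 rows and a fixed seed, and it swaps in a different technique from either post: the indicator columns are built once up front so that the grouped step can use the built-in sum instead of per-group Python lambdas. The trailing assign() then refers to the freshly computed a and b, loosely mirroring how summarise() exposes new variables. It has not been benchmarked here.

```python
import numpy as np
import pandas as pd

N = 20  # assumed row count, matching the R example
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col_a": rng.choice(["a", "b", "c"], N),
    "col_b": rng.choice([1, 2], N),
    "col_c": rng.choice(["a", "b"], N),
})

out = (
    # 0/1 indicator columns for each value of col_c, computed in one vectorised pass.
    df.assign(
        a=df["col_c"].eq("a").astype("int64"),
        b=df["col_c"].eq("b").astype("int64"),
    )
    # Built-in grouped sum turns the indicators into per-group counts.
    .groupby(["col_a", "col_b"], as_index=False)[["a", "b"]]
    .sum()
    # The lambda receives the aggregated frame, so a and b are already available.
    .assign(rat=lambda d: d["a"] / d["b"])
)
print(out)
```

The result has flat columns (col_a, col_b, a, b, rat) rather than a MultiIndex, and groups where b is 0 yield inf or NaN in rat instead of raising an error.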



Source: https://stackoverflow.com/questions/56419400/why-is-group-by-filter-summarise-faster-in-r-than-pandas
