Faster way to rank rows in subgroups in pandas dataframe

Asked by 误落风尘 on 2020-12-07 22:56 · 2 answers · 1565 views

I have a pandas DataFrame that is composed of different subgroups.

    df = pd.DataFrame({
        'id': [1, 2, 3, 4, 5, 6, 7, 8],
        'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
        # values chosen to be consistent with the rank output in the answers below
        'value': [1, 4, 2, 3, 2, 3, 4, 1]})

I want to rank each row by 'value' within its 'group'.

2 Answers
  • 2020-12-07 23:48

    rank is cythonized, so it should be very fast, and you can pass the same options as df.rank(); see the docs for rank. As you can see there, tie-breaks can be handled in one of five different ways via the method argument.

    It's also possible you simply want the .cumcount() of the group.

    In [12]: df.groupby('group')['value'].rank(ascending=False)
    Out[12]: 
    0    4
    1    1
    2    3
    3    2
    4    3
    5    2
    6    1
    7    4
    dtype: float64
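
    For example, a minimal sketch of the method argument and the .cumcount() alternative, assuming a 'value' column consistent with the output above:

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
        'value': [1, 4, 2, 3, 2, 3, 4, 1]})  # assumed example data

    # tie-break via method=; 'dense' gives consecutive ranks with no gaps
    dense = df.groupby('group')['value'].rank(method='dense', ascending=False)

    # cumcount() gives a 0-based position within each group rather than a 1-based rank
    pos = (df.sort_values('value', ascending=False)
             .groupby('group')
             .cumcount()
             .sort_index())
    ```

    With no ties in a group, `pos + 1` matches `rank(ascending=False)`.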
    
  • 2020-12-07 23:53

    Working with a big DataFrame (13 million lines), the rank method with groupby maxed out my 8 GB of RAM and took a really long time. I found a workaround that is less greedy in memory, which I put here just in case:

    df = df.sort_values(['group', 'value'])      # reassign: sort_values returns a copy
    sizes = df.groupby('group').size()           # rows per group, in group order
    rank = [i for n in sizes for i in range(n)]  # 0..n-1 within each group
    df['rank'] = rank                            # rows are contiguous per group, so this lines up
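
    A runnable sketch of this workaround on the small frame from the question (column names assumed from above; note that sort_values must be reassigned since it returns a copy, and sorting by group first keeps the flattened rank list aligned with the rows):

    ```python
    import pandas as pd

    df = pd.DataFrame({
        'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
        'value': [1, 4, 2, 3, 2, 3, 4, 1]})  # assumed example data

    df = df.sort_values(['group', 'value'])            # rows now contiguous per group
    sizes = df.groupby('group').size()                 # rows per group, in group order
    df['rank'] = [i for n in sizes for i in range(n)]  # 0-based rank within each group
    df = df.sort_index()                               # restore the original row order
    ```

    This avoids `groupby(...).rank()` entirely: it needs only one sort plus a size count, which is why it stays within memory on large frames.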
    