Finding max occurrence of a column's value, after group-by on another column

Submitted by 老子叫甜甜 on 2020-05-27 06:45:07

Question


I have a pandas DataFrame:

         id               city
000.tushar@gmail.com    Bangalore
00078r@gmail.com        Mumbai
0007ayan@gmail.com      Jamshedpur
0007ayan@gmail.com      Jamshedpur
000.tushar@gmail.com    Bangalore
00078r@gmail.com        Mumbai
00078r@gmail.com        Vijayawada
00078r@gmail.com        Vijayawada
00078r@gmail.com        Vijayawada

For each id, I want to find the most frequently occurring city name, so that for a given id I can say that this is their favorite city:

         id             city
000.tushar@gmail.com   Bangalore
00078r@gmail.com       Vijayawada
0007ayan@gmail.com     Jamshedpur

Grouping by id and city and counting the rows gives:

         id                   city       count
0  000.tushar@gmail.com       Bangalore    2
1      00078r@gmail.com        Mumbai      2
2      00078r@gmail.com      Vijayawada    3
3    0007ayan@gmail.com      Jamshedpur    2
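
The question does not show the call that produced this table. A minimal sketch of one way to get a table of this shape, assuming the frame is named df and has the columns id and city, is to count rows per (id, city) pair with size() and name the new column count:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [
        "000.tushar@gmail.com", "00078r@gmail.com", "0007ayan@gmail.com",
        "0007ayan@gmail.com", "000.tushar@gmail.com", "00078r@gmail.com",
        "00078r@gmail.com", "00078r@gmail.com", "00078r@gmail.com",
    ],
    "city": [
        "Bangalore", "Mumbai", "Jamshedpur",
        "Jamshedpur", "Bangalore", "Mumbai",
        "Vijayawada", "Vijayawada", "Vijayawada",
    ],
})

# size() counts the rows in each (id, city) group; reset_index turns the
# MultiIndex back into ordinary columns and names the counts "count".
counts = df.groupby(["id", "city"]).size().reset_index(name="count")
print(counts)
```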

How should I proceed from here? I believe some groupby followed by apply will do this, but I don't know exactly what will do the trick. Please suggest.

If an id has the same count for two or three cities, I am OK with returning any one of those cities.


Answer 1:


You can use a double groupby with size and idxmax. Because the counts are indexed by a MultiIndex, idxmax returns (id, city) tuples, so use apply to extract the city:

df = (df.groupby(['id','city']).size().groupby(level=0).idxmax()
        .apply(lambda x: x[1]).reset_index(name='city'))
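
To make the chain above easier to follow, here is a sketch that splits it into its three steps and prints nothing until the end. The variable names counts, top_labels and result are illustrative and not part of the original answer; the data is the sample from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [
        "000.tushar@gmail.com", "00078r@gmail.com", "0007ayan@gmail.com",
        "0007ayan@gmail.com", "000.tushar@gmail.com", "00078r@gmail.com",
        "00078r@gmail.com", "00078r@gmail.com", "00078r@gmail.com",
    ],
    "city": [
        "Bangalore", "Mumbai", "Jamshedpur",
        "Jamshedpur", "Bangalore", "Mumbai",
        "Vijayawada", "Vijayawada", "Vijayawada",
    ],
})

# Step 1: count rows per (id, city) pair. The result is a Series whose
# index is a MultiIndex with levels id and city.
counts = df.groupby(["id", "city"]).size()

# Step 2: regroup by the first index level (id) and ask for the index
# label of the largest count. Each label is an (id, city) tuple.
top_labels = counts.groupby(level=0).idxmax()

# Step 3: keep only the city from each tuple and return a flat frame.
result = top_labels.apply(lambda label: label[1]).reset_index(name="city")
print(result)
```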

Other solutions:

s = df.groupby(['id','city']).size()
df = s.loc[s.groupby(level=0).idxmax()].reset_index().drop(0,axis=1)

Or:

df = df.groupby(['id'])['city'].apply(lambda x: x.value_counts().index[0]).reset_index()

print (df)
                     id        city
0  000.tushar@gmail.com   Bangalore
1      00078r@gmail.com  Vijayawada
2    0007ayan@gmail.com  Jamshedpur
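
A further variant, not taken from the answer above, is Series.mode, which returns the most frequent value or values in sorted order. The sketch below assumes the same id and city columns; the id tie@example.com and its two rows are hypothetical, added only to show what happens on a tie.

```python
import pandas as pd

# Sample data from the question, plus one hypothetical id
# ("tie@example.com") whose two cities occur equally often.
df = pd.DataFrame({
    "id": [
        "000.tushar@gmail.com", "00078r@gmail.com", "0007ayan@gmail.com",
        "0007ayan@gmail.com", "000.tushar@gmail.com", "00078r@gmail.com",
        "00078r@gmail.com", "00078r@gmail.com", "00078r@gmail.com",
        "tie@example.com", "tie@example.com",
    ],
    "city": [
        "Bangalore", "Mumbai", "Jamshedpur",
        "Jamshedpur", "Bangalore", "Mumbai",
        "Vijayawada", "Vijayawada", "Vijayawada",
        "Pune", "Delhi",
    ],
})

# mode() returns all most-frequent values, sorted; .iat[0] keeps the first,
# so a tie is resolved in favor of the alphabetically smallest city.
top = (
    df.groupby("id")["city"]
      .agg(lambda s: s.mode().iat[0])
      .reset_index()
)
print(top)
```

Since the question accepts any of the tied cities, this tie-breaking rule is only a convenience: it makes the result reproducible from run to run.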



Answer 2:


A straightforward approach is groupby('id').apply(your_custom_function), where your_custom_function counts the 'city' values within each group and returns the most frequent one (on a tie, as you mentioned, any one of the tied cities is acceptable). We don't even need .agg('city').

import pandas as pd

def get_top_city(g):
    return g['city'].value_counts().idxmax()    

df = pd.DataFrame.from_records(
         [('000.tushar@gmail.com', 'Bangalore'), ('00078r@gmail.com',     'Mumbai'),
         ('0007ayan@gmail.com',   'Jamshedpur'),('0007ayan@gmail.com',   'Jamshedpur'),
         ('000.tushar@gmail.com', 'Bangalore'), ('00078r@gmail.com',     'Mumbai'),
         ('00078r@gmail.com',     'Vijayawada'),('00078r@gmail.com',     'Vijayawada'),
         ('00078r@gmail.com',     'Vijayawada')],
         columns=['id','city'],
         index=None
     )

topdf = df.groupby('id').apply(get_top_city)
print(topdf)

id
000.tushar@gmail.com     Bangalore
00078r@gmail.com        Vijayawada
0007ayan@gmail.com      Jamshedpur

# or list(topdf.items()) if you want a list of (id, city) tuples
# (iteritems() does the same on older pandas; it was removed in pandas 2.0)

[('000.tushar@gmail.com', 'Bangalore'), ('00078r@gmail.com', 'Vijayawada'), ('0007ayan@gmail.com', 'Jamshedpur')]
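
The answer mentions returning multiple max values but only shows the single-value case. A minimal sketch of the multi-value case follows; get_top_cities is a hypothetical helper, and the id tie@example.com with its two rows is made-up data added to produce a tie.

```python
import pandas as pd

def get_top_cities(cities):
    """Return every city that shares the highest count, sorted by name."""
    counts = cities.value_counts()
    tied = counts[counts == counts.max()]
    return sorted(tied.index)

# Sample data from the question, plus one hypothetical id with a tie.
df = pd.DataFrame.from_records(
    [
        ("000.tushar@gmail.com", "Bangalore"), ("00078r@gmail.com", "Mumbai"),
        ("0007ayan@gmail.com", "Jamshedpur"), ("0007ayan@gmail.com", "Jamshedpur"),
        ("000.tushar@gmail.com", "Bangalore"), ("00078r@gmail.com", "Mumbai"),
        ("00078r@gmail.com", "Vijayawada"), ("00078r@gmail.com", "Vijayawada"),
        ("00078r@gmail.com", "Vijayawada"),
        ("tie@example.com", "Pune"), ("tie@example.com", "Delhi"),
    ],
    columns=["id", "city"],
)

# Selecting the 'city' column first means the helper receives a Series
# per id; the result is a Series of lists indexed by id.
top_cities = df.groupby("id")["city"].apply(get_top_cities)
print(top_cities)
```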


Source: https://stackoverflow.com/questions/36174624/finding-max-occurrence-of-a-columns-value-after-group-by-on-another-column
