GroupBy pandas DataFrame and select most common value

南楼画角 提交于 2019-11-25 23:31:31

问题


I have a data frame with three string columns. I know that the only one value in the 3rd column is valid for every combination of the first two. To clean the data I have to group by data frame by first two columns and select most common value of the third column for each combination.

My code:

import pandas as pd
from scipy import stats

source = pd.DataFrame({\'Country\' : [\'USA\', \'USA\', \'Russia\',\'USA\'], 
                  \'City\' : [\'New-York\', \'New-York\', \'Sankt-Petersburg\', \'New-York\'],
                  \'Short name\' : [\'NY\',\'New\',\'Spb\',\'NY\']})

print source.groupby([\'Country\',\'City\']).agg(lambda x: stats.mode(x[\'Short name\'])[0])

Last line of code doesn\'t work, it says \"Key error \'Short name\'\" and if I try to group only by City, then I got an AssertionError. What can I do fix it?


回答1:


You can use value_counts() to get a count series, and get the first row:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])



回答2:


(updated for)

Pandas >= 0.16

pd.Series.mode is available.

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

If this is needed as a DataFrame, use

source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.

# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Series.mode also does a good job when there are multiple modes:

source2 = source.append(
    pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
    ignore_index=True)

# Now `source2` has two modes for the 
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New

source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City            
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City               
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object

If you don't care which mode is returned as long as it's either one of them, then you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

You can also use statistics.mode from python, but...

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City            
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object

...it does not work well when having to deal with multiple modes; a StatisticsError is raised. This is mentioned in the docs:

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

But you can see for yourself...

statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values



回答3:


For agg, the lambba function gets a Series, which does not have a 'Short name' attribute.

stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

With these two simple changements:

source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])

returns

                         Short name
Country City                       
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY



回答4:


A little late to the game here, but I was running into some performance issues with HYRY's solution, so I had to come up with another one.

It works by finding the frequency of each key-value, and then, for each key, only keeping the value that appears with it most often.

There's also an additional solution that supports multiple modes.

On a scale test that's representative of the data I'm working with, this reduced runtime from 37.4s to 0.5s!

Here's the code for the solution, some example usage, and the scale test:

import numpy as np
import pandas as pd
import random
import time

test_input = pd.DataFrame(columns=[ 'key',          'value'],
                          data=  [[ 1,              'A'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              'B'    ],
                                  [ 1,              np.nan ],
                                  [ 2,              np.nan ],
                                  [ 3,              'C'    ],
                                  [ 3,              'C'    ],
                                  [ 3,              'D'    ],
                                  [ 3,              'D'    ]])

def mode(df, key_cols, value_col, count_col):
    '''                                                                                                                                                                                                                                                                                                                                                              
    Pandas does not provide a `mode` aggregation function                                                                                                                                                                                                                                                                                                            
    for its `GroupBy` objects. This function is meant to fill                                                                                                                                                                                                                                                                                                        
    that gap, though the semantics are not exactly the same.                                                                                                                                                                                                                                                                                                         

    The input is a DataFrame with the columns `key_cols`                                                                                                                                                                                                                                                                                                             
    that you would like to group on, and the column                                                                                                                                                                                                                                                                                                                  
    `value_col` for which you would like to obtain the mode.                                                                                                                                                                                                                                                                                                         

    The output is a DataFrame with a record per group that has at least one mode                                                                                                                                                                                                                                                                                     
    (null values are not counted). The `key_cols` are included as columns, `value_col`                                                                                                                                                                                                                                                                               
    contains a mode (ties are broken arbitrarily and deterministically) for each                                                                                                                                                                                                                                                                                     
    group, and `count_col` indicates how many times each mode appeared in its group.                                                                                                                                                                                                                                                                                 
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

def modes(df, key_cols, value_col, count_col):
    '''                                                                                                                                                                                                                                                                                                                                                              
    Pandas does not provide a `mode` aggregation function                                                                                                                                                                                                                                                                                                            
    for its `GroupBy` objects. This function is meant to fill                                                                                                                                                                                                                                                                                                        
    that gap, though the semantics are not exactly the same.                                                                                                                                                                                                                                                                                                         

    The input is a DataFrame with the columns `key_cols`                                                                                                                                                                                                                                                                                                             
    that you would like to group on, and the column                                                                                                                                                                                                                                                                                                                  
    `value_col` for which you would like to obtain the modes.                                                                                                                                                                                                                                                                                                        

    The output is a DataFrame with a record per group that has at least                                                                                                                                                                                                                                                                                              
    one mode (null values are not counted). The `key_cols` are included as                                                                                                                                                                                                                                                                                           
    columns, `value_col` contains lists indicating the modes for each group,                                                                                                                                                                                                                                                                                         
    and `count_col` indicates how many times each mode appeared in its group.                                                                                                                                                                                                                                                                                        
    '''
    return df.groupby(key_cols + [value_col]).size() \
             .to_frame(count_col).reset_index() \
             .groupby(key_cols + [count_col])[value_col].unique() \
             .to_frame().reset_index() \
             .sort_values(count_col, ascending=False) \
             .drop_duplicates(subset=key_cols)

print test_input
print mode(test_input, ['key'], 'value', 'count')
print modes(test_input, ['key'], 'value', 'count')

scale_test_data = [[random.randint(1, 100000),
                    str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                data=scale_test_data)

start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print time.time() - start

start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print time.time() - start

start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print time.time() - start

Running this code will print something like:

   key value
0    1     A
1    1     B
2    1     B
3    1   NaN
4    2   NaN
5    3     C
6    3     C
7    3     D
8    3     D
   key value  count
1    1     B      2
2    3     C      2
   key  count   value
1    1      2     [B]
2    3      2  [C, D]
0.489614009857
9.19386196136
37.4375009537

Hope this helps!




回答5:


Formally, the correct answer is the @eumiro Solution. The problem of @HYRY solution is that when you have a sequence of numbers like [1,2,3,4] the solution is wrong, i. e., you don't have the mode. Example:

import pandas as pd
df = pd.DataFrame({'client' : ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E','E','E','A'], 'total' : [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 'bla':[10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]})

If you compute like @HYRY you obtain:

df.groupby(['socio']).agg(lambda x: x.value_counts().index[0])

and you obtain:

Which is clearly wrong (see the A value that should be 1 and not 4) because it can't handle with unique values.

Thus, the other solution is correct:

import scipy.stats
df3.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0])

getting:




回答6:


The two top answers here suggest:

df.groupby(cols).agg(lambda x:x.value_counts().index[0])

or, preferably

df.groupby(cols).agg(pd.Series.mode)

However both of these fail in simple edge cases, as demonstrated here:

df = pd.DataFrame({
    'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
    'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.NaN]
})

The first:

df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])

yields IndexError (because of the empty Series returned by group C). The second:

df.groupby(['client_id', 'date']).agg(pd.Series.mode)

returns ValueError: Function does not reduce, since the first group returns a list of two (since there are two modes). (As documented here, if the first group returned a single mode this would work!)

Two possible solutions for this case are:

import scipy
x.groupby(['client_id', 'date']).agg(lambda x: scipy.stats.mode(x)[0])

And the solution given to me by cs95 in the comments here:

def foo(x): 
    m = pd.Series.mode(x); 
    return m.values[0] if not m.empty else np.nan
df.groupby(['client_id', 'date']).agg(foo)

However, all of these are slow and not suited for large datasets. A solution I ended up using which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33's answer (which should be higher):

def get_mode_per_column(dataframe, group_cols, col):
    return (dataframe.fillna(-1)  # NaN placeholder to keep group 
            .groupby(group_cols + [col])
            .size()
            .to_frame('count')
            .reset_index()
            .sort_values('count', ascending=False)
            .drop_duplicates(subset=group_cols)
            .drop(columns=['count'])
            .sort_values(group_cols)
            .replace(-1, np.NaN))  # restore NaNs

group_cols = ['client_id', 'date']    
non_grp_cols = list(set(df).difference(group_cols))
output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
for col in non_grp_cols[1:]:
    output_df[col] = get_mode_per_column(df, group_cols, col)[col].values

Essentially, the method works on one col at a time and outputs a df, so instead of concat, which is intensive, you treat the first as a df, and then iteratively add the output array (values.flatten()) as a column in the df.




回答7:


The problem here is the performance, if you have a lot of rows it will be a problem.

If it is your case, please try with this:

import pandas as pd

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])

source.groupby(['Country','City']).Short_name.value_counts().groupby['Country','City']).first()



回答8:


If you want another approach for solving it that is does not depend on value_counts or scipy.stats you can use the Counter collection

from collections import Counter
get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]

Which can be applied to the above example like this

src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
              'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
              'Short_name' : ['NY','New','Spb','NY']})

src.groupby(['Country','City']).agg(get_most_common)



回答9:


A slightly clumsier but faster approach for larger datasets involves getting the counts for a column of interest, sorting the counts highest to lowest, and then de-duplicating on a subset to only retain the largest cases. The code example is following:

>>> import pandas as pd
>>> source = pd.DataFrame(
        {
            'Country': ['USA', 'USA', 'Russia', 'USA'], 
            'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
            'Short name': ['NY', 'New', 'Spb', 'NY']
        }
    )
>>> grouped_df = source\
        .groupby(['Country','City','Short name'])[['Short name']]\
        .count()\
        .rename(columns={'Short name':'count'})\
        .reset_index()\
        .sort_values('count', ascending=False)\
        .drop_duplicates(subset=['Country', 'City'])\
        .drop('count', axis=1)
>>> print(grouped_df)
  Country              City Short name
1     USA          New-York         NY
0  Russia  Sankt-Petersburg        Spb


来源:https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!