GroupBy pandas DataFrame and select most common value

梦谈多话 2020-11-22 07:59

I have a data frame with three string columns. I know that only one value in the third column is valid for each combination of the first two. To clean the data I have to

10 Answers
  •  不知归路
    2020-11-22 08:40

    Pandas >= 0.16

    pd.Series.mode is available!

    Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

    source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
    
    Country  City            
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object
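
    The answer never shows how source is built. As a minimal, self-contained sketch, the frame below is reconstructed from the rows printed elsewhere in this answer (an assumption; the asker's real data may differ), followed by the same groupby call:

```python
import pandas as pd

# Assumed sample data, reconstructed from the rows shown in this answer.
source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

# One most-common 'Short name' per (Country, City) pair.
most_common = source.groupby(['Country', 'City'])['Short name'].agg(pd.Series.mode)
print(most_common)
```

    Selecting the 'Short name' column before calling agg keeps the result a Series indexed by the two grouping columns.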
    

    If this is needed as a DataFrame, use

    source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()
    
                             Short name
    Country City                       
    Russia  Sankt-Petersburg        Spb
    USA     New-York                 NY
    

    The useful thing about Series.mode is that it always returns a Series, which makes it compatible with agg and apply, especially when reconstructing the groupby output. In the timings below it is also about twice as fast as the value_counts approach.

    # Accepted answer.
    %timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
    # Proposed in this post.
    %timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
    
    5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
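
    To illustrate the point that Series.mode always returns a Series, here is a small sketch on two hand-made inputs (the values are illustrative only):

```python
import pandas as pd

single = pd.Series(['NY', 'New', 'NY']).mode()   # one clear winner
tied = pd.Series(['NY', 'New']).mode()           # two equally common values

# Both results are Series; only their lengths differ.
print(type(single).__name__, len(single))
print(type(tied).__name__, len(tied))
```

    Because the return type does not depend on the number of modes, the same function can be handed to agg or apply without special-casing ties.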
    

    Dealing with Multiple Modes

    Series.mode also does a good job when there are multiple modes:

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    source2 = pd.concat(
        [source, pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
        ignore_index=True)
    
    # Now `source2` has two modes for the 
    # ("USA", "New-York") group, they are "NY" and "New".
    source2
    
      Country              City Short name
    0     USA          New-York         NY
    1     USA          New-York        New
    2  Russia  Sankt-Petersburg        Spb
    3     USA          New-York         NY
    4     USA          New-York        New
    

    source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
    
    Country  City            
    Russia   Sankt-Petersburg          Spb
    USA      New-York            [NY, New]
    Name: Short name, dtype: object
    

    Or, if you want a separate row for each mode, you can use GroupBy.apply:

    source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)
    
    Country  City               
    Russia   Sankt-Petersburg  0    Spb
    USA      New-York          0     NY
                               1    New
    Name: Short name, dtype: object
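
    If the one-row-per-mode result is needed as a flat table, one possible approach is to drop the extra index level that apply adds and reset the index. This is a sketch under the assumption that the data matches the source2 frame printed in this answer:

```python
import pandas as pd

# Assumed sample data matching the source2 frame printed in this answer.
source2 = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY', 'New'],
})

modes = source2.groupby(['Country', 'City'])['Short name'].apply(pd.Series.mode)

# Drop the innermost index level (the position of each mode within its group),
# then move Country and City back into ordinary columns.
tidy = modes.droplevel(-1).reset_index()
print(tidy)
```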
    

    If you don't care which mode is returned, as long as it is one of them, you need a lambda that calls mode and extracts the first result.

    source2.groupby(['Country','City'])['Short name'].agg(
        lambda x: pd.Series.mode(x)[0])
    
    Country  City            
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object
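
    One caveat with taking the first result: Series.mode ignores missing values by default, so a group that contains only missing values yields an empty Series and indexing it with [0] fails. A hedged sketch of a small helper that guards against this (first_mode is a hypothetical name, and the 'France' row is made-up data added to trigger the empty case):

```python
import numpy as np
import pandas as pd

def first_mode(values: pd.Series):
    """Return one mode of `values`, or NaN if the group has no non-null value."""
    modes = values.mode()  # dropna=True by default, so missing values are ignored
    return modes.iat[0] if not modes.empty else np.nan

# Assumed sample data; the last row has no usable 'Short name'.
df = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'USA', 'Russia', 'France'],
    'City': ['New-York', 'New-York', 'New-York', 'New-York',
             'Sankt-Petersburg', 'Paris'],
    'Short name': ['NY', 'New', 'NY', 'New', 'Spb', None],
})

picked = df.groupby(['Country', 'City'])['Short name'].agg(first_mode)
print(picked)
```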
    

    Alternatives to (not) consider

    You can also use statistics.mode from Python's standard library, but...

    import statistics
    source.groupby(['Country','City'])['Short name'].apply(statistics.mode)
    
    Country  City            
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object
    

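    ...it does not work well when having to deal with multiple modes. On Python versions before 3.8, a StatisticsError is raised. This was mentioned in the docs of those versions:

    If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

    Since Python 3.8, statistics.mode no longer raises for ties; it returns the first mode encountered, so the other modes are silently dropped. On an older interpreter you can see the error for yourself...

    statistics.mode([1, 2])
    # Python < 3.8:
    # ---------------------------------------------------------------------------
    # StatisticsError                           Traceback (most recent call last)
    # ...
    # StatisticsError: no unique mode; found 2 equally common values
    # Python >= 3.8: returns 1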
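
    If you are on Python 3.8 or newer, the standard library also provides statistics.multimode, which returns every mode as a list in the order first encountered. A minimal sketch that collects the modes per group by iterating over the groupby object (the sample frame is an assumption mirroring source2 from this answer):

```python
import statistics

import pandas as pd

# Assumed sample data with a tie in the ("USA", "New-York") group.
source2 = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY', 'New'],
})

# Iterate over the groups and collect every mode of each one as a list.
all_modes = {
    key: statistics.multimode(group)
    for key, group in source2.groupby(['Country', 'City'])['Short name']
}
print(all_modes)
```

    This keeps ties visible, at the cost of a plain dict result instead of a pandas object.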
    
