How to apply “first” and “last” functions to columns while using group by in pandas?

太阳男子 2020-12-08 06:53

I have a data frame and I would like to group it by a particular column (or, in other words, by values from a particular column). I can do it in the following way: gro

4 Answers
  • 2020-12-08 07:09

    Instead of passing first or last as functions, pass their names as strings to the agg method. For example, in the OP's case:

    grouped = df.groupby(['ColumnName'])
    # keyword (named) aggregation; the older dict form {'result1': np.sum, ...} raises an error since pandas 1.0
    grouped['D'].agg(result1='sum', result2='mean')
    
    # use the string names for first and last in the same way
    grouped['D'].agg(result1='first', result2='last')
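
A minimal, self-contained sketch of the same idea, assuming pandas 0.25 or later for the keyword form of agg. The data below is made up for illustration; only the column names ColumnName and D come from the question.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the question.
df = pd.DataFrame({
    "ColumnName": ["x", "x", "y", "y", "y"],
    "D": [10, 20, 30, 40, 50],
})

grouped = df.groupby("ColumnName")

# Each keyword names an output column; the string selects the groupby method.
result = grouped["D"].agg(result1="first", result2="last")
print(result)
```

Passing the names as strings sidesteps the question of which first is in scope, since pandas resolves the string to the groupby method itself.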
    
  • 2020-12-08 07:10

    I would use a custom aggregator as shown below.

    import pandas as pd

    d = pd.DataFrame([[1, "man"], [1, "woman"], [1, "girl"], [2, "man"], [2, "woman"]],
                     columns=["number", "family"])
    d
    

    Here is the output:

        number family
     0       1    man
     1       1  woman
     2       1   girl
     3       2    man
     4       2  woman
    

    Now aggregate, taking the first and last element of each group.

    # named aggregation: output column = (input column, function)
    d.groupby(by="number").agg(firstFamily=("family", lambda x: x.iloc[0]),
                               lastFamily=("family", lambda x: x.iloc[-1]))
    

    The output of this aggregation is shown below.

           firstFamily lastFamily
    number                       
    1              man       girl
    2              man      woman
    

    I hope this helps.

  • 2020-12-08 07:14

    I'm not sure if this is really the issue, but sum and min are Python built-ins that take an iterable as input, whereas first is a method of the pandas Series object, so it may simply not be in your namespace. Moreover, it takes a different argument (according to the docs, a time offset).

    I guess one way around it is to define your own first function that takes a Series object as input, e.g.:

    def first(series, offset):
        # delegates to the time-series method Series.first(offset)
        return series.first(offset)
    

    or something like that..
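
As a rough sketch of that idea: agg accepts any callable that takes a group's Series and returns a scalar, so a small named helper is enough. The helper names first_value and last_value and the sample data are hypothetical, and the helpers use positional indexing (iloc) rather than Series.first(offset), which is a time-series method and does something different.

```python
import pandas as pd

def first_value(series):
    """Return the first element of a group, by position."""
    return series.iloc[0]

def last_value(series):
    """Return the last element of a group, by position."""
    return series.iloc[-1]

# Hypothetical sample data.
df = pd.DataFrame({
    "number": [1, 1, 1, 2, 2],
    "family": ["man", "woman", "girl", "man", "woman"],
})

# With a list of functions, the output columns are named after the functions.
result = df.groupby("number")["family"].agg([first_value, last_value])
print(result)
```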

  • 2020-12-08 07:18

    I think the issue is that there are two different first methods which share a name but act differently, one is for groupby objects and another for a Series/DataFrame (to do with timeseries).

    To replicate the behaviour of the groupby first method over a DataFrame using agg, you could use iloc[0], which gets the first row of each group (a DataFrame or Series) by position:

    grouped.agg(lambda x: x.iloc[0])
    

    For example:

    In [1]: df = pd.DataFrame([[1, 2], [3, 4]])
    
    In [2]: g = df.groupby(0)
    
    In [3]: g.first()
    Out[3]: 
       1
    0   
    1  2
    3  4
    
    In [4]: g.agg(lambda x: x.iloc[0])
    Out[4]: 
       1
    0   
    1  2
    3  4
    

    Analogously you can replicate last using iloc[-1].

    Note: This also works column-wise:

    g.agg({1: lambda x: x.iloc[0]})
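
A small sketch of the column-wise form, with a different function per column. The frame and the column names key, a and b are made up for illustration.

```python
import pandas as pd

# Hypothetical data with two value columns.
df = pd.DataFrame({
    "key": [1, 1, 2, 2],
    "a": [10, 11, 12, 13],
    "b": ["p", "q", "r", "s"],
})

# Map each column to its own aggregation: first of "a", last of "b".
per_column = df.groupby("key").agg({
    "a": lambda x: x.iloc[0],
    "b": lambda x: x.iloc[-1],
})
print(per_column)
```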
    

    In older versions of pandas you would use the irow method (e.g. x.irow(0)); see previous edits.


    A couple of updated notes:

    This is better done using the nth groupby method, which is much faster in pandas >= 0.13:

    g.nth(0)  # first
    g.nth(-1)  # last
    

    You have to take a little care, as the default behaviour for first and last ignores NaN values... and IIRC for DataFrame groupbys it was broken pre-0.13... there's a dropna option for nth.
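
To make the NaN caveat concrete, here is a minimal sketch on made-up data (the column names key and val are hypothetical). It compares first/last with the positional alternatives. The index layout of the nth result differs between pandas versions, so the sketch selects a single column and looks only at the values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "a" starts with NaN, group "b" ends with NaN.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "val": [np.nan, 1.0, 2.0, np.nan],
})
g = df.groupby("key")["val"]

first_skipna = g.first()                  # first non-NaN value per group
first_pos = g.agg(lambda s: s.iloc[0])    # first value by position, NaN included
nth_first = g.nth(0)                      # first row by position, NaN included

last_skipna = g.last()                    # last non-NaN value per group
last_pos = g.agg(lambda s: s.iloc[-1])    # last value by position, NaN included

print(first_skipna.tolist(), first_pos.tolist(), nth_first.tolist())
print(last_skipna.tolist(), last_pos.tolist())
```

If the first or last row of a group is wanted regardless of missing values, the positional forms are the ones to reach for.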

    You can use the strings rather than built-ins (though IIRC pandas spots it's the sum builtin and applies np.sum):

    grouped['D'].agg(result1="sum", result2="mean")  # keyword form; the dict-renaming form was removed in pandas 1.0
    