Apply vs transform on a group object

别跟我提以往 2020-11-22 15:04

Consider the following dataframe:

     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.8338         


        
4 answers
  •  时光取名叫无心
    2020-11-22 15:14

    I was similarly confused about .transform versus .apply, and I found a few answers that shed some light on the issue. This answer, for example, was very helpful.

    My takeaway so far is that .transform works on each Series (column) in isolation from the others. This means that in your last two calls:

    df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
    

    You asked .transform to take values from two columns, but it does not actually 'see' both of them at the same time. transform looks at the dataframe columns one by one and returns a Series (or a group of Series) made of scalars, each repeated len(input_column) times.
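
    If the goal behind those two calls is the per-group mean of C - D, one possible workaround is sketched below. It uses made-up toy data (not the question's random values), and the names diff, mean_diff and per_group are illustrative only. The idea is either to build the difference before grouping, or to use .apply, which receives each group as a whole DataFrame.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# The row-wise difference needs no grouping at all.
diff = df["C"] - df["D"]

# Group the derived Series by the key column and broadcast each group's mean
# back to every row of that group.
mean_diff = diff.groupby(df["A"]).transform("mean")
print(mean_diff.tolist())  # [-18.0, -27.0, -18.0, -27.0]

# apply receives each group as a DataFrame, so both columns are visible;
# the result is reduced to one value per group.
per_group = df.groupby("A")[["C", "D"]].apply(lambda g: (g["C"] - g["D"]).mean())
print(per_group["foo"], per_group["bar"])  # -18.0 -27.0
```

    The first form keeps the original row index, so it can be assigned straight back as a new column; the second is indexed by the group key.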

    The scalar that .transform repeats to build the Series is the result of a reduction function applied to one input Series (ONE column) at a time.
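
    A small sketch of that broadcasting, on made-up toy data (the values and the name group_means are assumptions for illustration): the lambda reduces each column of each group to a single number, and .transform repeats that number for every row of the group.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Each numeric column of each group is reduced to its mean; transform then
# repeats that scalar so the result has the same shape as the input columns.
group_means = df.groupby("A")[["C", "D"]].transform(lambda x: x.mean())
print(group_means)

# Rows 0 and 2 ('foo') share one value per column, as do rows 1 and 3 ('bar').
print(group_means["C"].tolist())  # [2.0, 3.0, 2.0, 3.0]
```

    Note that the result keeps the original row order and index, which is what makes it safe to assign back to the dataframe.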

    Consider this example (on your dataframe):

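    # zscore references nothing outside 'x'; for transform, 'x' is one column.
    zscore = lambda x: (x - x.mean()) / x.std()
    # Select the numeric columns explicitly: recent pandas versions no longer
    # silently drop non-numeric columns such as 'B'.
    df.groupby('A')[['C', 'D']].transform(zscore)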
    

    will yield:

           C      D
    0  0.989  0.128
    1 -0.478  0.489
    2  0.889 -0.589
    3 -0.671 -1.150
    4  0.034 -0.285
    5  1.149  0.662
    6 -1.404 -0.907
    7 -0.509  1.653
    

    This is exactly the same as applying it to one column at a time:

    df.groupby('A')['C'].transform(zscore)
    

    yielding:

    0    0.989
    1   -0.478
    2    0.889
    3   -0.671
    4    0.034
    5    1.149
    6   -1.404
    7   -0.509
    

    Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried it on the whole dataframe:

    df.groupby('A').apply(zscore)
    

    which gives the error:

    ValueError: operands could not be broadcast together with shapes (6,) (2,)
    

    So where else is .transform useful? The simplest case is assigning the result of a reduction function back to the original dataframe.

    df['sum_C'] = df.groupby('A')['C'].transform('sum')
    df.sort_values('A') # to clearly see that the scalar ('sum') applies to every row of the group
    

    yielding:

         A      B      C      D  sum_C
    1  bar    one  1.998  0.593  3.973
    3  bar  three  1.287 -0.639  3.973
    5  bar    two  0.687 -1.027  3.973
    4  foo    two  0.205  1.274  4.373
    2  foo    two  0.128  0.924  4.373
    6  foo    one  2.113 -0.516  4.373
    7  foo  three  0.657 -1.179  4.373
    0  foo    one  1.270  0.201  4.373
    

    Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series (one value per group) that pandas does not know how to broadcast back to the original rows:

    df.groupby('A')['C'].apply(sum)
    

    giving:

    A
    bar    3.973
    foo    4.373
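
    To make the difference concrete, here is a minimal sketch on made-up toy data (the values and the names reduced, misaligned, by_map and by_transform are illustrative assumptions). It shows why assigning the reduced Series produces NaNs, and how mapping the key column through it reproduces by hand what .transform does in one step.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Reduced result: one value per group, indexed by the group key ('bar', 'foo').
reduced = df.groupby("A")["C"].sum()

# Assignment aligns on the index; the row labels 0..3 never match
# 'bar'/'foo', so every entry of the new column is NaN.
misaligned = df.assign(sum_C=reduced)
print(misaligned["sum_C"].isna().all())  # True

# Broadcasting by hand: look up each row's key in the reduced Series.
by_map = df["A"].map(reduced)

# transform does the reduce-and-broadcast in a single step.
by_transform = df.groupby("A")["C"].transform("sum")
print(by_transform.tolist())  # [4.0, 6.0, 4.0, 6.0]
```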
    

    There are also cases where .transform is used to filter the data:

    df[df.groupby(['B'])['D'].transform('sum') < -1]
    
         A      B      C      D
    3  bar  three  1.287 -0.639
    7  foo  three  0.657 -1.179
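
    For comparison, the same selection can also be written with groupby(...).filter, which keeps or drops whole groups based on one boolean per group. The sketch below uses made-up toy data (the values and the names mask, via_transform and via_filter are assumptions for illustration) and checks that both routes keep the same rows.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "B": ["one", "one", "two", "three", "three"],
    "D": [0.5, 0.7, -0.2, -0.6, -1.1],
})

# Route 1: broadcast each group's sum to its rows, then build a boolean mask.
mask = df.groupby("B")["D"].transform("sum") < -1
via_transform = df[mask]

# Route 2: let filter evaluate one condition per group and keep the
# rows of every group for which it is True.
via_filter = df.groupby("B").filter(lambda g: g["D"].sum() < -1)

# Only group 'three' has a sum below -1, so both keep rows 3 and 4.
print(via_transform.index.tolist())  # [3, 4]
print(via_filter.index.tolist())     # [3, 4]
```

    The mask form is handy when the broadcast value is needed for something else as well; filter reads more directly when whole groups are being kept or dropped.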
    

    I hope this adds a bit more clarity.
