Pandas transform() vs apply()

我与影子孤独终老i 提交于 2019-12-21 07:26:28

问题


I don't understand why apply and transform return different dtypes when called on the same data frame. The way I explained the two functions to myself before went something along the lines of "apply collapses the data, and transform does exactly the same thing as apply but preserves the original index and doesn't collapse." Consider the following.

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [1,1,0,0,1,0,0,0,0,1]})

Let's identify those ids which have a nonzero entry in the cat column.

>>> df.groupby('id')['cat'].apply(lambda x: (x == 1).any())
id
1     True
2     True
3    False
4     True
Name: cat, dtype: bool

Great. If we wanted to create an indicator column, however, we could do the following.

>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: cat, dtype: int64

I don't understand why the dtype is now int64 instead of the boolean returned by the any() function.

When I change the original data frame to contain some booleans (note that the zeros remain), the transform approach returns booleans in an object column. This is an extra mystery to me since all of the values are boolean, but it's listed as object apparently to match the dtype of the original mixed-type column of integers and booleans.

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [True,True,0,0,True,0,0,0,0,True]})

>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8    False
9     True
Name: cat, dtype: object

However, when I use all booleans, the transform function returns a boolean column.

df = pd.DataFrame({'id': [1,1,1,2,2,2,2,3,3,4],
                   'cat': [True,True,False,False,True,False,False,False,False,True]})

>>> df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7    False
8    False
9     True
Name: cat, dtype: bool

Using my acute pattern-recognition skills, it appears that the dtype of the resulting column mirrors that of the original column. I would appreciate any hints about why this occurs or what's going on under the hood in the transform function. Cheers.


回答1:


It looks like SeriesGroupBy.transform() tries to cast the result dtype to the same one as the original column has, but DataFrameGroupBy.transform() doesn't seem to do that:

In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: cat, dtype: int64

#                         v       v
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
     cat
0   True
1   True
2   True
3   True
4   True
5   True
6   True
7  False
8  False
9   True

In [141]: df.dtypes
Out[141]:
cat    int64
id     int64
dtype: object



回答2:


Just adding another illustrative example with sum as I find it more explicit:

df = (
    pd.DataFrame(pd.np.random.rand(10, 3), columns=['a', 'b', 'c'])
        .assign(a=lambda df: df.a > 0.5)
)

Out[70]: 
       a         b         c
0  False  0.126448  0.487302
1  False  0.615451  0.735246
2  False  0.314604  0.585689
3  False  0.442784  0.626908
4  False  0.706729  0.508398
5  False  0.847688  0.300392
6  False  0.596089  0.414652
7  False  0.039695  0.965996
8   True  0.489024  0.161974
9  False  0.928978  0.332414

df.groupby('a').apply(sum)  # drop rows

         a         b         c
a                             
False  0.0  4.618465  4.956997
True   1.0  0.489024  0.161974


df.groupby('a').transform(sum)  # keep dims

          b         c
0  4.618465  4.956997
1  4.618465  4.956997
2  4.618465  4.956997
3  4.618465  4.956997
4  4.618465  4.956997
5  4.618465  4.956997
6  4.618465  4.956997
7  4.618465  4.956997
8  0.489024  0.161974
9  4.618465  4.956997

However when applied to pd.DataFrame and not pd.GroupBy object I was not able to see any difference.



来源:https://stackoverflow.com/questions/41476436/pandas-transform-vs-apply

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!