df.groupby(…).agg(set) produces a different result than df.groupby(…).agg(lambda x: set(x))


While answering this question, it turned out that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results.

2 Answers

情歌与酒

2021-02-07 08:47

Perhaps, as @Edchum commented, agg applies Python built-in functions to each group as a whole, treating the group as a mini DataFrame, whereas a user-defined function is applied to every column separately. Passing print illustrates the difference.

df.groupby('user_id').agg(print,end='\n\n')

  class_type instructor  user_id
0  Krav Maga        Bob        1
4   Ju-jitsu      Alice        1

  class_type instructor  user_id
1       Yoga      Alice        2
5  Krav Maga      Alice        2

  class_type instructor  user_id
2   Ju-jitsu        Bob        3
6     Karate        Bob        3


df.groupby('user_id').agg(lambda x : print(x,end='\n\n'))

0    Krav Maga
4     Ju-jitsu
Name: class_type, dtype: object

1         Yoga
5    Krav Maga
Name: class_type, dtype: object

2    Ju-jitsu
6      Karate
Name: class_type, dtype: object

3    Krav Maga
Name: class_type, dtype: object

...

This is probably why applying set gave the result mentioned above.
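The two call patterns described above can be imitated by hand, without relying on how a particular pandas version dispatches inside agg. This is only a sketch: the DataFrame is assumed from the six rows visible in the printed groups (the row for user_id 4 is left out because its instructor is not shown), and describe is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Six rows copied from the printed groups above; the user_id 4 row is omitted
# because its instructor value does not appear in the output.
df = pd.DataFrame(
    {
        "class_type": ["Krav Maga", "Yoga", "Ju-jitsu", "Ju-jitsu", "Krav Maga", "Karate"],
        "instructor": ["Bob", "Alice", "Bob", "Alice", "Alice", "Bob"],
        "user_id": [1, 2, 3, 1, 2, 3],
    },
    index=[0, 1, 2, 4, 5, 6],
)


def describe(obj):
    """Hypothetical helper: label what kind of object a callback received."""
    if isinstance(obj, pd.DataFrame):
        return "DataFrame with columns " + str(list(obj.columns))
    return "Series named " + repr(obj.name)


# Pattern 1: the callable receives each group as a whole sub-DataFrame.
whole_group_calls = [describe(group) for _, group in df.groupby("user_id")]

# Pattern 2: the callable receives each column of each group separately.
per_column_calls = [
    describe(group[col])
    for _, group in df.groupby("user_id")
    for col in ["class_type", "instructor"]
]

print(whole_group_calls[0])
print(per_column_calls[:2])
```

Pattern 1 corresponds to the first print output (one block per group), and pattern 2 to the second (one Series per column per group).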


忘了有多久

2021-02-07 08:48

What is happening here is that set is not handled in _aggregate, because it is not is_list_like:

elif is_list_like(arg) and arg not in compat.string_types:

see source

Because set is not is_list_like, _aggregate returns None up the call chain, and execution ends up at this line:

results.append(colg.aggregate(a))

see source

This raises TypeError: 'type' object is not iterable

which then raises:

if not len(results):
    raise ValueError("no results")

see source

Because there are no results, we end up calling _aggregate_generic:

see source

this then calls:

result[name] = self._try_cast(func(data, *args, **kwargs), data)

see source

This then ends up as:

(Pdb) n
> c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
-> return self._wrap_generic_output(result, obj)

(Pdb) result
{1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}

I'm running a slightly different version of pandas, but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779
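The shape of that result dict can be reproduced outside agg by calling set on each group directly. This is a sketch under assumptions: the DataFrame holds only the six rows fully visible in the first answer's output, so group 4 is absent, and the loop stands in for the generic fallback rather than calling pandas internals.

```python
import pandas as pd

# Assumed sample data: the six rows visible in the printed groups
# (the user_id 4 row is omitted because its instructor is not shown).
df = pd.DataFrame(
    {
        "class_type": ["Krav Maga", "Yoga", "Ju-jitsu", "Ju-jitsu", "Krav Maga", "Karate"],
        "instructor": ["Bob", "Alice", "Bob", "Alice", "Alice", "Bob"],
        "user_id": [1, 2, 3, 1, 2, 3],
    },
    index=[0, 1, 2, 4, 5, 6],
)

# Iterating a DataFrame yields its column labels, so set(group) collects
# labels rather than cell values.
labels_per_group = {int(name): set(group) for name, group in df.groupby("user_id")}

# Iterating a Series yields its values, so set(group[col]) collects the data.
values_per_group = {
    int(name): set(group["class_type"]) for name, group in df.groupby("user_id")
}

print(labels_per_group)
print(values_per_group)
```

The first dict has the same form as the pdb output: every group maps to the set of column names. The second shows what a per-column call collects instead.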

So essentially, because set counts as neither a function nor a list-like here, the call collapses to invoking the set constructor on each group's DataFrame. Iterating a DataFrame yields its column labels, so every group ends up with the set of column names. You can see the same effect here:

In [8]:

df.groupby('user_id').agg(lambda x: print(set(x.columns)))
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
Out[8]: 
        class_type instructor
user_id                      
1             None       None
2             None       None
3             None       None
4             None       None

When you use the lambda, which is an anonymous function, it works as expected.
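One way to make the intent explicit is to select the column first and aggregate it with the lambda, so the callable only ever receives a Series. This is a minimal sketch, not a statement about how every pandas version treats a bare set: the trace above was made against pandas 0.22, and the sample data is assumed from the six rows visible in the first answer's output.

```python
import pandas as pd

# Assumed sample data: the six rows visible in the printed groups
# (the user_id 4 row is omitted because its instructor is not shown).
df = pd.DataFrame(
    {
        "class_type": ["Krav Maga", "Yoga", "Ju-jitsu", "Ju-jitsu", "Krav Maga", "Karate"],
        "instructor": ["Bob", "Alice", "Bob", "Alice", "Alice", "Bob"],
        "user_id": [1, 2, 3, 1, 2, 3],
    },
    index=[0, 1, 2, 4, 5, 6],
)

# Selecting one column gives a SeriesGroupBy, so the lambda is called with a
# Series per group and set(x) collects that column's values.
classes = df.groupby("user_id")["class_type"].agg(lambda x: set(x))

# Convert to a plain dict for easy inspection.
classes_by_user = classes.to_dict()
print(classes_by_user)
```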
