Python Pandas : pivot table with aggfunc = count unique distinct

谎友^ 2020-12-07 13:02
df2 = pd.DataFrame({'X' : ['X1', 'X1', 'X1', 'X1'], 'Y' : ['Y2','Y1','Y1','Y1'], 'Z' : ['Z3','Z1','Z1','Z2']})

    X   Y   Z
0  X1  Y2  Z3
1  X1  Y1  Z1
2  X1  Y1  Z1
3  X1  Y1  Z2

8 answers
  • 2020-12-07 13:29

    Here is another solution to this problem that works with recent versions of pandas, using pd.crosstab:

    In [1]:
    import pandas as pd
    
    # Set up example
    df2 = pd.DataFrame({'X' : ['X1', 'X1', 'X1', 'X1'], 'Y' : ['Y2','Y1','Y1','Y1'], 'Z' : ['Z3','Z1','Z1','Z2']})
    
    # Pivot
    pd.crosstab(index=df2['Y'], columns=df2['Z'], values=df2['X'], aggfunc=pd.Series.nunique)
    
    Out [1]:
    Z   Z1  Z2  Z3
    Y           
    Y1  1.0 1.0 NaN
    Y2  NaN NaN 1.0
    
  • 2020-12-07 13:38

    aggfunc=pd.Series.nunique counts only the unique values in a Series, which here means the distinct values of the column within each group. It is a distinct count, so it is not a drop-in alternative to aggfunc='count'.

    For simple counting, it is better to use aggfunc=pd.Series.count.
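
    To make the difference concrete, here is a small sketch on the question's df2, assuming a recent pandas where pivot_table takes index= and columns=. The variable names distinct and counted are only illustrative. The ('Y1', 'Z1') cell holds two rows that share a single X value, so the two aggregations disagree there.

```python
import pandas as pd

df2 = pd.DataFrame({
    "X": ["X1", "X1", "X1", "X1"],
    "Y": ["Y2", "Y1", "Y1", "Y1"],
    "Z": ["Z3", "Z1", "Z1", "Z2"],
})

# Distinct count: how many different X values fall in each (Y, Z) cell
distinct = df2.pivot_table(values="X", index="Y", columns="Z",
                           aggfunc=pd.Series.nunique)

# Plain count: how many non-null X entries fall in each (Y, Z) cell
counted = df2.pivot_table(values="X", index="Y", columns="Z",
                          aggfunc="count")

print(distinct.loc["Y1", "Z1"])  # 1.0 (only 'X1' appears)
print(counted.loc["Y1", "Z1"])   # 2.0 (two rows land in this cell)
```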

  • 2020-12-07 13:40

    Do you mean something like this?

    In [39]: df2.pivot_table(values='X', index='Y', columns='Z',
                             aggfunc=lambda x: len(x.unique()))
    Out[39]: 
    Z   Z1  Z2  Z3
    Y             
    Y1   1   1 NaN
    Y2 NaN NaN   1
    

    Note that using len assumes you don't have NAs in your DataFrame. You can do x.value_counts().count() or len(x.dropna().unique()) otherwise.
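
    The NA caveat can be checked on a single group. This is a minimal sketch with a made-up Series x (not taken from the question) that contains one missing value. It compares the three expressions mentioned above with Series.nunique, which drops NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical group: one distinct value plus one missing entry
x = pd.Series(["X1", "X1", np.nan])

with_nan = len(x.unique())               # unique() keeps NaN as its own value
via_value_counts = x.value_counts().count()  # value_counts() drops NaN by default
via_dropna = len(x.dropna().unique())    # drop NaN explicitly before counting
via_nunique = x.nunique()                # nunique() drops NaN by default

print(with_nan, via_value_counts, via_dropna, via_nunique)  # 2 1 1 1
```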

  • 2020-12-07 13:42

    This is a good way of counting entries within .pivot_table:

    df2.pivot_table(values='X', index=['Y','Z'], columns='X', aggfunc='count')
    
    
            X1  X2
    Y   Z       
    Y1  Z1   1   1
        Z2   1  NaN
    Y2  Z3   1  NaN
    
  • 2020-12-07 13:42

    aggfunc=pd.Series.nunique provides distinct count.

    Full Code:

    df2.pivot_table(values='X', index='Y', columns='Z',
                             aggfunc=pd.Series.nunique)
    

    Credit to @hume for this solution (see comment under the accepted answer). Adding as an answer here for better discoverability.

  • 2020-12-07 13:44

    For best performance I recommend doing DataFrame.drop_duplicates followed by aggfunc='count'.

    Others are correct that aggfunc=pd.Series.nunique will work. This can be slow, however, if the number of index groups you have is large (>1000).

    So instead of (to quote @Javier)

    df2.pivot_table('X', 'Y', 'Z', aggfunc=pd.Series.nunique)
    

    I suggest

    df2.drop_duplicates(['X', 'Y', 'Z']).pivot_table('X', 'Y', 'Z', aggfunc='count')
    

    This works because it guarantees that every subgroup (each combination of ('Y', 'Z')) will have unique (non-duplicate) values of 'X'.
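
    As a quick sanity check of that claim, the sketch below runs both approaches on a small made-up frame and prints the results side by side. The frame df, with an extra 'X2' row so that one cell has two distinct values, is an assumption for illustration and not the question's data. No timing is measured here.

```python
import pandas as pd

# Illustrative data: the (Y1, Z1) cell contains X1 twice and X2 once
df = pd.DataFrame({
    "X": ["X1", "X1", "X2", "X1", "X1"],
    "Y": ["Y1", "Y1", "Y1", "Y1", "Y2"],
    "Z": ["Z1", "Z1", "Z1", "Z2", "Z3"],
})

# Distinct count computed per group
via_nunique = df.pivot_table(values="X", index="Y", columns="Z",
                             aggfunc=pd.Series.nunique)

# Remove duplicate (X, Y, Z) rows first, then a plain count per group
via_dedup = (
    df.drop_duplicates(["X", "Y", "Z"])
      .pivot_table(values="X", index="Y", columns="Z", aggfunc="count")
)

print(via_nunique)
print(via_dedup)
```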
