Pandas groupby result into multiple columns

前端 未结 3 526
渐次进展
渐次进展 2020-12-31 16:05

I have a dataframe in which I\'m looking to group and then partition the values within a group into multiple columns.

For example: say I have the following dataframe

相关标签:
3条回答
  • 2020-12-31 16:16

    Another way of doing this is to first added a "helper" column on to your data, then pivot your dataframe using the "helper" column, in the case below "ID_Count":

    Using @unutbu setup:

    import pandas as pd
    import numpy as np
    np.random.seed(2016)
    
    df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                       'ID': [1, 2, 3, 4, 5, 6],
                       'Value': np.random.randint(1, 100, 6)})
    #Create group
    grp = df.groupby('Group')
    
    #Create helper column 
    df['ID_Count'] = grp['ID'].cumcount() + 1
    
    #Pivot dataframe using helper column and add 'Value' column to pivoted output.
    df_out = df.pivot('Group','ID_Count','ID').add_prefix('ID').assign(Value = grp['Value'].sum())
    

    Output:

    ID_Count  ID1  ID2  ID3  Value
    Group                         
    A         1.0  4.0  NaN     77
    B         3.0  NaN  NaN     84
    C         2.0  5.0  6.0     86
    
    0 讨论(0)
  • 2020-12-31 16:23

    Using get_dummies and MultiLabelBinarizer (scikit-learn):

    import pandas as pd
    import numpy as np
    from sklearn import preprocessing
    df = pd.DataFrame()
    df['Group']=['A','C','B','A','C','C']
    df['ID']=[1,2,3,4,5,6]
    df['Value']=np.random.randint(1,100,6)
    
    mlb = preprocessing.MultiLabelBinarizer(classes=classes).fit([])
    
    df2 = pd.get_dummies(df, '', '', columns=['ID']).groupby(by='Group').sum()
    df3 = pd.DataFrame(mlb.inverse_transform(df2[df['ID'].unique()].values), index=df2.index)
    df3.columns = ['ID' + str(x + 1) for x in range(df3.shape[0])]
    pd.concat([df3, df2['Value']], axis=1)
    
    
           ID1  ID2  ID3  Value
    Group                      
    A        1    4  NaN     63
    B        3  NaN  NaN     59
    C        2    5    6    230
    
    0 讨论(0)
  • 2020-12-31 16:33

    You could use

    id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
    

    to create id_df without the intermediate result DataFrame.


    import pandas as pd
    import numpy as np
    np.random.seed(2016)
    
    df = pd.DataFrame({'Group': ['A', 'C', 'B', 'A', 'C', 'C'],
                       'ID': [1, 2, 3, 4, 5, 6],
                       'Value': np.random.randint(1, 100, 6)})
    
    grouped = df.groupby('Group')
    values = grouped['Value'].agg('sum')
    id_df = grouped['ID'].apply(lambda x: pd.Series(x.values)).unstack()
    id_df = id_df.rename(columns={i: 'ID{}'.format(i + 1) for i in range(id_df.shape[1])})
    result = pd.concat([id_df, values], axis=1)
    print(result)
    

    yields

           ID1  ID2  ID3  Value
    Group                      
    A        1    4  NaN     77
    B        3  NaN  NaN     84
    C        2    5    6     86
    
    0 讨论(0)
提交回复
热议问题