Converting a Pandas GroupBy output from Series to DataFrame

广开言路 2020-11-22 09:58

I'm starting with input data like this:

df1 = pandas.DataFrame( { 
    "Name" : ["Alice", "Bob", "Mallory", "Mallory", "Bob" , "Mallory"] , 
        
9 answers
  •  刺人心 (original poster)
    2020-11-22 10:49

    These solutions only partially worked for me because I was doing multiple aggregations, and I wanted to convert the resulting grouped output to a DataFrame.

    Because I wanted more than the group count that size() followed by reset_index() provides, I wrote a manual method for converting that grouped output into a DataFrame. I understand this is not the most Pythonic or pandas-idiomatic way of doing it, as it is quite verbose and explicit, but it was all I needed. Basically, use the reset_index() approach from the other answers to start a "scaffolding" DataFrame, then loop through the group pairings in the grouped object, retrieve the row positions of each group, perform your calculations against the ungrouped DataFrame, and set the values in your new aggregated DataFrame.

    df_grouped = df[['Salary Basis', 'Job Title', 'Hourly Rate', 'Male Count', 'Female Count']]
    # Group with the default as_index=True so that size() returns a Series,
    # which reset_index(name=...) can turn into a DataFrame
    df_grouped = df_grouped.groupby(['Salary Basis', 'Job Title'])
    
    # The GroupBy object gives us the row positions we want for each grouping
    # A GroupBy object is not itself a DataFrame, so we build the result manually
    # Create a new "scaffolding" dataframe with one row per group to work against
    df_aggregated = df_grouped.size().reset_index(name='Total Count')
    df_aggregated['Male Count'] = 0
    df_aggregated['Female Count'] = 0
    # Float placeholder, since the maximum hourly rate may not be an integer
    df_aggregated['Job Rate'] = 0.0
    
    def manualAggregations(indices_array):
        temp_df = df.iloc[indices_array]
        return {
            'Male Count': temp_df['Male Count'].sum(),
            'Female Count': temp_df['Female Count'].sum(),
            'Job Rate': temp_df['Hourly Rate'].max()
        }
    
    for name, group in df_grouped:
        # Positional row indices of this group in the ungrouped dataframe
        ix = df_grouped.indices[name]
        calcDict = manualAggregations(ix)
        # name is the (Salary Basis, Job Title) tuple for this group
        columns = list(name)
    
        for key in calcDict:
            df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                              (df_aggregated['Job Title'] == columns[1]), key] = calcDict[key]
    

    If a dictionary isn't your thing, the calculations could be applied inline in the for loop:

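        # Select rows and column in a single .loc call; chained indexing such as
        # df_aggregated['Male Count'].loc[...] = ... may assign to a temporary copy
        df_aggregated.loc[(df_aggregated['Salary Basis'] == columns[0]) &
                          (df_aggregated['Job Title'] == columns[1]), 'Male Count'] = df['Male Count'].iloc[ix].sum()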
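
For comparison, here is a minimal sketch of how the same kind of table could be built without the manual loop, using pandas named aggregation followed by reset_index(). The sample rows are hypothetical and only the column names come from the answer above. It is an alternative technique, not the method used in this answer, and it assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer above.
df = pd.DataFrame({
    'Salary Basis': ['Hourly', 'Hourly', 'Hourly', 'Salaried'],
    'Job Title': ['Clerk', 'Clerk', 'Driver', 'Analyst'],
    'Hourly Rate': [15.0, 17.5, 20.0, 30.0],
    'Male Count': [1, 0, 1, 0],
    'Female Count': [0, 1, 0, 1],
})

# Named aggregation: each key is an output column and each value is an
# (input column, aggregation function) pair. The ** unpacking is needed
# only because the output column names contain spaces.
df_aggregated = (
    df.groupby(['Salary Basis', 'Job Title'])
      .agg(**{
          'Total Count': ('Hourly Rate', 'size'),
          'Male Count': ('Male Count', 'sum'),
          'Female Count': ('Female Count', 'sum'),
          'Job Rate': ('Hourly Rate', 'max'),
      })
      .reset_index()  # move the group keys from the index back into columns
)

print(type(df_aggregated).__name__)
print(df_aggregated)
```

The result is an ordinary DataFrame with one row per (Salary Basis, Job Title) pair, so no per-group assignment with .loc is needed. The manual loop may still be the better fit when a calculation cannot be expressed as a per-column aggregation.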
    
