Pandas dataframe: how to apply describe() to each group and add to new columns?

执念已碎 2020-12-15 06:20

df:

name score
A      1
A      2
A      3
A      4
A      5
B      2
B      4
B      6 
B      8

Want to get the following new dataframe in

6 Answers
  • 2020-12-15 06:25

    there is even a shorter one :)

    print(df.groupby('name').describe().unstack(1))
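For readers on newer pandas: as far as I can tell, from version 0.20 onward groupby(...).describe() already returns one row per group with the statistics as columns, so the unstack step applies to the older long layout. The sketch below rebuilds that long layout with apply (an assumption made here for illustration, not part of the original answer) and then unstacks it:

```python
import pandas as pd

df = pd.DataFrame({"name": list("AAAAABBBB"),
                   "score": [1, 2, 3, 4, 5, 2, 4, 6, 8]})

# Long layout: a Series indexed by (name, statistic)
long_stats = df.groupby("name")["score"].apply(lambda s: s.describe())

# Move the statistic level (level 1) of the index into the columns
wide_stats = long_stats.unstack(1)
print(wide_stats)
```

The result has one row per name and one column per statistic, which is the shape the question asks for.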
    


  • 2020-12-15 06:40

    Nothing beats one-liner:

    In [145]:
    
    print(df.groupby('name').describe().reset_index().pivot(index='name', values='score', columns='level_1'))
    
    level_1  25%  50%  75%  count  max  mean  min       std
    name                                                   
    A        2.0    3  4.0      5    5     3    1  1.581139
    B        3.5    5  6.5      4    8     5    2  2.581989
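The pivot above depends on describe() returning a long table with a level_1 column, which I believe is the pre-0.20 layout. A minimal sketch of the same long-to-wide idea on a newer pandas, where the long table is built explicitly with melt; the column name stat is a hypothetical label chosen here:

```python
import pandas as pd

df = pd.DataFrame({"name": list("AAAAABBBB"),
                   "score": [1, 2, 3, 4, 5, 2, 4, 6, 8]})

# One row per group, one column per statistic
stats = df.groupby("name")["score"].describe()

# Long layout: one row per (name, statistic) pair
long_stats = stats.reset_index().melt(id_vars="name", var_name="stat", value_name="score")

# Pivot back to wide: groups as rows, statistics as columns
wide_stats = long_stats.pivot(index="name", columns="stat", values="score")
print(wide_stats)
```

On a recent pandas the melt and pivot round trip is redundant because stats is already wide. It is shown only to make clear what the one-liner's pivot is doing.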
    
  • 2020-12-15 06:41

    Define some data

    In[1]:
    import pandas as pd
    import io
    
    data = """
    name score
    A      1
    A      2
    A      3
    A      4
    A      5
    B      2
    B      4
    B      6
    B      8
        """
    
    df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
    print(df)
    


    Out[1]:
      name  score
    0    A      1
    1    A      2
    2    A      3
    3    A      4
    4    A      5
    5    B      2
    6    B      4
    7    B      6
    8    B      8
    

    Solution

    A nice approach to this problem uses a generator expression (see footnote) to allow pd.DataFrame() to iterate over the results of groupby, and construct the summary stats dataframe on the fly:

    In[2]:
    df2 = pd.DataFrame(group.describe().rename(columns={'score':name}).squeeze()
                             for name, group in df.groupby('name'))
    
    print(df2)
    


    Out[2]:
       count  mean       std  min  25%  50%  75%  max
    A      5     3  1.581139    1  2.0    3  4.0    5
    B      4     5  2.581989    2  3.5    5  6.5    8
    

    Here squeeze() removes a dimension, converting the one-column DataFrame of group summary statistics into a Series.
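A small sketch of what squeeze() does to a one-column result, using a made-up five-row frame rather than the grouped data:

```python
import pandas as pd

# describe() on a one-column DataFrame returns a one-column DataFrame
one_column = pd.DataFrame({"score": [1, 2, 3, 4, 5]}).describe()

# squeeze() drops the length-1 column axis and returns a Series
as_series = one_column.squeeze()

print(type(one_column).__name__)  # DataFrame
print(type(as_series).__name__)   # Series
print(as_series.name)             # score
```

The Series keeps the column label as its name, which is why the solution renames the column to the group name first: that name then becomes the row label in the final table.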

    Footnote: A generator expression has the form my_function(a) for a in iterator or, if the iterator yields two-element tuples (as groupby does), my_function(a, b) for a, b in iterator.
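Both forms from the footnote in plain Python, with made-up data standing in for the (name, group) pairs that groupby yields:

```python
# Single-variable form: my_function(a) for a in iterator
squares = (n * n for n in range(4))
squares_list = list(squares)
print(squares_list)  # [0, 1, 4, 9]

# Two-element tuples, unpacked the same way as groupby's (name, group) pairs
pairs = [("A", [1, 2, 3]), ("B", [2, 4])]
totals = dict((name, sum(values)) for name, values in pairs)
print(totals)  # {'A': 6, 'B': 6}
```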

  • 2020-12-15 06:41

    Well I managed to get what you wanted but it doesn't scale very well.

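    import pandas as pd
    
    # 5 scores for group 'a' and 4 for group 'b'
    name = ['a','a','a','a','a','b','b','b','b']
    score = [1,2,3,4,5,2,4,6,8]
    
    d = pd.DataFrame(list(zip(name,score)), columns=['Name','Score'])
    # Long-format summary: one row per (Name, statistic) pair (pandas < 0.20 layout)
    d = d.groupby('Name').describe()
    d = d.reset_index()
    # The first 8 rows belong to group 'a', the last 8 to group 'b'; the labels 'A' and 'B' are hard-coded
    df2 = pd.DataFrame(list(zip(d.level_1[8:], list(d.Score)[:8], list(d.Score)[8:])), columns = ['Name','A','B']).T
    
    print(df2)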
    
              0     1         2    3    4    5    6    7
    Name  count  mean       std  min  25%  50%  75%  max
    A         5     3  1.581139    1    2    3    4    5
    B         4     5  2.581989    2  3.5    5  6.5    8
    
  • 2020-12-15 06:44

    The table is stored in a DataFrame named df:

    df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
    

    Just select the column and call describe() on the grouped result to get the required output. The same approach works for any other column:

    df.groupby('name')['score'].describe()
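A self-contained sketch of this approach, assuming pandas 0.20 or later (where, as far as I know, describe() on a grouped column returns one row per group). The join step is an extra that reads the title's "add to new columns" as attaching the statistics to every original row; the question itself does not state that.

```python
import pandas as pd

df = pd.DataFrame({"name": list("AAAAABBBB"),
                   "score": [1, 2, 3, 4, 5, 2, 4, 6, 8]})

# One row per group, one column per statistic
stats = df.groupby("name")["score"].describe()
print(stats)

# Optional: attach each group's statistics to its original rows
joined = df.join(stats, on="name")
print(joined.head())
```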
    
  • 2020-12-15 06:44
    import pandas as pd
    import io
    import numpy as np
    
    data = """
    name score
    A      1
    A      2
    A      3
    A      4
    A      5
    B      2
    B      4
    B      6
    B      8
        """
    
    df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
    
    df2 = df.groupby('name').describe().reset_index().T.drop('name')
    arr = np.array(df2).reshape((4,8))
    
    df2 = pd.DataFrame(arr[1:], index=['name','A','B'])
    
    print(df2)
    

    That will give you df2 as:

                  0     1        2    3    4    5    6    7
        name  count  mean      std  min  25%  50%  75%  max
        A         5     3  1.58114    1    2    3    4    5
        B         4     5  2.58199    2  3.5    5  6.5    8
    