Frequency tables in pandas (like plyr in R)

Asked by 不思量自难忘° on 2020-12-28 18:53 · 4 answers · 2227 views

My problem is how to calculate frequencies on multiple variables in pandas. I start from this dataframe:

d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
                   'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])


        
4 Answers
  • 2020-12-28 19:21

    This:

    d1.groupby('ExamenYear').agg({'Participated': len, 
                                  'Passed': lambda x: sum(x == 'yes')})
    

    doesn't look way more awkward than the R solution, IMHO.
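    On newer pandas versions (0.25+ assumed), the same table can also be written with named aggregation, which reads closer to the plyr style. A sketch using the question's data (only the relevant columns reproduced here):

```python
import pandas as pd

# Raw data from the question (relevant columns only)
d1 = pd.DataFrame({
    'ExamenYear':   ['2007', '2007', '2007', '2008', '2008',
                     '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed':       ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'no', 'yes'],
})

# One keyword per output column: (source column, aggregation)
out = d1.groupby('ExamenYear').agg(
    Participated=('Participated', 'size'),            # group size, same as len
    Passed=('Passed', lambda s: (s == 'yes').sum()),  # count of 'yes'
)
print(out)
```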

  • 2020-12-28 19:23

    You may use pandas crosstab function, which by default computes a frequency table of two or more variables. For example,

    > import pandas as pd
    > pd.crosstab(d1['ExamenYear'], d1['Passed'])
    Passed      no  yes
    ExamenYear         
    2007         1    2
    2008         1    3
    2009         1    2
    

    Use the margins=True option if you also want to see the subtotal of each row and column.

    > pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
    Participated  no  yes  All
    ExamenYear                
    2007           1    2    3
    2008           1    3    4
    2009           0    3    3
    All            2    8   10
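    If proportions are wanted instead of raw counts, crosstab also takes a normalize argument. A sketch on the question's data (normalize='index' divides each row by its row total):

```python
import pandas as pd

# Subset of the question's data: year and pass/fail outcome
d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008',
                   '2008', '2008', '2009', '2009', '2009'],
    'Passed':     ['no', 'yes', 'yes', 'yes', 'no',
                   'yes', 'yes', 'yes', 'no', 'yes'],
})

# Each row now sums to 1.0 instead of to the row count
prop = pd.crosstab(d1['ExamenYear'], d1['Passed'], normalize='index')
print(prop)
```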
    
  • 2020-12-28 19:27

    There is another approach that I like to use for similar problems, it uses groupby and unstack:

    d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6",   "x7",     "x8", "x9"],
                       'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                       'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
                       'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                       'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
                       'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
                      columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
    

    (this is just the raw data from above)

    d2 = d1.groupby("ExamenYear").Participated.value_counts().unstack(fill_value=0)['yes']
    d3 = d1.groupby("ExamenYear").Passed.value_counts().unstack(fill_value=0)['yes']
    d2.name = "Participated"
    d3.name = "Passed"
    
    pd.DataFrame(data=[d2,d3]).T
                Participated  Passed
    ExamenYear                      
    2007                   2       2
    2008                   3       3
    2009                   3       2
    

    This solution is slightly more cumbersome than the one using apply, but it is easier to understand and extend, I feel.
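    Since both frequencies are just counts of 'yes' values, the same table can also be reached more directly by comparing the columns to 'yes' and summing the booleans per group. A sketch on the same raw data (relevant columns only):

```python
import pandas as pd

# Raw data from the question (relevant columns only)
d1 = pd.DataFrame({
    'ExamenYear':   ['2007', '2007', '2007', '2008', '2008',
                     '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed':       ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'no', 'yes'],
})

# Boolean comparison, then sum the True values within each year
counts = (d1[['Participated', 'Passed']] == 'yes').groupby(d1['ExamenYear']).sum()
print(counts)
```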

  • 2020-12-28 19:41

    I finally decided to use apply.

    I am posting what I came up with hoping that it can be useful for others.

    From what I understand from Wes McKinney's book "Python for Data Analysis":

    • apply is more flexible than agg and transform because you can define your own function.
    • The only requirement is that the function returns a pandas object or a scalar value.
    • The inner mechanics: the function is called on each piece of the grouped object, and the results are glued together using pandas.concat.
    • One needs to "hard-code" the structure you want at the end.

    Here is what I came up with

    def ZahlOccurence_0(x):
        return pd.Series({'All': len(x['StudentID']),
                          'Part': sum(x['Participated'] == 'yes'),
                          'Pass': sum(x['Passed'] == 'yes')})
    

    when I run it :

    d1.groupby('ExamenYear').apply(ZahlOccurence_0)
    

    I get the correct results

                All  Part  Pass
    ExamenYear                 
    2007          3     2     2
    2008          4     3     3
    2009          3     3     2
    

    This approach would also allow me to combine frequencies with other statistics:

    import numpy as np
    d1['testValue'] = np.random.randn(len(d1))
    
    def ZahlOccurence_1(x):
        return pd.Series({'All': len(x['StudentID']),
                          'Part': sum(x['Participated'] == 'yes'),
                          'Pass': sum(x['Passed'] == 'yes'),
                          'test': x['testValue'].mean()})
    
    
    d1.groupby('ExamenYear').apply(ZahlOccurence_1)
    
    
                All  Part  Pass      test
    ExamenYear                           
    2007          3     2     2  0.358702
    2008          4     3     3  1.004504
    2009          3     3     2  0.521511
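    For completeness, the same combined table can be produced without apply via named aggregation (a sketch, assuming pandas >= 0.25; since testValue is random, the test column's numbers will differ on each run):

```python
import numpy as np
import pandas as pd

# Raw data from the question, plus a random numeric column
d1 = pd.DataFrame({
    'StudentID':    ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
    'ExamenYear':   ['2007', '2007', '2007', '2008', '2008',
                     '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed':       ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'no', 'yes'],
})
d1['testValue'] = np.random.randn(len(d1))

# One (source column, aggregation) pair per output column
out = d1.groupby('ExamenYear').agg(
    All=('StudentID', 'size'),
    Part=('Participated', lambda s: (s == 'yes').sum()),
    Pass=('Passed', lambda s: (s == 'yes').sum()),
    test=('testValue', 'mean'),
)
print(out)
```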
    

    I hope someone else will find this useful.
