Frequency tables in pandas (like plyr in R)

Asked by 不思量自难忘° on 2020-12-28 18:53 · 4 answers · 2227 views

My problem is how to calculate frequencies on multiple variables in pandas. I start from this dataframe:

d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
                   'StudentGender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                   'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
                   'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                   'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
                   'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
                  columns=['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])


        
4 Answers
  • 2020-12-28 19:21

    This:

    d1.groupby('ExamenYear').agg({'Participated': len, 
                                  'Passed': lambda x: sum(x == 'yes')})
    

    doesn't look way more awkward than the R solution, IMHO.
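    On newer pandas versions (0.25+ assumed), the same table can also be written with named aggregation, which reads closer to the plyr style. A sketch using the question's data (only the relevant columns reproduced here):

```python
import pandas as pd

# Raw data from the question (relevant columns only)
d1 = pd.DataFrame({
    'ExamenYear':   ['2007', '2007', '2007', '2008', '2008',
                     '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed':       ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'no', 'yes'],
})

# One keyword per output column: (source column, aggregation)
out = d1.groupby('ExamenYear').agg(
    Participated=('Participated', 'size'),            # group size, same as len
    Passed=('Passed', lambda s: (s == 'yes').sum()),  # count of 'yes'
)
print(out)
```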

  • 2020-12-28 19:23

    You may use pandas crosstab function, which by default computes a frequency table of two or more variables. For example,

    > import pandas as pd
    > pd.crosstab(d1['ExamenYear'], d1['Passed'])
    Passed      no  yes
    ExamenYear         
    2007         1    2
    2008         1    3
    2009         1    2
    

    Use the margins=True option if you also want to see the subtotal of each row and column.

    > pd.crosstab(d1['ExamenYear'], d1['Participated'], margins=True)
    Participated  no  yes  All
    ExamenYear                
    2007           1    2    3
    2008           1    3    4
    2009           0    3    3
    All            2    8   10
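    If proportions are wanted instead of raw counts, crosstab also takes a normalize argument. A sketch on the question's data (normalize='index' divides each row by its row total):

```python
import pandas as pd

# Subset of the question's data: year and pass/fail outcome
d1 = pd.DataFrame({
    'ExamenYear': ['2007', '2007', '2007', '2008', '2008',
                   '2008', '2008', '2009', '2009', '2009'],
    'Passed':     ['no', 'yes', 'yes', 'yes', 'no',
                   'yes', 'yes', 'yes', 'no', 'yes'],
})

# Each row now sums to 1.0 instead of to the row count
prop = pd.crosstab(d1['ExamenYear'], d1['Passed'], normalize='index')
print(prop)
```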
    
  • 2020-12-28 19:27

    There is another approach that I like to use for similar problems, it uses groupby and unstack:

    d1 = pd.DataFrame({'StudentID': ["x1", "x10", "x2","x3", "x4", "x5", "x6",   "x7",     "x8", "x9"],
                       'StudentGender' : ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
                       'ExamenYear': ['2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'],
                       'Exam': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
                       'Participated': ['no','yes','yes','yes','no','yes','yes','yes','yes','yes'],
                       'Passed': ['no','yes','yes','yes','no','yes','yes','yes','no','yes']},
                      columns = ['StudentID', 'StudentGender', 'ExamenYear', 'Exam', 'Participated', 'Passed'])
    

    (this is just the raw data from above)

    d2 = d1.groupby("ExamenYear").Participated.value_counts().unstack(fill_value=0)['yes']
    d3 = d1.groupby("ExamenYear").Passed.value_counts().unstack(fill_value=0)['yes']
    d2.name = "Participated"
    d3.name = "Passed"
    
    pd.DataFrame(data=[d2,d3]).T
                Participated  Passed
    ExamenYear                      
    2007                   2       2
    2008                   3       3
    2009                   3       2
    

    This solution is slightly more cumbersome than the one using apply, but it is easier to understand and extend, I feel.
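    Since both frequencies are just counts of 'yes' values, the same table can also be reached more directly by comparing the columns to 'yes' and summing the booleans per group. A sketch on the same raw data (relevant columns only):

```python
import pandas as pd

# Raw data from the question (relevant columns only)
d1 = pd.DataFrame({
    'ExamenYear':   ['2007', '2007', '2007', '2008', '2008',
                     '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed':       ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'no', 'yes'],
})

# Boolean comparison, then sum the True values within each year
counts = (d1[['Participated', 'Passed']] == 'yes').groupby(d1['ExamenYear']).sum()
print(counts)
```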

  • 2020-12-28 19:41

    I finally decided to use apply.

    I am posting what I came up with hoping that it can be useful for others.

    From what I understand from Wes McKinney's book "Python for Data Analysis":

    • apply is more flexible than agg and transform because you can define your own function.
    • The only requirement is that the function returns a pandas object or a scalar value.
    • The inner mechanics: the function is called on each piece of the grouped object, and the results are glued together using pandas.concat.
    • One needs to "hard-code" the structure you want at the end.

    Here is what I came up with

    def ZahlOccurence_0(x):
        return pd.Series({'All': len(x['StudentID']),
                          'Part': sum(x['Participated'] == 'yes'),
                          'Pass': sum(x['Passed'] == 'yes')})
    

    when I run it :

    d1.groupby('ExamenYear').apply(ZahlOccurence_0)
    

    I get the correct results

                All  Part  Pass
    ExamenYear                 
    2007          3     2     2
    2008          4     3     3
    2009          3     3     2
    

    This approach would also allow me to combine frequencies with other statistics:

    import numpy as np
    d1['testValue'] = np.random.randn(len(d1))
    
    def ZahlOccurence_1(x):
        return pd.Series({'All': len(x['StudentID']),
                          'Part': sum(x['Participated'] == 'yes'),
                          'Pass': sum(x['Passed'] == 'yes'),
                          'test': x['testValue'].mean()})
    
    
    d1.groupby('ExamenYear').apply(ZahlOccurence_1)
    
    
                All  Part  Pass      test
    ExamenYear                           
    2007          3     2     2  0.358702
    2008          4     3     3  1.004504
    2009          3     3     2  0.521511
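    For completeness, the same combined table can be produced without apply via named aggregation (a sketch, assuming pandas >= 0.25; since testValue is random, the test column's numbers will differ on each run):

```python
import numpy as np
import pandas as pd

# Raw data from the question, plus a random numeric column
d1 = pd.DataFrame({
    'StudentID':    ["x1", "x10", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9"],
    'ExamenYear':   ['2007', '2007', '2007', '2008', '2008',
                     '2008', '2008', '2009', '2009', '2009'],
    'Participated': ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'yes', 'yes'],
    'Passed':       ['no', 'yes', 'yes', 'yes', 'no',
                     'yes', 'yes', 'yes', 'no', 'yes'],
})
d1['testValue'] = np.random.randn(len(d1))

# One (source column, aggregation) pair per output column
out = d1.groupby('ExamenYear').agg(
    All=('StudentID', 'size'),
    Part=('Participated', lambda s: (s == 'yes').sum()),
    Pass=('Passed', lambda s: (s == 'yes').sum()),
    test=('testValue', 'mean'),
)
print(out)
```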
    

    I hope someone else will find this useful.
