Frequency tables in pandas (like plyr in R)


不思量自难忘° 2020-12-28 18:53

My problem is how to calculate frequencies on multiple variables in pandas. I start from this dataframe:

d1 = pd.DataFrame( {'StudentID': ["x1", "x10


      
      
        
4 answers

感情败类 (original poster) 2020-12-28 19:41
              

            
            
                        
I finally decided to use apply.

I am posting what I came up with hoping that it can be useful for others. 

From what I understand from Wes McKinney's book "Python for Data Analysis":


apply is more flexible than agg and transform because the function you define receives each group as a whole and can return a result of any shape.
The only requirement is that the function returns a pandas object or a scalar value.
The inner mechanics: the function is called on each piece of the grouped object, and the results are glued together using pandas.concat.
You need to "hard-code" the structure you want at the end.
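
The mechanics described above can be sketched by hand. This is only an illustration on made-up rows (the frame `df`, the helper `summarize`, and its column names are assumptions, not the question's data), and pandas' internal code path is more involved than this:

```python
import pandas as pd

# Hypothetical data: the question's original frame is truncated, so these rows are made up.
df = pd.DataFrame({
    "ExamenYear": ["2007", "2007", "2008", "2008"],
    "Passed": ["yes", "no", "yes", "yes"],
})

def summarize(group):
    # Returning a Series means each group becomes one row of the final result.
    return pd.Series({"All": len(group), "Pass": (group["Passed"] == "yes").sum()})

# What apply does, spelled out: call the function on each piece of the grouped object...
pieces = {year: summarize(group) for year, group in df.groupby("ExamenYear")}
# ...then glue the per-group Series together with pandas.concat (one row per group key).
manual = pd.concat(pieces, axis=1).T
manual.index.name = "ExamenYear"

# The same result obtained through apply.
via_apply = df.groupby("ExamenYear")[["Passed"]].apply(summarize)
```

Both `manual` and `via_apply` hold one row per year with the columns `All` and `Pass`, which is the "hard-coded" structure chosen inside `summarize`.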


Here is what I came up with 

def ZahlOccurence_0(x):
    # x is the sub-DataFrame for one ExamenYear; return one row of counts
    return pd.Series({'All': len(x['StudentID']),
                      'Part': (x['Participated'] == 'yes').sum(),
                      'Pass': (x['Passed'] == 'yes').sum()})


When I run it:

d1.groupby('ExamenYear').apply(ZahlOccurence_0)


I get the correct results:

            All  Part  Pass
ExamenYear                 
2007          3     2     2
2008          4     3     3
2009          3     3     2
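
Recent pandas releases warn when apply passes the grouping column to the function. One way to sidestep this, shown here as a sketch on made-up rows (the frame `d`, the helper `count_by_year`, and the student IDs are assumptions, not the question's data), is to select the needed columns before calling apply:

```python
import pandas as pd

# Made-up rows for illustration only; they are not the question's data.
d = pd.DataFrame({
    "StudentID": ["a1", "a2", "a3", "a4", "a5"],
    "ExamenYear": ["2007", "2007", "2008", "2008", "2008"],
    "Participated": ["yes", "no", "yes", "yes", "no"],
    "Passed": ["yes", "no", "no", "yes", "no"],
})

def count_by_year(x):
    # x holds only the selected columns for one year.
    return pd.Series({
        "All": len(x),
        "Part": (x["Participated"] == "yes").sum(),
        "Pass": (x["Passed"] == "yes").sum(),
    })

# Selecting columns explicitly keeps the grouping column out of the function's input.
cols = ["StudentID", "Participated", "Passed"]
counts = d.groupby("ExamenYear")[cols].apply(count_by_year)
print(counts)
```

The function never reads `ExamenYear`, so leaving it out of the selection does not change the counts.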


This approach also lets me combine frequencies with other statistics:

import numpy as np
d1['testValue'] = np.random.randn(len(d1))

def ZahlOccurence_1(x):
    return pd.Series({'All': len(x['StudentID']),
                      'Part': (x['Participated'] == 'yes').sum(),
                      'Pass': (x['Passed'] == 'yes').sum(),
                      'test': x['testValue'].mean()})


d1.groupby('ExamenYear').apply(ZahlOccurence_1)


            All  Part  Pass      test
ExamenYear                           
2007          3     2     2  0.358702
2008          4     3     3  1.004504
2009          3     3     2  0.521511
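
For comparison, the same kind of table can probably be built without apply by turning the yes/no columns into boolean flags and using named aggregation. This is a sketch under assumptions: the rows are made up, the random column is seeded only so that reruns match, and it needs a pandas version that supports named aggregation:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the question's own frame is truncated in the source.
d = pd.DataFrame({
    "StudentID": ["a1", "a2", "a3", "a4", "a5"],
    "ExamenYear": ["2007", "2007", "2008", "2008", "2008"],
    "Participated": ["yes", "no", "yes", "yes", "no"],
    "Passed": ["yes", "no", "no", "yes", "no"],
})
rng = np.random.default_rng(0)  # seeded so the random column is repeatable
d["testValue"] = rng.standard_normal(len(d))

summary = (
    # Boolean flags sum to the number of "yes" rows in each group.
    d.assign(Part=d["Participated"].eq("yes"), Pass=d["Passed"].eq("yes"))
     .groupby("ExamenYear")
     .agg(
         All=("StudentID", "size"),    # rows per year
         Part=("Part", "sum"),         # participants per year
         Pass=("Pass", "sum"),         # passes per year
         test=("testValue", "mean"),   # mean of the extra statistic
     )
)
print(summary)
```

Each output column is computed separately here, so the counts are not forced into the same dtype as the mean.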


I hope someone else will find this useful.
    
             
                                                        
            
            
              
                