pandas: complex filter on rows of DataFrame


I would like to filter rows by a function of each row, e.g.

def f(row):
    return np.sin(row['velocity']) / np.prod(row['masses']) > 5

df = pandas.DataFrame(.


                      
6 answers

Answered by 借酒劲吻你, 2020-12-04 15:29

Rather than passing reduce=True to avoid errors on an empty DataFrame (the argument is deprecated anyway), the best approach I've found is to check that the DataFrame is non-empty before applying the filter:

def my_filter(row):
    # `something` is a placeholder for the value you want to match
    return row.columnA == something

if len(df.index) > 0:
    # keep the filtered result instead of discarding it
    df = df[df.apply(my_filter, axis=1)]
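
As a rough sketch, the same guard can be wrapped in a small reusable helper. The name filter_rows, the sample columns, and the value 1 are placeholders made up for illustration, not part of the answer:

```python
import pandas as pd

def filter_rows(df, predicate):
    """Return the rows of df for which predicate(row) is truthy."""
    # Guard: skip apply() entirely when there are no rows to test.
    if len(df.index) == 0:
        return df.copy()
    mask = df.apply(predicate, axis=1)
    return df[mask]

# Hypothetical sample data
df = pd.DataFrame({"columnA": [1, 2, 1], "columnB": ["x", "y", "z"]})
kept = filter_rows(df, lambda row: row.columnA == 1)

# An empty frame passes through without calling the predicate at all.
empty = filter_rows(
    pd.DataFrame(columns=["columnA", "columnB"]),
    lambda row: row.columnA == 1,
)
```

Returning a copy in the empty case keeps the helper's result independent of the input, matching what boolean indexing does for non-empty frames.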

                                                                        
                                                        
            
            
              
                
Answered by 情歌与酒, 2020-12-04 15:40

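Specify reduce=True to handle empty DataFrames as well. Note that the reduce argument was deprecated in pandas 0.23 and removed in 1.0, so this works only on older versions.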

import pandas as pd

t = pd.DataFrame(columns=['a', 'b'])
t[t.apply(lambda x: x['a'] > 1, axis=1, reduce=True)]


https://crosscompute.com/n/jAbsB6OIm6oCCJX9PBIbY5FECFKCClyV/-/apply-custom-filter-on-rows-of-dataframe
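
On pandas versions where reduce is not accepted, one alternative is to build the mask explicitly as a boolean Series, so that an empty DataFrame simply yields an empty mask. This is a sketch of a different technique from the one above, and the helper name row_mask is hypothetical:

```python
import pandas as pd

def row_mask(df, predicate):
    """Evaluate predicate on every row; return a boolean Series aligned with df."""
    # Hypothetical helper: iterrows() is slow on large frames, but it behaves
    # the same way on empty and non-empty input.
    values = [bool(predicate(row)) for _, row in df.iterrows()]
    return pd.Series(values, index=df.index, dtype=bool)

# Empty frame: the mask is an empty boolean Series, so indexing works.
t = pd.DataFrame(columns=['a', 'b'])
empty_result = t[row_mask(t, lambda x: x['a'] > 1)]

# Non-empty frame: only rows with a > 1 are kept.
u = pd.DataFrame({'a': [0, 2, 5], 'b': [1, 1, 1]})
kept = u[row_mask(u, lambda x: x['a'] > 1)]
```

Forcing dtype=bool is the important part: the mask has a boolean dtype even when it has no elements.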
                                                                        
                                                        
            
            
              
                
Answered by 伪装坚强ぢ, 2020-12-04 15:40

I cannot comment on duckworthd's answer, but it does not work in every case. It crashes when the DataFrame is empty:

df = pandas.DataFrame(columns=['a', 'b', 'c'])
df[df.apply(lambda x: x['b'] > x['c'], axis=1)]


Outputs: 

ValueError: Must pass DataFrame with boolean values only


To me this looks like a bug in pandas, since { } is definitely a valid set of boolean values. For a solution, refer to Roy Hyunjin Han's answer.
                                                                        
                                                        
            
            
              
                
Answered by 无人及你, 2020-12-04 15:41

You can do this using DataFrame.apply, which applies a function along a given axis:

In [3]: df = pandas.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])

In [4]: df
Out[4]: 
          a         b         c
0 -0.001968 -1.877945 -1.515674
1 -0.540628  0.793913 -0.983315
2 -1.313574  1.946410  0.826350
3  0.015763 -0.267860 -2.228350
4  0.563111  1.195459  0.343168

In [6]: df[df.apply(lambda x: x['b'] > x['c'], axis=1)]
Out[6]: 
          a         b         c
1 -0.540628  0.793913 -0.983315
2 -1.313574  1.946410  0.826350
3  0.015763 -0.267860 -2.228350
4  0.563111  1.195459  0.343168
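
For a condition this simple, the same rows can also be selected by comparing whole columns, which avoids calling a Python function once per row. A small sketch using the values printed above (the column-wise form is an addition for comparison, not part of the answer):

```python
import pandas as pd

# Same values as the random frame shown in the session above
df = pd.DataFrame(
    {
        'a': [-0.001968, -0.540628, -1.313574, 0.015763, 0.563111],
        'b': [-1.877945, 0.793913, 1.946410, -0.267860, 1.195459],
        'c': [-1.515674, -0.983315, 0.826350, -2.228350, 0.343168],
    }
)

# Row-wise: the lambda receives one row (a Series) at a time.
by_apply = df[df.apply(lambda x: x['b'] > x['c'], axis=1)]

# Column-wise: the comparison runs on whole columns at once.
by_columns = df[df['b'] > df['c']]
```

Row-wise apply remains useful when the condition cannot be written in terms of column operations.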

                                                                        
                                                        
            
            
              
                
Answered by 夕颜, 2020-12-04 15:46

Suppose I had a DataFrame as follows:

In [39]: df
Out[39]: 
      mass1     mass2  velocity
0  1.461711 -0.404452  0.722502
1 -2.169377  1.131037  0.232047
2  0.009450 -0.868753  0.598470
3  0.602463  0.299249  0.474564
4 -0.675339 -0.816702  0.799289


I can use sin and DataFrame.prod to create a boolean mask:

In [40]: mask = (np.sin(df.velocity) / df.iloc[:, 0:2].prod(axis=1)) > 0

In [41]: mask
Out[41]: 
0    False
1    False
2    False
3     True
4     True


Then use the mask to select from the DataFrame:

In [42]: df[mask]
Out[42]: 
      mass1     mass2  velocity
3  0.602463  0.299249  0.474564
4 -0.675339 -0.816702  0.799289
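
The session above can be reproduced as a standalone script along these lines. Selecting the mass columns by name rather than by position is a choice made here for readability, and the values are copied from the printed frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'mass1': [1.461711, -2.169377, 0.009450, 0.602463, -0.675339],
        'mass2': [-0.404452, 1.131037, -0.868753, 0.299249, -0.816702],
        'velocity': [0.722502, 0.232047, 0.598470, 0.474564, 0.799289],
    }
)

# Product of the two mass columns, computed per row.
mass_product = df[['mass1', 'mass2']].prod(axis=1)

# Vectorized condition: one boolean per row, no Python-level loop.
mask = (np.sin(df['velocity']) / mass_product) > 0

selected = df[mask]
```

Because every step operates on whole columns, this also works on an empty DataFrame without special handling.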

                                                                        
                                                        
            
            
              
                
Answered by 滥情空心, 2020-12-04 15:48

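You can use the loc property to slice your DataFrame.

According to the documentation, loc accepts a callable as its argument. The callable is called with the DataFrame itself and must return a valid indexer, such as a boolean Series.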

In [3]: df = pandas.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])

In [4]: df
Out[4]: 
          a         b         c
0 -0.001968 -1.877945 -1.515674
1 -0.540628  0.793913 -0.983315
2 -1.313574  1.946410  0.826350
3  0.015763 -0.267860 -2.228350
4  0.563111  1.195459  0.343168

# define the filter as a lambda
In [5]: myfilter = lambda x: x['b'] > x['c']

# pass the lambda to loc
In [6]: df1 = df.loc[myfilter]


If you want to combine your filter function myfilter with other filter criteria:

df1 = df.loc[myfilter].loc[(df.b >= 0.5)]
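
A self-contained sketch of the callable form, using the values printed above. The function name b_exceeds_c and the second lambda are illustrative choices, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame(
    {
        'a': [-0.001968, -0.540628, -1.313574, 0.015763, 0.563111],
        'b': [-1.877945, 0.793913, 1.946410, -0.267860, 1.195459],
        'c': [-1.515674, -0.983315, 0.826350, -2.228350, 0.343168],
    }
)

def b_exceeds_c(frame):
    # loc passes the whole DataFrame, so this compares two columns.
    return frame['b'] > frame['c']

df1 = df.loc[b_exceeds_c]

# Chained: the second callable receives the already-filtered frame,
# so its mask is aligned with that frame's index.
df2 = df.loc[b_exceeds_c].loc[lambda frame: frame['b'] >= 0.5]

# The same selection written as a single combined mask.
df3 = df[(df['b'] > df['c']) & (df['b'] >= 0.5)]
```

Using a callable for the second step avoids referring back to the original df inside a chain, which matters when the intermediate result has no name of its own.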

                                                                        
                                                        
            
            
              
                