pandas: complex filter on rows of DataFrame

没有蜡笔的小新 2020-12-04 15:22

I would like to filter rows by a function of each row, e.g.

def f(row):
  return np.sin(row['velocity'])/np.prod(row['masses']) > 5

df = pandas.DataFrame(.         


        
6 Answers
  • 2020-12-04 15:29

    The best approach I've found is to skip reduce=True (that argument is deprecated anyway) and simply check that the DataFrame is non-empty before applying the filter, which avoids the errors on an empty df:

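    def my_filter(row):
        # something is a placeholder for the value you are matching against
        return row.columnA == something

    # only apply the row-wise filter when there is at least one row
    if len(df.index) > 0:
        df = df[df.apply(my_filter, axis=1)]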
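The empty-check above can be wrapped in a small helper so callers do not have to repeat it. This is only a sketch: the name filter_rows and the sample columns columnA and columnB are made up for illustration and are not part of pandas.

```python
import pandas as pd


def filter_rows(df, predicate):
    """Return the rows of df for which predicate(row) is truthy.

    Empty frames are returned unchanged (as a copy), so apply() is
    never called on a frame without rows.
    """
    if df.empty:
        return df.copy()
    mask = df.apply(predicate, axis=1)  # one boolean per row
    return df[mask]


# Hypothetical sample data
df = pd.DataFrame({"columnA": [1, 2, 1], "columnB": ["x", "y", "z"]})
kept = filter_rows(df, lambda row: row.columnA == 1)

# An empty frame with the same columns passes straight through
empty = filter_rows(
    pd.DataFrame(columns=["columnA", "columnB"]),
    lambda row: row.columnA == 1,
)
```

Returning a copy for the empty case keeps the result independent of the input, which matches what boolean indexing does for non-empty frames.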
    
  • 2020-12-04 15:40

    Specify reduce=True to handle empty DataFrames as well.

    import pandas as pd
    
    t = pd.DataFrame(columns=['a', 'b'])
    t[t.apply(lambda x: x['a'] > 1, axis=1, reduce=True)]
    

    https://crosscompute.com/n/jAbsB6OIm6oCCJX9PBIbY5FECFKCClyV/-/apply-custom-filter-on-rows-of-dataframe
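As far as I know, the reduce argument was deprecated in pandas 0.23 and is no longer an option of DataFrame.apply in current releases. When the condition can be written as a column comparison, a different technique sidesteps the problem: a vectorized mask instead of a row-wise apply. The sketch below reuses the columns a and b from the snippet above; the non-empty frame t2 is made-up sample data.

```python
import pandas as pd

# Empty frame, same shape as in the answer above
t = pd.DataFrame(columns=["a", "b"])

# Comparing a column yields a boolean Series even when there are no rows,
# so boolean indexing works without any special-casing
result = t[t["a"] > 1]

# The same expression on a non-empty frame (hypothetical data)
t2 = pd.DataFrame({"a": [0, 2, 3], "b": [1, 1, 1]})
filtered = t2[t2["a"] > 1]
```

This only covers conditions that can be expressed on whole columns; for arbitrary per-row logic an explicit emptiness check is still needed.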

  • 2020-12-04 15:40

    I cannot comment on duckworthd's answer, but it does not work in every case. It crashes when the DataFrame is empty:

    import pandas

    df = pandas.DataFrame(columns=['a', 'b', 'c'])
    df[df.apply(lambda x: x['b'] > x['c'], axis=1)]
    

    Outputs:

    ValueError: Must pass DataFrame with boolean values only
    

    To me this looks like a bug in pandas, since { } is definitely a valid set of boolean values. For a solution, refer to Roy Hyunjin Han's answer.

  • 2020-12-04 15:41

    You can do this using DataFrame.apply, which applies a function along a given axis:

    In [3]: df = pandas.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])
    
    In [4]: df
    Out[4]: 
              a         b         c
    0 -0.001968 -1.877945 -1.515674
    1 -0.540628  0.793913 -0.983315
    2 -1.313574  1.946410  0.826350
    3  0.015763 -0.267860 -2.228350
    4  0.563111  1.195459  0.343168
    
    In [6]: df[df.apply(lambda x: x['b'] > x['c'], axis=1)]
    Out[6]: 
              a         b         c
    1 -0.540628  0.793913 -0.983315
    2 -1.313574  1.946410  0.826350
    3  0.015763 -0.267860 -2.228350
    4  0.563111  1.195459  0.343168
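The same pattern works for a function like the one in the question. Here is a minimal sketch, assuming the masses live in two columns named mass1 and mass2 (the question does not say how they are stored) and using made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the column names are assumptions
df = pd.DataFrame({
    "mass1": [1.0, 0.1, 2.0],
    "mass2": [1.0, 0.1, 0.5],
    "velocity": [1.0, 1.5, 0.5],
})


def f(row):
    # row is a Series holding one row; take the product of its mass entries
    masses = row[["mass1", "mass2"]]
    return np.sin(row["velocity"]) / masses.prod() > 5


# apply with axis=1 calls f once per row and returns a boolean Series
selected = df[df.apply(f, axis=1)]
```

Only the second row survives: sin(1.5) divided by 0.01 is roughly 99.7, while the other two ratios stay below 1.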
    
  • 2020-12-04 15:46

    Suppose I had a DataFrame as follows:

    In [39]: df
    Out[39]: 
          mass1     mass2  velocity
    0  1.461711 -0.404452  0.722502
    1 -2.169377  1.131037  0.232047
    2  0.009450 -0.868753  0.598470
    3  0.602463  0.299249  0.474564
    4 -0.675339 -0.816702  0.799289
    

    I can use sin and DataFrame.prod to create a boolean mask:

    In [40]: mask = (np.sin(df.velocity) / df.iloc[:, 0:2].prod(axis=1)) > 0
    
    In [41]: mask
    Out[41]: 
    0    False
    1    False
    2    False
    3     True
    4     True
    

    Then use the mask to select from the DataFrame:

    In [42]: df[mask]
    Out[42]: 
          mass1     mass2  velocity
    3  0.602463  0.299249  0.474564
    4 -0.675339 -0.816702  0.799289
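For anyone who wants to run this outside an interactive session, here is a self-contained sketch with the values from the frame above. It selects the mass columns by name rather than by position, which is a small deviation from the answer and does not depend on column order.

```python
import numpy as np
import pandas as pd

# Values copied from the DataFrame shown in the answer
df = pd.DataFrame({
    "mass1": [1.461711, -2.169377, 0.009450, 0.602463, -0.675339],
    "mass2": [-0.404452, 1.131037, -0.868753, 0.299249, -0.816702],
    "velocity": [0.722502, 0.232047, 0.598470, 0.474564, 0.799289],
})

# Row-wise product of the mass columns, computed for all rows at once
mass_product = df[["mass1", "mass2"]].prod(axis=1)

# Vectorized boolean mask: no Python-level loop over rows
mask = (np.sin(df["velocity"]) / mass_product) > 0

selected = df[mask]
```

Because every step operates on whole columns, this is usually much faster than apply with axis=1 on large frames.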
    
  • 2020-12-04 15:48

    You can use the loc property to slice your DataFrame.

    According to the documentation, loc accepts a callable as its argument.

    In [3]: df = pandas.DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])
    
    In [4]: df
    Out[4]: 
              a         b         c
    0 -0.001968 -1.877945 -1.515674
    1 -0.540628  0.793913 -0.983315
    2 -1.313574  1.946410  0.826350
    3  0.015763 -0.267860 -2.228350
    4  0.563111  1.195459  0.343168
    
    # define lambda function
    In [5]: fif = lambda x: x['b'] > x['c']
    
    # use my lambda in loc
    In [6]: df1 = df.loc[fif]
    

    If you want to combine your filter function fif with other filter criteria:

    df1 = df.loc[fif].loc[(df.b >= 0.5)]
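One thing worth spelling out: the callable passed to loc receives the whole DataFrame, not a single row, so it should return a boolean Series. Below is a minimal sketch with made-up data; the function name b_exceeds_c is hypothetical. Combining criteria with & inside one callable is an alternative to chaining two loc calls.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.2, 0.9, 1.5, 0.7],
    "c": [0.5, 0.1, 2.0, 0.3],
})


def b_exceeds_c(frame):
    # frame is the full DataFrame; the comparison is column-wise
    return frame["b"] > frame["c"]


# loc calls the function with df and uses the returned mask
df1 = df.loc[b_exceeds_c]

# Several criteria in one mask: both conditions are evaluated on the same frame
df2 = df.loc[lambda d: b_exceeds_c(d) & (d["b"] >= 0.8)]
```

Building a single mask avoids relying on index alignment between the already-filtered frame and a mask computed from the original one.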
    