How to select rows from, or add a column to, a pandas DataFrame based on a function of other columns?

Question

I have a DataFrame and I want to select the rows that match some criterion. The criterion is a function of the values in other columns and some additional values.

Here is a toy example:

>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9],
...                    'B': [randint(1,9) for x in range(9)],
...                    'C': [4,10,3,5,4,5,3,7,1]})
>>> df

      A  B   C
   0  1  6   4
   1  2  8  10
   2  3  8   3
   3  4  4   5
   4  5  2   4
   5  6  1   5
   6  7  1   3
   7  8  2   7
   8  9  8   1

I want to select all rows for which some function returns True, e.g. f(a,c,L) returns True iff the product AxC is in the specified list L, say L=[4,20,30] (though the function could be a less trivial one). That is, I want to get:

>>
      A  B   C
   0  1  6   4
   1  2  8  10
   3  4  4   5
   4  5  2   4
   5  6  1   5

Similarly, I'd like to add a fourth, boolean column 'matched' that is True if AxC is in L:

      A  B   C  matched
   0  1  2   4    True
   1  2  5  10    True
   2  3  6   3   False
   3  4  3   5    True
   4  5  2   4    True
   5  6  6   5    True
   6  7  4   3   False
   7  8  5   7   False
   8  9  2   1   False

(Once this column is added, selecting the rows where it is True is easy, and I suspect that whatever adds the column can also do the selection.)

Is there an efficient and elegant way to do it without explicitly iterating over all indices? Thanks!

Answer 1:

A vectorised solution using isin:

In [5]:

L=[4,20,30]
df['Match'] = (df['A']*df['C']).isin(L)
df
Out[5]:
   A  B   C  Match
0  1  6   4   True
1  2  1  10   True
2  3  8   3  False
3  4  4   5   True
4  5  2   4   True
5  6  4   5   True
6  7  4   3  False
7  8  7   7  False
8  9  4   1  False
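The same boolean Series can also serve directly as a row filter, which covers the selection half of the question. The following is a minimal sketch rather than part of the original answer. Column B is random in the question and plays no role in the test, so the fixed B values used here are only an assumption for reproducibility, and the names mask and selected are illustrative.

```python
import pandas as pd

# B is random in the question; fixed values are assumed here so the
# example is reproducible. B does not affect the result.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [6, 8, 8, 4, 2, 1, 1, 2, 8],
                   'C': [4, 10, 3, 5, 4, 5, 3, 7, 1]})
L = [4, 20, 30]

# Element-wise product of the two columns, then a membership test against L.
mask = (df['A'] * df['C']).isin(L)

# Indexing with a boolean Series keeps only the rows where it is True.
selected = df[mask]
print(selected)
```

Because the mask is an ordinary Series, it can be stored as a column (as above) or used once for filtering without modifying df.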

Timings:

In [9]:

%%timeit
L=[4,20,30]
rowindex = df.apply(lambda x : True if (x['A'] * x['C']) in L else False, axis=1)
df.loc[rowindex,'match'] = True
df.loc[~rowindex,'match'] = False
100 loops, best of 3: 3.13 ms per loop
In [11]:

%%timeit 
L=[4,20,30]
df['Match'] = (df['A']*df['C']).isin(L)

1000 loops, best of 3: 678 µs per loop

Answer 2:

This returns a boolean Series that can be used as a row index:

L=[4,20,30]
df.apply(lambda x : True if (x['A'] * x['C']) in L else False, axis=1)

0     True
1     True
2    False
3     True
4     True
5     True
6    False
7    False
8    False

You can then use it to fill the new column:

rowindex = df.apply(lambda x : True if (x['A'] * x['C']) in L else False, axis=1)
df.loc[rowindex,'match'] = True
df.loc[~rowindex,'match'] = False
df

    A   B   C   match
0   1   7   4   True
1   2   3   10  True
2   3   9   3   False
3   4   5   5   True
4   5   9   4   True
5   6   2   5   True
6   7   2   3   False
7   8   7   7   False
8   9   6   1   False
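When the test is an arbitrary Python function that cannot be expressed with vectorised operations such as isin, one alternative to apply with axis=1 is a list comprehension over the zipped columns. This is a sketch under assumptions, not part of the original answers: f is a hypothetical stand-in for the "less trivial" function mentioned in the question, and the fixed B values are chosen only for reproducibility. This approach typically avoids the cost of building a Series for every row, though the actual speed-up depends on the data and the function.

```python
import pandas as pd

def f(a, c, allowed):
    """Hypothetical predicate: True iff a*c is one of the allowed values."""
    return a * c in allowed

# B is random in the question; fixed values are assumed here.
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [7, 3, 9, 5, 9, 2, 2, 7, 6],
                   'C': [4, 10, 3, 5, 4, 5, 3, 7, 1]})
L = [4, 20, 30]

# Evaluate f once per row, passing only the two columns it needs.
df['match'] = [f(a, c, L) for a, c in zip(df['A'], df['C'])]

# The new column doubles as a row filter.
selected = df[df['match']]
print(selected)
```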

Source: https://stackoverflow.com/questions/27911546/how-to-select-add-a-column-to-pandas-dataframe-based-on-a-function-of-other-colu
