Question
I have a data frame and I want to select the rows that match some criterion. The criterion is a function of the values of other columns and some additional values.
Here is a toy example:
>>import pandas as pd
>>from random import randint
>>df = pd.DataFrame({'A': [1,2,3,4,5,6,7,8,9],
                   'B': [randint(1,9) for x in range(9)],  # random, so B differs between runs
                   'C': [4,10,3,5,4,5,3,7,1]})
>>df
   A  B   C
0  1  6   4
1  2  8  10
2  3  8   3
3  4  4   5
4  5  2   4
5  6  1   5
6  7  1   3
7  8  2   7
8  9  8   1
I want to select all rows for which some function returns True, e.g. f(a, c, L) returns True iff the product A*C is in a specified list L, say L=[4,20,30] (though the function could be a less trivial one). That is, I want to get:
>>
   A  B   C
0  1  6   4
1  2  8  10
3  4  4   5
4  5  2   4
5  6  1   5
Similarly, I'd like to add a fourth, boolean column 'matched' that is True if A*C is in L:
   A  B   C  matched
0  1  2   4     True
1  2  5  10     True
2  3  6   3    False
3  4  3   5     True
4  5  2   4     True
5  6  6   5     True
6  7  4   3    False
7  8  5   7    False
8  9  2   1    False
(Once this column is added, it is easy to select all the rows where it is True, and I suspect that whatever lets you add the column also lets you select directly.)
Is there an efficient and elegant way to do this without explicitly iterating over all indices? Thanks!
Answer 1:
A vectorised solution using isin:
In [5]:
L=[4,20,30]
df['Match'] = (df['A']*df['C']).isin(L)
df
Out[5]:
   A  B   C  Match
0  1  6   4   True
1  2  1  10   True
2  3  8   3  False
3  4  4   5   True
4  5  2   4   True
5  6  4   5   True
6  7  4   3  False
7  8  7   7  False
8  9  4   1  False
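The question also asks for selecting the matching rows rather than only flagging them. A minimal sketch of that, assuming the same boolean mask is used directly as a row filter (the fixed B values are an assumption made only so the example is reproducible; the names `mask` and `selected` are illustrative):

```python
import pandas as pd

# B is random in the question; fixed values are assumed here for reproducibility
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [6, 8, 8, 4, 2, 1, 1, 2, 8],
                   'C': [4, 10, 3, 5, 4, 5, 3, 7, 1]})
L = [4, 20, 30]

# Boolean Series: True where the product A*C is one of the values in L
mask = (df['A'] * df['C']).isin(L)

# Boolean indexing keeps only the rows where the mask is True
selected = df[mask]
print(selected)
```

Using `~mask` instead of `mask` would give the non-matching rows.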
Timings:
In [9]:
%%timeit
L=[4,20,30]
rowindex = df.apply(lambda x : True if (x['A'] * x['C']) in L else False, axis=1)
df.loc[rowindex,'match'] = True
df.loc[~rowindex,'match'] = False
100 loops, best of 3: 3.13 ms per loop
In [11]:
%%timeit
L=[4,20,30]
df['Match'] = (df['A']*df['C']).isin(L)
1000 loops, best of 3: 678 µs per loop
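`%%timeit` is an IPython cell magic, so the comparison above only runs inside IPython or Jupyter. As a rough sketch, a similar comparison in a plain Python script could use the standard `timeit` module. The helper names `with_apply` and `with_isin` and the fixed B values are assumptions for illustration, and the absolute numbers will depend on the machine and pandas version:

```python
import timeit
import pandas as pd

# B is random in the question; fixed values are assumed here for reproducibility
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [6, 8, 8, 4, 2, 1, 1, 2, 8],
                   'C': [4, 10, 3, 5, 4, 5, 3, 7, 1]})
L = [4, 20, 30]

def with_apply():
    # Row-wise: calls the lambda once per row
    return df.apply(lambda x: (x['A'] * x['C']) in L, axis=1)

def with_isin():
    # Vectorised: one multiplication and one membership test over whole columns
    return (df['A'] * df['C']).isin(L)

# Sanity check that both approaches produce the same mask
same = with_apply().tolist() == with_isin().tolist()

runs = 100
t_apply = timeit.timeit(with_apply, number=runs)
t_isin = timeit.timeit(with_isin, number=runs)
print(f"apply: {t_apply / runs * 1e3:.3f} ms per call")
print(f"isin:  {t_isin / runs * 1e3:.3f} ms per call")
```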
Answer 2:
This returns a boolean index:
L=[4,20,30]
df.apply(lambda x : True if (x['A'] * x['C']) in L else False, axis=1)
0     True
1     True
2    False
3     True
4     True
5     True
6    False
7    False
8    False
You can then use it to fill the new column:
rowindex = df.apply(lambda x : True if (x['A'] * x['C']) in L else False, axis=1)
df.loc[rowindex,'match'] = True
df.loc[~rowindex,'match'] = False
df
   A  B   C  match
0  1  7   4   True
1  2  3  10   True
2  3  9   3  False
3  4  5   5   True
4  5  9   4   True
5  6  2   5   True
6  7  2   3  False
7  8  7   7  False
8  9  6   1  False
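The question notes that the function could be less trivial than a product lookup. When it cannot be expressed with vectorised operations, one possible alternative to `apply(..., axis=1)` is a list comprehension over the zipped columns, which avoids building a Series for every row. This is a sketch, not part of either answer: the predicate `f`, the name `allowed`, and the fixed B values are hypothetical.

```python
import pandas as pd

# B is random in the question; fixed values are assumed here for reproducibility
df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [7, 3, 9, 5, 9, 2, 2, 7, 6],
                   'C': [4, 10, 3, 5, 4, 5, 3, 7, 1]})
L = [4, 20, 30]

def f(a, c, allowed):
    """Hypothetical predicate: True iff a*c is one of the allowed values."""
    return a * c in allowed

# A set gives constant-time membership tests
allowed = set(L)

# Evaluate f once per row, pairing up the values of columns A and C
df['matched'] = [f(a, c, allowed) for a, c in zip(df['A'], df['C'])]

# The new boolean column can be used directly to select rows
selected = df[df['matched']]
print(selected)
```

Any Python function of the row values can be substituted for `f` here; if the logic can be written with column arithmetic, the `isin` approach is likely to remain faster.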
Source: https://stackoverflow.com/questions/27911546/how-to-select-add-a-column-to-pandas-dataframe-based-on-a-function-of-other-colu