Question
My initial DataFrame looks like this:
A | B
-----------------
'a' | ['1', 'a', 'b']
'1' | ['2', '5', '6']
'd' | ['a', 'b', 'd']
'y' | ['x', '1', 'y']
I want to check, row by row, whether the value in A is in the corresponding list in B (for the first row, whether 'a' is in ['1', 'a', 'b']).
I can do that with apply:
df.apply(lambda row: row['A'] in row['B'], axis=1)
That gives me the expected result:
[True, False, True, True]
However, on my real data (millions of rows) this is very slow. Is there a more efficient way to do the same thing, for example with NumPy element-wise operations or anything else?
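For reference, the example above can be reproduced with the following minimal sketch. It assumes the values in A are plain Python strings and the values in B are Python lists; the name result is only illustrative. Note that apply with axis=1 calls the lambda once per row in Python, which is why it slows down on millions of rows.

```python
import pandas as pd

# Sample data mirroring the table in the question
# (assumption: A holds strings, B holds lists of strings)
df = pd.DataFrame({
    "A": ["a", "1", "d", "y"],
    "B": [["1", "a", "b"], ["2", "5", "6"], ["a", "b", "d"], ["x", "1", "y"]],
})

# Row-wise membership test; the lambda runs once per row
result = df.apply(lambda row: row["A"] in row["B"], axis=1)
print(result.tolist())
```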
Answer 1:
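If you convert the values in each column to sets, you can use <= to test, row by row, whether the set built from A is a subset of the set built from B (here d is the DataFrame). A strict < gives the same result on this data, but it tests for a proper subset, so it returns False for a row whose list in B contains only the value in A.
a = d.A.apply(lambda x: {x})
b = d.B.apply(set)
a <= b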
0 True
1 False
2 True
3 True
dtype: bool
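A self-contained sketch of the set comparison, assuming the sample data from the question. It also shows where a strict < and <= differ: when the list in B contains nothing but the value in A. The names proper_subset, subset, edge, edge_a and edge_b are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data mirroring the table in the question
d = pd.DataFrame({
    "A": ["a", "1", "d", "y"],
    "B": [["1", "a", "b"], ["2", "5", "6"], ["a", "b", "d"], ["x", "1", "y"]],
})

# Wrap each value of A in a one-element set; turn each list in B into a set
a = d["A"].apply(lambda x: {x})
b = d["B"].apply(set)

# On object Series the comparison is applied element by element,
# so each row is a Python set comparison
proper_subset = a < b   # {x} is a proper subset of set(B)
subset = a <= b         # {x} is a subset of set(B), i.e. x is in B
print(subset.tolist())

# Edge case: the list in B holds only the value in A
edge = pd.DataFrame({"A": ["a"], "B": [["a"]]})
edge_a = edge["A"].apply(lambda x: {x})
edge_b = edge["B"].apply(set)
print((edge_a < edge_b).tolist(), (edge_a <= edge_b).tolist())
```

Both variants require the items in B to be hashable, since they are placed in sets.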
Otherwise, you can use a list comprehension with zip:
[a in b for a, b in zip(d.A.values.tolist(), d.B.values.tolist())]
[True, False, True, True]
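The list comprehension returns a plain Python list. If a boolean Series is needed, for instance to filter the rows, it can be wrapped as in this minimal sketch, which assumes the sample data from the question (the names mask and matches are illustrative):

```python
import pandas as pd

# Sample data mirroring the table in the question
d = pd.DataFrame({
    "A": ["a", "1", "d", "y"],
    "B": [["1", "a", "b"], ["2", "5", "6"], ["a", "b", "d"], ["x", "1", "y"]],
})

# Pair each value of A with its list in B and test membership in plain Python
flags = [a in b for a, b in zip(d["A"].tolist(), d["B"].tolist())]

# Reuse the DataFrame's index so the mask aligns with the rows
mask = pd.Series(flags, index=d.index)

# Keep only the rows where A occurs in B
matches = d[mask]
print(matches["A"].tolist())
```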
Timing on small data and on large data (results not preserved here).
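To compare the approaches on your own machine, a rough timing harness could look like the sketch below. The function names via_apply, via_sets and via_zip, the repetition factor of 1000 and the use of timeit are assumptions made for illustration; they are not the setup used in the original answer, and the numbers will vary with hardware and pandas version.

```python
import timeit

import pandas as pd

# Sample data mirroring the table in the question
base = pd.DataFrame({
    "A": ["a", "1", "d", "y"],
    "B": [["1", "a", "b"], ["2", "5", "6"], ["a", "b", "d"], ["x", "1", "y"]],
})

# Repeat the four sample rows to get a larger frame (4000 rows here)
big = pd.concat([base] * 1000, ignore_index=True)


def via_apply(frame):
    # Row-wise apply, as in the question
    return frame.apply(lambda row: row["A"] in row["B"], axis=1)


def via_sets(frame):
    # Element-wise subset comparison on Series of sets
    return frame["A"].apply(lambda x: {x}) <= frame["B"].apply(set)


def via_zip(frame):
    # Plain Python list comprehension over the paired columns
    return [a in b for a, b in zip(frame["A"].tolist(), frame["B"].tolist())]


for func in (via_apply, via_sets, via_zip):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f"{func.__name__}: {seconds:.4f} s for 3 runs")
```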
Source: https://stackoverflow.com/questions/43553523/pandas-efficient-way-to-check-if-a-value-in-column-a-is-in-a-list-of-values-in