Question
If I have a pandas dataframe that looks like this:
   Sequence  Rating
0  HYHIVQKF       1
1  YGEIFEKF       2
2  TYGGSWKF       3
3  YLESFYKF       4
4  YYNTAVKL       5
5  WPDVIHSF       6
This is the code that I am using to return the rows that match the following pattern:
\b.[YF]\w+[LFI]\b
pat = r'\b.[YF]\w+[LFI]\b'
new_df.Sequence.str.contains(pat)
new_df[new_df.Sequence.str.contains(pat)]
The above code is returning the rows that match the pattern, but what can I use to return the unmatched rows?
Expected Output:
   Sequence  Rating
1  YGEIFEKF       2
3  YLESFYKF       4
5  WPDVIHSF       6
Answer 1:
You can just do a negation of your existing Boolean series:
df[~df.Sequence.str.contains(pat)]
This will give you the desired output:
   Sequence  Rating
1  YGEIFEKF       2
3  YLESFYKF       4
5  WPDVIHSF       6
Brief explanation:
df.Sequence.str.contains(pat) returns a Boolean series:
0 True
1 False
2 True
3 False
4 True
5 False
Name: Sequence, dtype: bool
Negating it with ~ yields:
~df.Sequence.str.contains(pat)
0 False
1 True
2 False
3 True
4 False
5 True
Name: Sequence, dtype: bool
which is another Boolean series you can pass to your original dataframe.
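To make the explanation above reproducible, here is a minimal end-to-end sketch. It rebuilds the question's table by hand (the answer's df is the question's new_df) and stores the Boolean series in a variable named mask, which is a name chosen here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
new_df = pd.DataFrame({
    "Sequence": ["HYHIVQKF", "YGEIFEKF", "TYGGSWKF",
                 "YLESFYKF", "YYNTAVKL", "WPDVIHSF"],
    "Rating": [1, 2, 3, 4, 5, 6],
})

pat = r'\b.[YF]\w+[LFI]\b'

# True where the pattern is found somewhere in the string.
mask = new_df["Sequence"].str.contains(pat)

matched = new_df[mask]      # rows 0, 2, 4
unmatched = new_df[~mask]   # rows 1, 3, 5

print(unmatched)
```

Keeping the mask in a variable means the regex runs once and both the matched and unmatched rows can be selected from it.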
Answer 2:
You can use ~ for logical NOT:
pat = r'\b.[YF]\w+[LFI]\b'
new_df[~new_df.Sequence.str.contains(pat)]
#    Sequence  Rating
# 1  YGEIFEKF       2
# 3  YLESFYKF       4
# 5  WPDVIHSF       6
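One caveat worth sketching: the question's data has no missing values, but if the column did contain any, str.contains() can return a missing value for those rows, and applying ~ to such a mask may fail or misbehave depending on the pandas version. The example below uses hypothetical data (df_na is not from the original thread) and passes na=False so that missing entries count as "no match".

```python
import pandas as pd

# Hypothetical data: a few of the question's sequences plus one missing value.
df_na = pd.DataFrame({
    "Sequence": ["HYHIVQKF", "YGEIFEKF", None, "WPDVIHSF"],
    "Rating": [1, 2, 3, 4],
})

pat = r'\b.[YF]\w+[LFI]\b'

# na=False treats missing values as "no match", so the mask is purely Boolean.
mask = df_na["Sequence"].str.contains(pat, na=False)

# The negated mask therefore keeps the row whose Sequence is missing.
unmatched = df_na[~mask]

print(unmatched)
```

Whether a missing sequence should count as unmatched is a design choice; passing na=True instead would exclude it from the negated result.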
Answer 3:
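Psidom's answer is more elegant, but another way to solve this problem is to wrap the pattern in a negative lookahead assertion and use match() instead of contains(). Because match() only tests the pattern at the start of each string, the lookahead succeeds exactly when the original pattern does not match there: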
>>> pat = r'\b.[YF]\w+[LFI]\b'
>>> not_pat = r'(?!{})'.format(pat)
>>> new_df[new_df.Sequence.str.match(pat)]
   Sequence  Rating
0  HYHIVQKF       1
2  TYGGSWKF       3
4  YYNTAVKL       5
>>> new_df[new_df.Sequence.str.match(not_pat)]
   Sequence  Rating
1  YGEIFEKF       2
3  YLESFYKF       4
5  WPDVIHSF       6
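The choice of match() over contains() matters for the lookahead version. The sketch below, which rebuilds the question's data and uses the illustrative names by_match and by_contains, shows what happens if contains() is used with not_pat: the search is free to try every position, and the negative lookahead succeeds somewhere in every string.

```python
import pandas as pd

# Sample data copied from the question.
new_df = pd.DataFrame({
    "Sequence": ["HYHIVQKF", "YGEIFEKF", "TYGGSWKF",
                 "YLESFYKF", "YYNTAVKL", "WPDVIHSF"],
    "Rating": [1, 2, 3, 4, 5, 6],
})

pat = r'\b.[YF]\w+[LFI]\b'
not_pat = r'(?!{})'.format(pat)

# match() tests the lookahead at position 0 only, so it is True exactly
# for the rows where the original pattern does not match at the start.
by_match = new_df["Sequence"].str.match(not_pat)

# contains() searches all positions; the lookahead succeeds at some
# position in every row, so nothing is filtered out.
by_contains = new_df["Sequence"].str.contains(not_pat)

print(by_match.tolist())     # [False, True, False, True, False, True]
print(by_contains.tolist())  # [True, True, True, True, True, True]
```

For this reason the ~ approach from the other answers is usually the simpler one to reason about.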
Source: https://stackoverflow.com/questions/45557050/return-the-unmatched-rows-from-the-regex-pattern