Return the unmatched rows from the regex pattern

Question

If I have a pandas dataframe that looks like this:

      Sequence     Rating
 0    HYHIVQKF     1
 1    YGEIFEKF     2
 2    TYGGSWKF     3
 3    YLESFYKF     4
 4    YYNTAVKL     5
 5    WPDVIHSF     6

This is the code that I am using to return the rows that match the following pattern: \b.[YF]\w+[LFI]\b

pat = r'\b.[YF]\w+[LFI]\b'
new_df.Sequence.str.contains(pat)

new_df[new_df.Sequence.str.contains(pat)]

The code above returns the rows that match the pattern, but what can I use to return the unmatched rows instead?

Expected Output:

     Sequence  Rating
1    YGEIFEKF   2
3    YLESFYKF   4
5    WPDVIHSF   6

Answer 1:

You can simply negate your existing Boolean series:

new_df[~new_df.Sequence.str.contains(pat)]

This will give you the desired output:

   Sequence  Rating
1  YGEIFEKF       2
3  YLESFYKF       4
5  WPDVIHSF       6

Brief explanation:

new_df.Sequence.str.contains(pat)

will return a Boolean series:

0     True
1    False
2     True
3    False
4     True
5    False
Name: Sequence, dtype: bool

Negating it with ~ yields

~new_df.Sequence.str.contains(pat)

0    False
1     True
2    False
3     True
4    False
5     True
Name: Sequence, dtype: bool

which is another Boolean series that you can use to index your original dataframe.
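
For reference, here is a minimal, self-contained sketch of the steps above. The question does not show how the frame was created, so new_df is rebuilt by hand from the sample table, and the names mask, matched and unmatched are introduced only for illustration.

```python
import pandas as pd

# Sample data copied from the question (construction is an assumption).
new_df = pd.DataFrame({
    "Sequence": ["HYHIVQKF", "YGEIFEKF", "TYGGSWKF",
                 "YLESFYKF", "YYNTAVKL", "WPDVIHSF"],
    "Rating": [1, 2, 3, 4, 5, 6],
})

pat = r'\b.[YF]\w+[LFI]\b'

# Boolean mask: True where the pattern is found in the sequence.
mask = new_df.Sequence.str.contains(pat)

matched = new_df[mask]      # rows where the pattern matches
unmatched = new_df[~mask]   # rows where it does not

print(unmatched)
```

Storing the mask in a variable avoids running the regex twice when both the matched and the unmatched rows are needed.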

Answer 2:

You can use ~ for negation:

pat = r'\b.[YF]\w+[LFI]\b'
new_df[~new_df.Sequence.str.contains(pat)]

#   Sequence    Rating
#1  YGEIFEKF    2
#3  YLESFYKF    4
#5  WPDVIHSF    6
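
One caveat that may be worth keeping in mind: the sample data has no missing values, but if the column can contain them, the mask returned by contains() may hold missing entries and might not be usable with ~ as is. The sketch below uses a hypothetical series s with one None in it (not part of the question's data) and passes na=False so that missing values count as "no match".

```python
import pandas as pd

# Hypothetical data with a missing value; not taken from the question.
s = pd.Series(["HYHIVQKF", None, "YGEIFEKF"], dtype=object)

pat = r'\b.[YF]\w+[LFI]\b'

# na=False fills the result for missing values with False,
# so the mask contains only True/False and can be negated safely.
mask = s.str.contains(pat, na=False)

# The missing entry counts as unmatched here.
unmatched = s[~mask]

print(unmatched)
```

If rows with missing values should be left out of the unmatched result instead, na=True would treat them as matches, and ~ would then drop them.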

Answer 3:

Psidom's answer (negating the mask with ~) is more elegant, but another way to solve this problem is to wrap the regex pattern in a negative lookahead assertion, and then use match() instead of contains():

pat = r'\b.[YF]\w+[LFI]\b'
not_pat = r'(?!{})'.format(pat)

>>> new_df[new_df.Sequence.str.match(pat)]
   Sequence  Rating
0  HYHIVQKF       1
2  TYGGSWKF       3
4  YYNTAVKL       5

>>> new_df[new_df.Sequence.str.match(not_pat)]
   Sequence  Rating
1  YGEIFEKF       2
3  YLESFYKF       4
5  WPDVIHSF       6
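
The following sketch checks that the lookahead version selects the same rows as the ~ negation, and shows why match() matters here. The frame is rebuilt by hand from the question's table, and the Sequence column is given object dtype on the assumption that this keeps pandas on Python's re engine (not every regex engine supports lookaheads). The names via_lookahead, via_negation and lookahead_with_contains are illustrative only.

```python
import pandas as pd

# Sample data copied from the question; object dtype is an assumption
# made so that the standard-library regex engine is used.
new_df = pd.DataFrame({
    "Sequence": pd.Series(
        ["HYHIVQKF", "YGEIFEKF", "TYGGSWKF",
         "YLESFYKF", "YYNTAVKL", "WPDVIHSF"],
        dtype=object,
    ),
    "Rating": [1, 2, 3, 4, 5, 6],
})

pat = r'\b.[YF]\w+[LFI]\b'
not_pat = r'(?!{})'.format(pat)

# match() anchors at the start of the string, so the lookahead is
# tested only at position 0.
via_lookahead = new_df[new_df.Sequence.str.match(not_pat)]
via_negation = new_df[~new_df.Sequence.str.contains(pat)]

# contains() searches every position; the lookahead succeeds somewhere
# inside each of these strings, so the mask no longer separates the rows.
lookahead_with_contains = new_df.Sequence.str.contains(not_pat)

print(via_lookahead.equals(via_negation))
print(lookahead_with_contains.tolist())
```

In other words, the lookahead trick depends on the test being anchored at the start of the string, which is one reason the ~ negation is usually the simpler choice.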

Source: https://stackoverflow.com/questions/45557050/return-the-unmatched-rows-from-the-regex-pattern

Tags

python

regex

python-3.x

pandas

series