Python regex to pick all elements that don't match pattern

我的梦境 posted on 2021-02-10 12:49:10

Question


I asked a similar question yesterday ("Keep elements with pattern in pandas series without converting them to list") and now I am faced with the opposite problem.

I have a pandas dataframe:

import pandas as pd
df = pd.DataFrame(["Air type:1, Space kind:2, water, wood", "berries, something at the start:4, Space blu:3, somethingelse"], columns = ['A'])

and I want to pick all elements that don't have a ":" in them. What I tried is the following regex which seems to be working:

df['new'] = df.A.str.findall(r'(^|\s)([^:,]+)(,|$)')
    A                                                               new
0   Air type:1, Space kind:2, water, wood                           [( , water, ,), ( , wood, )]
1   berries, something at the start:4, Space blu:3, somethingelse   [(, berries, ,), ( , somethingelse, )]

If I understand this correctly, findall searched for the 3 groups (the ones that I have in parentheses) and returned every match as a tuple, with all the tuples wrapped in a list. Is there a way to avoid this and return only the middle group? That is, for the first row: water, wood; for the second row: berries, somethingelse.

I also tried the opposite approach:

df.A.str.replace(r'[^\s,][^:,]+:[^:,]+', '', regex=True).str.replace(r'\s*,', '', regex=True)

which seems to be working close to what I want (only the commas between the patterns are missing) but I am wondering if I am missing something that would make my life easier.


Answer 1:


You may use this regex code:

>>> df['new'] = df.A.str.findall(r'(?:^|,)([^:,]+)(?=,|$)')
>>> print (df)
                                                   A                        new
0              Air type:1, Space kind:2, water, wood            [ water,  wood]
1  berries, something at the start:4, Space blu:3...  [berries,  somethingelse]

Regex details:

  • (?:^|,): Non-capturing group that matches the start of the string or a comma
  • ([^:,]+): Capture 1+ of any character that is not a : and not a ,
  • (?=,|$): Lookahead to assert that we have either a , or end of line ahead
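The pieces above can be tried outside pandas with the standard `re` module, since `Series.str.findall` applies `re.findall` to each element. This is a minimal sketch using the sample strings from the question. The helper name `pick_plain_items` and the extra `strip()` step (which drops the leading space visible in the output above) are illustrative additions, not part of the answer.

```python
import re

# Pattern from the answer: start-or-comma, then a run of characters that
# contains neither ':' nor ',', then a lookahead for a comma or the end.
PATTERN = re.compile(r'(?:^|,)([^:,]+)(?=,|$)')


def pick_plain_items(text):
    """Return the comma-separated items of `text` that contain no ':'.

    Hypothetical helper for illustration only. strip() removes the
    leading space that the capture group picks up after each comma.
    """
    return [item.strip() for item in PATTERN.findall(text)]


rows = [
    "Air type:1, Space kind:2, water, wood",
    "berries, something at the start:4, Space blu:3, somethingelse",
]

results = [pick_plain_items(row) for row in rows]
print(results)  # [['water', 'wood'], ['berries', 'somethingelse']]
```

In pandas, chaining `.str.join(', ')` after cleaning the lists should turn each one into a single string such as `water, wood`, which is the form the question asks for.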



Answer 2:


You can use the following regex, which uses non-capturing groups (?:...) for the leading and trailing parts:

df.A.str.findall(r'(?:^|\s)([^:,]{2,})(?:,|$)')

This returns the following output:

0               [water, wood]
1    [berries, somethingelse]
Name: A, dtype: object
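To see why the non-capturing groups matter, the question's pattern and this answer's pattern can be compared with plain `re.findall`. This is a small sketch on the first sample row only; the variable names are illustrative.

```python
import re

text = "Air type:1, Space kind:2, water, wood"

# Three capturing groups: findall returns one tuple per match,
# with one entry for each group.
with_groups = re.findall(r'(^|\s)([^:,]+)(,|$)', text)

# Outer groups made non-capturing: only the single remaining
# capturing group (the middle one) is returned, as plain strings.
middle_only = re.findall(r'(?:^|\s)([^:,]{2,})(?:,|$)', text)

print(with_groups)  # [(' ', 'water', ','), (' ', 'wood', '')]
print(middle_only)  # ['water', 'wood']
```

Because the leading whitespace is consumed by the non-capturing `(?:^|\s)` rather than by the capture group, the returned items carry no leading space.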


Source: https://stackoverflow.com/questions/64981401/python-regex-to-pick-all-elements-that-dont-match-pattern
