How to get all matches using str.contains in Python regex?

Submitted by 自作多情 on 2021-02-05 08:10:55

Question


I have a data frame in which I need to find every row that matches any of a list of terms, and to list all the terms that match each row. My code is

import re
import pandas as pd

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45','fooz','bazzar','foo baz']
terms = ['foo','baz','foo baz']
# create df
df = pd.DataFrame({'Match_text': texts})
# create pattern: any of the terms as a whole word
pat = r'\b(?:{})\b'.format('|'.join(terms))
# use str.contains to keep only the matching rows
# (.copy() avoids a SettingWithCopyWarning when adding a column below)
df = df[df['Match_text'].str.contains(pat)].copy()

# compile the pattern
p = re.compile(pat)

# collect the matches found in each remaining row
results = [p.findall(text) for text in df.Match_text.tolist()]
df['results'] = results

The output is

    Match_text  results
0   foo abc     [foo]
3   baz 45      [baz]
6   foo baz     [foo, baz]

Here the term foo baz also matches row 6, along with foo and baz, but it is missing from the results. I need each row to list all of the terms that match it.


Answer 1:


Longer alternatives must come before shorter ones, so sort the keywords by length in descending order:

pat = r'\b(?:{})\b'.format('|'.join(sorted(terms,key=len,reverse=True)))

The resulting pattern is \b(?:foo baz|foo|baz)\b. At each position it tries foo baz first, then foo, then baz. Once foo baz is found, that match is returned and the search for the next match resumes at the end of it, so the foo and baz inside that match are not matched again.

See more on this in "Remember That The Regex Engine Is Eager".
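
The effect of the ordering can be checked with the standard re module alone. This is a minimal sketch using the terms from the question; the names unsorted_pat and sorted_pat are illustrative and not part of the original answer.

```python
import re

terms = ['foo', 'baz', 'foo baz']

# Alternatives in their original order: 'foo' succeeds before 'foo baz' is tried.
unsorted_pat = r'\b(?:{})\b'.format('|'.join(terms))

# Longest alternative first, as suggested in the answer.
sorted_pat = r'\b(?:{})\b'.format('|'.join(sorted(terms, key=len, reverse=True)))

text = 'foo baz'
unsorted_matches = re.findall(unsorted_pat, text)
sorted_matches = re.findall(sorted_pat, text)

print(unsorted_matches)  # ['foo', 'baz']
print(sorted_matches)    # ['foo baz']
```

Note that with the sorted pattern the row yields only foo baz, because findall returns non-overlapping matches.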




Answer 2:


Instead of using the regex pattern to collect the terms found in each row,

# compile the pattern
p = re.compile(pat)

# search for the pattern in the column
results = [p.findall(text) for text in df.Match_text.tolist()]

try a simple substring lookup of each term in the text, like this:

# check every term against every text in the column
results = [[term for term in terms if term in text] for text in df.Match_text.tolist()]

The output for the above looks like this:

    Match_text  results
0   foo abc     [foo]
3   baz 45      [baz]
6   foo baz     [foo, baz, foo baz]

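NOTE: This method checks every term against every text, so its running time grows with the number of texts multiplied by the number of terms. The in operator also tests plain substrings, so unlike the regex it does not respect the \b word boundaries.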
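
If every matching term is wanted and whole-word matching still matters, one option is to search for each term with its own word-boundary pattern. The following is a rough sketch under stated assumptions, not part of the original answer: it works on plain Python lists rather than the data frame, term_patterns and matching_terms are hypothetical names, and re.escape is added as a precaution for terms that contain regex metacharacters.

```python
import re

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'baz', 'foo baz']

# One compiled whole-word pattern per term (hypothetical helper mapping).
term_patterns = {
    term: re.compile(r'\b{}\b'.format(re.escape(term)))
    for term in terms
}

def matching_terms(text):
    """Return every term that occurs in text as a whole word or phrase."""
    return [term for term, pattern in term_patterns.items() if pattern.search(text)]

# Keep only the texts that match at least one term.
results = {text: matching_terms(text) for text in texts if matching_terms(text)}

for text, found in results.items():
    print(text, found)
```

Because each term is searched independently, overlapping terms such as foo baz, foo, and baz are all reported for the same row, while a text like foobar xyz is not reported for foo. Every term is still checked against every text, so the cost noted above applies here as well.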



Source: https://stackoverflow.com/questions/61072826/how-get-all-matches-using-str-contains-in-python-regex
