Question
I have a list of words and a string, and I would like to get back the words from the list that are found in the string.
Example:
import re
lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'
pattern = re.compile(r"(?=(\b" + r"\b|".join(map(re.escape, lof_terms)) + r"\b))")
found_terms = re.findall(pattern, str_content)
This returns only ['popular', 'car']. It fails to catch 'car manufacturer'. However, it does catch it if I change the list of terms to
lof_terms = ['car manufacturer', 'popular']
Somehow the overlap between 'car' and 'car manufacturer' seems to be the source of this issue.
Any ideas how to get over this?
Many thanks
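Answer 1:
Regex alternation is ordered: the engine tries the alternatives from left to right and stops at the first one that matches, so 'car' wins before 'car manufacturer' is ever tried. The current code can be fixed by first sorting lof_terms by length in descending order, so that longer terms come first: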
rx = r"(?=\b({})\b)".format("|".join(map(re.escape, sorted(lof_terms, key=len, reverse=True))))
pattern = re.compile(rx)
Note that in this case the \b word boundaries are used only once, on either side of the group; there is no need to repeat them around each alternative.
Full Python example:
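To see why the order of the alternatives matters, here is a minimal sketch that runs the same two terms in both orders against the string from the question. The variable names short_first and long_first are only illustrative.

```python
import re

text = 'This is a very popular car manufacturer.'

# Shorter alternative first: 'car' matches, so the longer phrase is never tried.
short_first = re.findall(r"\b(car|car manufacturer)\b", text)

# Longer alternative first: the full phrase gets the chance to match.
long_first = re.findall(r"\b(car manufacturer|car)\b", text)

print(short_first)  # ['car']
print(long_first)   # ['car manufacturer']
```

Sorting by length in descending order simply guarantees that any term which is a prefix of another term comes after it in the alternation.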
import re
lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'
rx = r"(?=\b({})\b)".format("|".join(map(re.escape, sorted(lof_terms, key=len, reverse=True))))
pattern = re.compile(rx)
found_terms = re.findall(pattern, str_content)
print(found_terms)
# => ['popular', 'car manufacturer']
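Note that the sorted pattern reports only the longest term at each position, so 'car' itself is not in the result. If every term from the list that occurs in the string is wanted, one possible approach is to search for each term separately. This is only a sketch: find_all_terms is a hypothetical helper name, and it assumes every term starts and ends with a word character so that \b behaves as expected.

```python
import re

def find_all_terms(terms, text):
    """Return every term that occurs in text as a whole word or phrase."""
    found = []
    for term in terms:
        # One search per term, so overlapping terms cannot shadow each other.
        if re.search(r"\b{}\b".format(re.escape(term)), text):
            found.append(term)
    return found

lof_terms = ['car', 'car manufacturer', 'popular']
str_content = 'This is a very popular car manufacturer.'
print(find_all_terms(lof_terms, str_content))
# ['car', 'car manufacturer', 'popular']
```

The result follows the order of the input list rather than the order of appearance in the string, and it costs one scan of the string per term, which is usually fine for short lists.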
Source: https://stackoverflow.com/questions/65290426/python-regex-matching-multiple-words-from-a-list