Matcher is returning some duplicate entries

喜你入骨 submitted on 2019-12-24 03:46:04

Question


I want the output to be ["good customer service","great ambience"], but I am getting ["good customer","good customer service","great ambience"]. The pattern also matches "good customer", which is only a fragment of the longer phrase and makes no sense on its own. How can I remove this kind of duplicate?

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("good customer service and great ambience")
matcher = Matcher(nlp.vocab)

# Pattern: one adjective followed by one or more nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN", "OP": "+"}]

# spaCy v2 signature; in spaCy v3 this becomes matcher.add("ADJ_NOUN_PATTERN", [pattern])
matcher.add("ADJ_NOUN_PATTERN", None, pattern)

# Each match is a (match_id, start, end) tuple of token indices
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])


Answer 1:


You can post-process the matches by grouping the tuples by their start index and keeping only the one with the largest end index in each group:

from itertools import groupby

#...

matches = matcher(doc)
# Group by start index (prop[1]); within each group keep the tuple with the largest end index (x[2])
results = [max(list(group), key=lambda x: x[2]) for key, group in groupby(matches, lambda prop: prop[1])]
print("Matches:", [doc[start:end].text for match_id, start, end in results])
# => Matches: ['good customer service', 'great ambience']

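groupby(matches, lambda prop: prop[1]) groups the matches by their start index, which here yields two groups: [(5488211386492616699, 0, 2), (5488211386492616699, 0, 3)] and [(5488211386492616699, 4, 6)]. max(list(group), key=lambda x: x[2]) then picks, from each group, the item whose end index (the third value) is the largest. Note that groupby only merges adjacent items, so this relies on matches that share a start index appearing next to each other; sort the matches by start index first if that is not guaranteed.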
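The grouping step can also be tried on the raw tuples alone, without loading a spaCy model. The following is a minimal sketch, not part of the original answer: the helper name longest_match_per_start is hypothetical, and the explicit sort is an added assumption to make the grouping independent of the order in which the matches arrive.

```python
from itertools import groupby
from operator import itemgetter


def longest_match_per_start(matches):
    """Keep, for each start index, only the match with the largest end index.

    `matches` is an iterable of (match_id, start, end) tuples, the shape
    returned by calling a spaCy Matcher on a Doc. The function name is
    hypothetical and chosen for illustration only.
    """
    # groupby only merges adjacent items, so sort by start index first.
    ordered = sorted(matches, key=itemgetter(1))
    return [
        max(group, key=itemgetter(2))
        for _, group in groupby(ordered, key=itemgetter(1))
    ]


# The (match_id, start, end) tuples quoted in the explanation above
matches = [
    (5488211386492616699, 0, 2),
    (5488211386492616699, 0, 3),
    (5488211386492616699, 4, 6),
]
results = longest_match_per_start(matches)
print(results)
# => [(5488211386492616699, 0, 3), (5488211386492616699, 4, 6)]
```

With the question's code, this would be called as results = longest_match_per_start(matcher(doc)). Like the answer's one-liner, it only collapses matches that share a start index; matches that overlap but start at different tokens are left untouched.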



Source: https://stackoverflow.com/questions/58815066/matcher-is-returning-some-duplicates-entry
