Question
spaCy has two features I'd like to combine: part-of-speech (POS) tagging and rule-based matching.
How can I combine them in a neat way?
For example, say the input is a single sentence and I'd like to verify that it meets some POS ordering condition, such as the verb coming after the noun (something like a noun**verb regex). The result should be true or false. Is that doable, or is the matcher limited to specific cases like the one in the example?
Can rule-based matching use POS rules?
If not, here is my current plan: gather everything into one string and apply a regex.
import spacy

nlp = spacy.load('en')
#doc = nlp(u'is there any way you can do it')
text = u'what are the main issues'
doc = nlp(text)
concatPos = ''
print(text)
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    # append one "text_TAG_POS-" chunk per token
    concatPos += word.text + "_" + word.tag_ + "_" + word.pos_ + "-"
print('-----------')
print(concatPos)
print('-----------')
# output string: what_WP_NOUN-are_VBP_VERB-the_DT_DET-main_JJ_ADJ-issues_NNS_NOUN-
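The "gather everything in one string and apply a regex" plan can be reduced to a true/false check. The following is a minimal sketch, not part of spaCy: `noun_before_verb` is a hypothetical helper, and the POS tags are hard-coded from the output above so that the snippet runs without a model. In practice they would presumably come from `[w.pos_ for w in doc]`.

```python
import re

def noun_before_verb(pos_tags):
    """Return True if some NOUN is followed, anywhere later, by a VERB."""
    # Wrap each tag in angle brackets so a tag can never match partially.
    pos_string = "".join("<{}>".format(tag) for tag in pos_tags)
    # <NOUN>, then any number of whole tags, then <VERB>.
    return re.search(r"<NOUN>(<[^<>]+>)*<VERB>", pos_string) is not None

# Tags copied from the question's printed output (an assumption for this sketch).
tags = ["NOUN", "VERB", "DET", "ADJ", "NOUN"]
result = noun_before_verb(tags)
print(result)
```

Bracketing each tag keeps the regex anchored to whole tokens, which a plain concatenation of tag names would not guarantee.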
Answer 1:
Sure, simply use the POS attribute.
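import spacy
from spacy.attrs import POS
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
# Match an adjective immediately followed by a noun.
# add_pattern is the spaCy 1.x call; later releases replaced it with matcher.add.
matcher.add_pattern("Adjective and noun", [{POS: 'ADJ'}, {POS: 'NOUN'}])
doc = nlp(u'what are the main issues')
# A non-empty result means the sentence contains the pattern.
matches = matcher(doc)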
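For readers on a recent release, here is a hedged sketch of the same adjective-noun pattern, assuming spaCy 3.x is installed. To avoid depending on a downloaded model, the `Doc` is built by hand with POS tags that are my own assumption, not tagger output. The names `matched_texts` and `has_match` are illustrative only.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-annotated Doc (assumption: these tags stand in for a tagger's output).
words = ["what", "are", "the", "main", "issues"]
pos = ["PRON", "AUX", "DET", "ADJ", "NOUN"]
doc = Doc(vocab, words=words, pos=pos)

matcher = Matcher(vocab)
# spaCy 3.x signature: add(key, list_of_patterns).
matcher.add("ADJ_NOUN", [[{"POS": "ADJ"}, {"POS": "NOUN"}]])

# Each match is a (match_id, start, end) triple of token offsets.
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
has_match = bool(matched_texts)
print(has_match, matched_texts)
```

With a real pipeline, `doc = nlp(text)` would replace the hand-built `Doc`, and `has_match` gives the true/false answer the question asks for.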
Answer 2:
Eyal Shulman's answer was helpful, but it requires hard-coding a Matcher pattern rather than writing an actual regular expression.
I wanted to use regular expressions, so I wrote my own solution:
import re

# Assumes `doc` is a parsed spaCy Doc and `start`/`end` are token offsets
# of a span inside the sentence of interest.
sentence = doc[start:end].sent
pattern = r'(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*'
## create a string with the POS of each token in the sentence
posString = ""
for w in sentence:
    posString += "<" + w.pos_ + ">"
lstVerb = []
for m in re.compile(pattern).finditer(posString):
    ## each m is a verb phrase match starting at character offset m.start()
    ## count the "<" in m to find how many tokens we want
    numTokensInGroup = m.group().count('<')
    ## then count the tokens that came before that group
    numTokensBeforeGroup = posString[:m.start()].count('<')
    verbPhrase = sentence[numTokensBeforeGroup:numTokensBeforeGroup+numTokensInGroup]
    lstVerb.append(verbPhrase)
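The step that maps a regex match back to token offsets can be tried without spaCy. This is a minimal sketch under stated assumptions: `pos_regex_spans` is a hypothetical helper, and the `(text, pos)` pairs are hand-written stand-ins for `(w.text, w.pos_)` from a real `Doc`.

```python
import re

def pos_regex_spans(tokens, pattern):
    """tokens: list of (text, pos) pairs. Returns one list of texts per match."""
    pos_string = "".join("<{}>".format(pos) for _, pos in tokens)
    spans = []
    for m in re.finditer(pattern, pos_string):
        if m.start() == m.end():
            continue  # ignore empty matches from fully optional patterns
        # Every token contributes exactly one "<", so counting "<" counts tokens.
        first = pos_string[:m.start()].count("<")
        length = m.group().count("<")
        spans.append([text for text, _ in tokens[first:first + length]])
    return spans

# Hand-written tags (an assumption; a tagger may label these words differently).
tokens = [("she", "PRON"), ("would", "VERB"), ("not", "ADV"),
          ("go", "VERB"), ("home", "NOUN")]
pattern = r"(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*"
spans = pos_regex_spans(tokens, pattern)
print(spans)
```

Because the pattern is written in terms of bracketed tags, every match begins on a token boundary, which is what makes the `<`-counting conversion safe.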
Source: https://stackoverflow.com/questions/42830248/how-to-write-spacy-matcher-of-pos-regex