Patterns with ENT_TYPE from manually labelled Span not working

Question

As an alternative way to accomplish what I asked about in "Patterns with multi-terms entries in the IN attribute", I wrote the following code to match phrases, label them, and then use those labels in EntityRuler patterns:

# %%
import spacy
from spacy.matcher import PhraseMatcher
from spacy.pipeline import EntityRuler
from spacy.tokens import Span

class PhraseRuler(object):
    name = 'phrase_ruler'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(term) for term in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for label, start, end in matches:
            span = Span(doc, start, end, label=label)
            spans.append(span)
        doc.ents = spans
        return doc

nlp = spacy.load("en_core_web_lg")

entity_matcher = PhraseRuler(nlp, ["Best Wishes", "Warm Welcome"], "GREETING")
nlp.add_pipe(entity_matcher, before="ner")


ruler = EntityRuler(nlp)
patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"ENT_TYPE": "GREETING"}]}]
ruler.add_patterns(patterns)
#ruler.to_disk("./data/patterns.jsonl")
nlp.add_pipe(ruler)

print(nlp.pipe_names) 

doc = nlp("Mary said Best Wishes and I said super Warm Welcome.")
print(doc.to_json())

Unfortunately, this does not work. The output does not contain my SUPER_GREETING entity:

'ents': [
   {'start': 0, 'end': 4, 'label': 'PERSON'}, 
   {'start': 10, 'end': 21, 'label': 'GREETING'}, 
   {'start': 39, 'end': 51, 'label': 'GREETING'}
]

What am I doing wrong? How do I fix it?

Answer 1:

You have the right idea, but the problem here is an intrinsic design choice in spaCy: any token can be part of only one named entity. So "Warm Welcome" can't be a "GREETING" and also part of a "SUPER_GREETING".
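To make the constraint concrete, here is a minimal plain-Python sketch (no spaCy needed) that checks whether two token spans share a token. The token offsets are assumed from the example sentence, with "." as its own token and end offsets exclusive; `spans_overlap` is a hypothetical helper for illustration, not a spaCy function.

```python
def spans_overlap(a, b):
    """Return True if two (start, end) token spans share at least one token.

    End offsets are exclusive, as in spaCy's Span.
    """
    return a[0] < b[1] and b[0] < a[1]


# Assumed token offsets in "Mary said Best Wishes and I said super Warm Welcome."
best_wishes = (2, 4)      # "Best Wishes"
greeting = (8, 10)        # "Warm Welcome"
super_greeting = (7, 10)  # "super Warm Welcome"

print(spans_overlap(greeting, super_greeting))     # True: both contain "Warm" and "Welcome"
print(spans_overlap(best_wishes, super_greeting))  # False: no shared token
```

Because the two spans share tokens, only one of them can end up in doc.ents.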

One way you could work around this is by using custom extensions. For instance, one solution would be to store the GREETING bit on the token level:

from spacy.tokens import Token
Token.set_extension("mylabel", default="")

And then we adjust the PhraseRuler.__call__ so that it doesn't write to doc.ents but instead does this:

for token in span:
    token._.mylabel = "MY_GREETING"

Now, we can rewrite the SUPER_GREETING pattern to:

patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"_": {"mylabel": "MY_GREETING"}, "OP": "+"}]}]

which will match "super" followed by one or more "MY_GREETING" tokens. It will match greedily and output "super Warm Welcome" as the hit.

Here's the resulting code snippet, starting from your code and making the adjustments described above:

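    # Written against the spaCy v2 API (add_pipe takes the component object).
    import spacy
    from spacy.matcher import PhraseMatcher
    from spacy.pipeline import EntityRuler
    from spacy.tokens import Span, Token

    # Custom token-level attribute that holds the greeting label.
    Token.set_extension("mylabel", default="")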

    class PhraseRuler(object):
        name = 'phrase_ruler'

        def __init__(self, nlp, terms, label):
            patterns = [nlp(term) for term in terms]
            self.matcher = PhraseMatcher(nlp.vocab)
            self.matcher.add(label, None, *patterns)

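        def __call__(self, doc):
            matches = self.matcher(doc)
            for label, start, end in matches:
                # Tag each matched token instead of writing to doc.ents,
                # so the tokens stay free for the EntityRuler to claim.
                span = Span(doc, start, end, label=label)
                for token in span:
                    token._.mylabel = "MY_GREETING"
            return doc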

    nlp = spacy.load("en_core_web_lg")

    entity_matcher = PhraseRuler(nlp, ["Best Wishes", "Warm Welcome"], "GREETING")
    nlp.add_pipe(entity_matcher, name="entity_matcher", before="ner")

    ruler = EntityRuler(nlp)
    patterns = [{"label": "SUPER_GREETING", "pattern": [{"LOWER": "super"}, {"_": {"mylabel": "MY_GREETING"}, "OP": "+"}]}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler, after="entity_matcher")

    print(nlp.pipe_names)

    doc = nlp("Mary said Best Wishes and I said super Warm Welcome.")
    print("TOKENS:")
    for token in doc:
        print(token.text, token._.mylabel)
    print()

    print("ENTITIES:")
    for ent in doc.ents:
        print(ent.text, ent.label_)

This outputs:

TOKENS:
Mary 
said 
Best MY_GREETING
Wishes MY_GREETING
and 
I 
said 
super 
Warm MY_GREETING
Welcome MY_GREETING
. 

ENTITIES:
Mary PERSON
super Warm Welcome SUPER_GREETING

This may not be exactly what you need, but I hope it helps you move forward with an alternative solution for your specific use case. If you do want the normal "GREETING" spans in the final doc.ents, you could reassemble them in post-processing after the EntityRuler has run, for example by moving the custom attributes to doc.ents where they don't overlap, or by keeping a cache of the spans somewhere.
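The post-processing idea could look roughly like the plain-Python sketch below. It works on lists rather than a real Doc, so it runs without spaCy. The token labels and entity offsets are copied from the output above; `label_runs` and `keep_non_overlapping` are hypothetical helper names, not part of spaCy.

```python
from itertools import groupby


def label_runs(token_labels):
    """Group consecutive tokens with the same non-empty label into
    (start, end, label) runs, with end exclusive."""
    runs = []
    index = 0
    for label, group in groupby(token_labels):
        length = len(list(group))
        if label:
            runs.append((index, index + length, label))
        index += length
    return runs


def keep_non_overlapping(candidates, existing):
    """Keep only the candidate runs that share no token with an existing entity."""
    kept = []
    for start, end, label in candidates:
        if all(end <= s or start >= e for s, e, _ in existing):
            kept.append((start, end, label))
    return kept


# One label per token, as printed under TOKENS above.
token_labels = ["", "", "MY_GREETING", "MY_GREETING", "", "", "", "",
                "MY_GREETING", "MY_GREETING", ""]
# Entities already in doc.ents, as token offsets.
existing_ents = [(0, 1, "PERSON"), (7, 10, "SUPER_GREETING")]

candidates = label_runs(token_labels)
print(candidates)  # [(2, 4, 'MY_GREETING'), (8, 10, 'MY_GREETING')]
print(keep_non_overlapping(candidates, existing_ents))  # [(2, 4, 'MY_GREETING')]
```

In a real pipeline, each kept run could then be turned into a Span(doc, start, end, label="GREETING") and appended to doc.ents. Note that this grouping merges two greeting phrases that sit directly next to each other into one run, so caching the original match offsets may be the safer option if that can happen in your data.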

Source: https://stackoverflow.com/questions/62019695/patterns-with-ent-type-from-manually-labelled-span-not-working

Tags

python

nlp

spacy