spacy rule-matcher extract value from matched sentence

Submitted by 点点圈 on 2019-12-11 05:37:02

Question


I have a custom rule-based matcher in spaCy, and I am able to match some sentences in a document. I would now like to extract numbers from the matched sentences. However, the matched sentences do not always have the same shape and form. What is the best way to do this?

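import spacy
from spacy.matcher import Matcher

# Assumption: the question does not name a model. Any English pipeline
# that provides lemmas and a dependency parse should work here.
nlp = spacy.load("en_core_web_sm")

# case 1:
texts = [
    "the surface is 31 sq",
    "the surface is sq 31",
    "the surface is square meters 31",
    "the surface is 31 square meters",
    "the surface is about 31,2 square",
    "the surface is 31 kilograms",
]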

pattern = [
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$"}, "OP": "+"},
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
]

pattern_1 = [
    {"LOWER": "surface"}, 
    {"LEMMA": "be", "OP": "?"},  
    {"IS_ALPHA": True, "OP": "?"},
    {"LIKE_NUM": True},
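    # Note: "OP" sits inside the TEXT predicate dict here rather than at the
    # token level, so it does not act as a quantifier for this token.
    {"TEXT" : {"REGEX": "^(?i:sq(?:uare)?|m(?:et(?:er|re)s?)?)$", "OP": "+"}}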
]

matcher = Matcher(nlp.vocab)

# spaCy v2 signature; in spaCy v3 this is matcher.add("Surface", [pattern, pattern_1])
matcher.add("Surface", None, pattern, pattern_1)

for index, text in enumerate(texts):
    print(f"Case {index}")
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
        print(match_id, string_id, start, end, span.text)

My output is:

Case 0
4898162435462687487 Surface 1 5 surface is 31 sq
Case 1
4898162435462687487 Surface 1 5 surface is sq 31
Case 2
4898162435462687487 Surface 1 6 surface is square meters 31
Case 3
4898162435462687487 Surface 1 5 surface is 31 square
Case 4
4898162435462687487 Surface 1 6 surface is about 31,2 square
Case 5

I would like to return the number (square meters) only. Something like [31, 31, 31, 31, 31.2] rather than the full text. What is the correct way to do this in spacy?


Answer 1:


Since each match contains exactly one token for which LIKE_NUM is true, you can walk the subtree of the matched span and return the first token whose like_num attribute is set:

value = [token for token in span.subtree if token.like_num][0]

Test:

results = []
for text in texts:
    doc = nlp(text)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]  # The matched span
        results.append([token for token in span.subtree if token.like_num][0])

print(results) # => [31, 31, 31, 31, 31,2]
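
The list above holds spaCy Token objects, not numbers, so "31,2" is still text. To get values like 31.2, as the question asks, the token text has to be converted. The following is a minimal sketch, not part of the original answer. The helper name to_number is hypothetical, and it assumes that a comma is a decimal separator (as in "31,2") rather than a thousands separator. It works on plain strings, so in the loop above you would pass token.text.

```python
def to_number(text):
    """Convert a numeric string such as "31" or "31,2" to a float.

    Assumption: a comma is a decimal separator, not a thousands separator.
    Returns None when the text is not a plain number (like_num also accepts
    number words such as "ten", which float() cannot parse).
    """
    try:
        return float(text.replace(",", "."))
    except ValueError:
        return None


# Token texts as they appear in the matched spans of the question.
raw_values = ["31", "31", "31", "31", "31,2"]
numbers = [to_number(value) for value in raw_values]
print(numbers)  # [31.0, 31.0, 31.0, 31.0, 31.2]
```

If the input may also contain thousands separators, this assumption no longer holds and a locale-aware parser would be a safer choice.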


Source: https://stackoverflow.com/questions/59070106/spacy-rule-matcher-extract-value-from-matched-sentence
