How to identify Abbreviations/Acronyms and expand them in spaCy?

Question

I have a large (~50k) term list, and a number of these key phrases/terms have corresponding acronyms/abbreviations. I need a fast way of finding either the abbreviation or its expanded form (e.g. MS -> Microsoft) and then replacing it with the expanded form plus the abbreviation (e.g. Microsoft -> Microsoft (MS), or MS -> Microsoft (MS)).

I am very new to spaCy, so my naive approach was to use spacy_lookup with both the abbreviation and its expanded form as keywords, and then use some kind of pipeline extension to go through the matches and replace each one with the expanded form plus the abbreviation.

Is there a better way of tagging and resolving acronyms/abbreviations in spaCy?

Answer 1:

Check out scispacy on GitHub, which implements the acronym identification heuristic described in this paper (see also here). The heuristic works if acronyms are "introduced" in the text, with the long form followed by the acronym in parentheses, as in:

StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!
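To make the "introduced in the text" pattern concrete, here is a minimal plain-Python sketch of this kind of heuristic. It is a simplified illustration and not scispacy's implementation: the function names, the regular expression for the parenthesised short form, and the candidate-window size are all assumptions made for the example. It looks for a short form in parentheses and scans the preceding words right to left for characters that spell it out.

```python
import re


def find_best_long_form(short, candidate):
    """Match the characters of `short` right to left inside `candidate`.

    Returns the shortest trailing run of whole words that covers every
    character of the short form, or None if they cannot all be matched.
    """
    l_index = len(candidate) - 1
    for s_index in range(len(short) - 1, -1, -1):
        ch = short[s_index].lower()
        if not ch.isalnum():
            continue
        # Walk left until the character matches; the first character of the
        # short form must additionally sit at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend back to the beginning of the word where matching stopped.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_abbreviations(text):
    """Return {short form: long form} for every 'long form (SHORT)' found."""
    pairs = {}
    # Assumed shape of a short form: 2 to 10 word characters in parentheses.
    for m in re.finditer(r"\(([A-Za-z][\w-]{1,9})\)", text):
        short = m.group(1)
        words = text[: m.start()].split()
        # Assumed window: only a few words before the parenthesis are considered.
        max_words = min(len(short) + 5, len(short) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = find_best_long_form(short, candidate)
        if long_form:
            pairs[short] = long_form
    return pairs


sample = "StackOverflow (SO) is a question and answer site. SO rocks!"
print(extract_abbreviations(sample))  # {'SO': 'StackOverflow'}
```

An acronym that never appears next to its long form in the text (for instance a bare "SO" with no earlier "StackOverflow (SO)") yields nothing here, which is the limitation the answer points out.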

A working way to replace all acronyms in a piece of text with their long form could then be:

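import spacy
# Importing AbbreviationDetector registers the "abbreviation_detector" factory.
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_web_sm")

# spaCy v3+: add the component by its registered name.
# On spaCy v2 with an older scispacy, use nlp.add_pipe(AbbreviationDetector(nlp)) instead.
nlp.add_pipe("abbreviation_detector")

text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"

def replace_acronyms(text):
    doc = nlp(text)
    altered_tok = [tok.text for tok in doc]
    # Each detected abbreviation is a Span carrying its long form.
    for abrv in doc._.abbreviations:
        altered_tok[abrv.start] = str(abrv._.long_form)
        # Blank out any remaining tokens of a multi-token abbreviation.
        for i in range(abrv.start + 1, abrv.end):
            altered_tok[i] = ""

    # Joining on spaces does not restore the original whitespace exactly.
    return " ".join(tok for tok in altered_tok if tok)

print(replace_acronyms(text))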
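The heuristic above only resolves acronyms that are defined somewhere in the text itself. For the case in the question, where a term list already pairs each abbreviation with its expanded form, a dictionary lookup may be enough. The following is a minimal plain-Python sketch under that assumption; `TERMS`, `build_normalizer` and `normalize` are hypothetical names, the two entries stand in for the real 50k-term list, and matching is case-sensitive on whole words only.

```python
import re

# Hypothetical term list (abbreviation -> expanded form).
TERMS = {"MS": "Microsoft", "SO": "StackOverflow"}


def build_normalizer(terms):
    """Return a function rewriting 'MS' or 'Microsoft' as 'Microsoft (MS)'."""
    canonical = {}
    for short, long_form in terms.items():
        label = f"{long_form} ({short})"
        # Map the label to itself so already-normalised text is left alone.
        for key in (short, long_form, label):
            canonical[key] = label
    # Longest alternatives first, so 'Microsoft (MS)' wins over 'Microsoft'.
    keys = sorted(canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(map(re.escape, keys)) + r")(?!\w)"
    )
    return lambda text: pattern.sub(lambda m: canonical[m.group(0)], text)


normalize = build_normalizer(TERMS)
print(normalize("MS bought GitHub. Microsoft also owns LinkedIn."))
# Microsoft (MS) bought GitHub. Microsoft (MS) also owns LinkedIn.
```

A single large regex alternation can become slow with tens of thousands of terms; feeding the same abbreviation and long-form strings to spaCy's `PhraseMatcher` and rebuilding the text from the matches is one alternative that may scale better. Ambiguous abbreviations (one short form with several expansions) are not handled by this sketch.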

Source: https://stackoverflow.com/questions/52570805/how-to-identify-abbreviations-acronyms-and-expand-them-in-spacy

Tags

python-3.x

nlp

spacy