How to identify Abbreviations/Acronyms and expand them in spaCy?

折月煮酒 提交于 2020-07-08 17:10:39

问题


I have a large (~50k) term list and a number of these key phrases / terms have corresponding acronyms / abbreviations. I need a fast way of finding either the abbreviation or the expanded abbreviation ( i.e. MS -> Microsoft ) and then replacing that with the full expanded abbreviation + abbreviation ( i.e. Microsoft -> Microsoft (MS) or MS -> Microsoft (MS) ).

I am very new to spaCy, so my naive approach was going to be to use spacy_lookup and use both the abbreviation and the expanded abbreviation as keywords and then using some kind of pipeline extension to then go through the matches and replace them with the full expanded abbreviation + abbreviation.

Is there a better way of tagging and resolving acronyms/abbreviations in spaCy?


回答1:


Check out scispacy on GitHub, which implements the acronym identification heuristic described in this paper, (see also here). The heuristic works if acronyms are "introduced" in the text with a pattern like

StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!

A working way to replace all acronyms in a piece of text with their long form could then be

import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_core_web_sm")

abbreviation_pipe = AbbreviationDetector(nlp)
nlp.add_pipe(abbreviation_pipe)

text = "StackOverflow (SO) is a question and answer site for professional and enthusiast programmers. SO rocks!"

def replace_acronyms(text):
    doc = nlp(text)
    altered_tok = [tok.text for tok in doc]
    for abrv in doc._.abbreviations:
        altered_tok[abrv.start] = str(abrv._.long_form)

    return(" ".join(altered_tok))

replace_acronyms(text)


来源:https://stackoverflow.com/questions/52570805/how-to-identify-abbreviations-acronyms-and-expand-them-in-spacy

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!