Stanford CoreNLP - split words ignoring apostrophe

执笔经年 2020-12-20 01:08

I'm trying to split a sentence into words using Stanford CoreNLP. I'm having a problem with words that contain an apostrophe.

For example, the sentence: I'm 24 years

3 Answers
  •  不知归路
    2020-12-20 01:15

    Currently, no. The subsequent Stanford CoreNLP processing tools all use Penn Treebank tokenization, which splits contractions into two tokens (regarding "I'm" as a reduced form of "I am" by making it the two "words" [I] ['m]). It sounds like you want a different type of tokenization.

    While there are some tokenization options, there isn't one to change this, and subsequent tools (like the POS tagger or parser) would work badly without contractions being split. You could add such an option to the tokenizer, changing (deleting) the treatment of REDAUX and SREDAUX trailing contexts.

    You can also join contractions via post-processing, as @dhg suggests, but you'd want to be a little more careful in the "if" condition so that it doesn't join on quotes.
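
    A minimal sketch of that kind of post-processing, assuming the tokens are already available as plain strings in Penn Treebank style. The class name ContractionJoiner, the method joinContractions, and the clitic list are hypothetical and not part of CoreNLP. Matching against an explicit list of clitics, rather than against any token that starts with an apostrophe, is what keeps quote tokens from being joined:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of CoreNLP: re-joins split contractions.
final class ContractionJoiner {

    // Assumed list of Penn Treebank-style clitic tokens; extend as needed.
    private static final Set<String> CLITICS = new HashSet<>(Arrays.asList(
            "'m", "'s", "'re", "'ve", "'d", "'ll", "n't"));

    // Appends each clitic token to the token before it; all other tokens,
    // including quote tokens such as ', `` and '', are left untouched.
    static List<String> joinContractions(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            boolean isClitic = CLITICS.contains(tok.toLowerCase(Locale.ROOT));
            if (isClitic && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + tok);
            } else {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I", "'m", "24", "years");
        System.out.println(joinContractions(tokens)); // prints [I'm, 24, years]
    }
}
```

    Note that this also re-attaches the possessive 's, which is usually what you want when the goal is whitespace-style words. The downstream CoreNLP tools still expect the split tokens, so apply this only to the output you display or export.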
