Stanford CoreNLP - split words ignoring apostrophe

执笔经年 2020-12-20 01:08

I'm trying to split a sentence into words using Stanford CoreNLP. I'm having a problem with words that contain an apostrophe.

For example, the sentence: I'm 24 years

3 Answers

天涯浪人 (original poster)

2020-12-20 01:16

How about if you just re-concatenate tokens that are split by an apostrophe?

Here's an implementation in Java:

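import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Wrapper class so the snippet compiles; the name is arbitrary.
public class ApostropheTokenizer {

    public static List<String> tokenize(String s) {
        PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(
                new StringReader(s), new CoreLabelTokenFactory(), "");
        List<String> sentence = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        while (ptbt.hasNext()) {
            String word = ptbt.next().word();
            if (word.startsWith("'")) {
                // Token such as 'm or 's: glue it onto the pending word.
                sb.append(word);
            } else {
                // A new word starts: flush the pending one first.
                if (sb.length() > 0) {
                    sentence.add(sb.toString());
                }
                sb = new StringBuilder(word);
            }
        }
        // Flush the last pending word.
        if (sb.length() > 0) {
            sentence.add(sb.toString());
        }
        return sentence;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I'm 24 years old."));  // [I'm, 24, years, old, .]
    }
}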
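
One thing to watch for: the check above only catches tokens that begin with an apostrophe. PTB-style tokenization splits negative contractions as "do" + "n't", and "n't" starts with "n", so "don't" would stay split. Below is a minimal sketch of the same re-concatenation idea that also handles that case. It works on an already-tokenized list, so it runs without CoreNLP; the class name `CliticMerger` and the method `mergeClitics` are made up for this illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CoreNLP.
public class CliticMerger {

    /** Glues tokens like 'm, 's, 're and n't back onto the preceding token. */
    public static List<String> mergeClitics(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        for (String token : tokens) {
            boolean isClitic = token.startsWith("'") || token.equalsIgnoreCase("n't");
            if (isClitic && !merged.isEmpty()) {
                // Append to the previous token instead of starting a new one.
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(mergeClitics(Arrays.asList("I", "'m", "24", "years", "old", ".")));
        // [I'm, 24, years, old, .]
        System.out.println(mergeClitics(Arrays.asList("do", "n't", "stop")));
        // [don't, stop]
    }
}
```

To use it with the tokenizer above, collect each `label.word()` into a list first and pass that list to `mergeClitics`. Note that either version will also attach a lone closing single quote to the word before it, which may or may not be what you want.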
