Stanford CoreNLP - split words ignoring apostrophe

执笔经年 2020-12-20 01:08

I'm trying to split a sentence into words using Stanford CoreNLP, but I'm having trouble with words that contain an apostrophe.

For example, the sentence: I'm 24 years

3 Answers
  •  天涯浪人
    2020-12-20 01:16

    How about if you just re-concatenate tokens that are split by an apostrophe?

    Here's an implementation in Java:

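    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.process.CoreLabelTokenFactory;
    import edu.stanford.nlp.process.PTBTokenizer;

    public class ApostropheTokenizer {  // class name is arbitrary

        public static List<String> tokenize(String s) {
            PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(
                    new StringReader(s), new CoreLabelTokenFactory(), "");
            List<String> sentence = new ArrayList<>();
            StringBuilder sb = new StringBuilder();
            while (ptbt.hasNext()) {
                String word = ptbt.next().word();
                if (word.startsWith("'")) {
                    // Clitic such as 'm, 's, 're: glue it onto the pending word.
                    sb.append(word);
                } else {
                    // A new word starts: flush the pending one first.
                    if (sb.length() > 0) {
                        sentence.add(sb.toString());
                    }
                    sb = new StringBuilder(word);
                }
            }
            // Flush the last pending word.
            if (sb.length() > 0) {
                sentence.add(sb.toString());
            }
            return sentence;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("I'm 24 years old."));  // [I'm, 24, years, old, .]
        }
    }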
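
    One gap in the approach above: Penn Treebank-style tokenization splits negative contractions as "does" + "n't", and "n't" does not start with an apostrophe, so it would stay separate. Below is a minimal sketch of the re-joining step on its own, working on a plain list of tokens so it runs without CoreNLP. The names ContractionJoiner and rejoin are made up for illustration, and the input lists are hand-written on the assumption that the tokenizer produced them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ContractionJoiner {

    // Hypothetical helper: glue clitic tokens ('m, 's, n't, ...) back onto
    // the token that precedes them.
    static List<String> rejoin(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String tok : tokens) {
            boolean clitic = tok.startsWith("'") || tok.equals("n't");
            if (clitic && !out.isEmpty()) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + tok);
            } else {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Token lists written by hand, assuming PTB-style splitting.
        System.out.println(rejoin(Arrays.asList("I", "'m", "24", "years", "old", ".")));
        // [I'm, 24, years, old, .]
        System.out.println(rejoin(Arrays.asList("She", "does", "n't", "know", ".")));
        // [She, doesn't, know, .]
    }
}
```

    Note that a closing single quote that arrives as its own "'" token would also be glued to the previous word, so this only suits text where apostrophes mark contractions or possessives.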
    
