I'm trying to split a sentence into words using Stanford CoreNLP. I'm having a problem with words that contain an apostrophe.
For example, the sentence: I'm 24 years
How about if you just re-concatenate tokens that are split by an apostrophe?
Here's an implementation in Java:
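import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// The class name is arbitrary; the two methods can live in any class.
public class TokenizeExample {

    public static List<String> tokenize(String s) {
        PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(
                new StringReader(s), new CoreLabelTokenFactory(), "");
        List<String> sentence = new ArrayList<>();
        // Accumulates the current word plus any apostrophe tokens that follow it.
        StringBuilder sb = new StringBuilder();
        while (ptbt.hasNext()) {
            CoreLabel label = ptbt.next();
            String word = label.word();
            if (word.startsWith("'")) {
                // Tokens such as 'm or 's are glued back onto the previous word.
                sb.append(word);
            } else {
                // A new word starts: flush the previous one first.
                if (sb.length() > 0) {
                    sentence.add(sb.toString());
                }
                sb = new StringBuilder();
                sb.append(word);
            }
        }
        // Flush the last word.
        if (sb.length() > 0) {
            sentence.add(sb.toString());
        }
        return sentence;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I'm 24 years old.")); // [I'm, 24, years, old, .]
    }
}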
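
If it helps to try the re-concatenation step on its own, here is a minimal sketch that works on a plain list of strings instead of CoreNLP labels, so it runs without the library. The class name `CliticMerger` and the helper `attachesToPrevious` are made up for illustration. It also glues `n't` back on, on the assumption that the tokenizer splits "don't" into `do` and `n't` (the usual Penn Treebank convention); the `startsWith("'")` check alone would not catch that case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class: merges apostrophe tokens back onto the previous token.
final class CliticMerger {
    private CliticMerger() {}

    // Assumption: tokens that start with an apostrophe ('m, 's, 're)
    // and the token n't belong to the word before them.
    static boolean attachesToPrevious(String token) {
        return token.startsWith("'") || token.equals("n't");
    }

    static List<String> merge(List<String> tokens) {
        List<String> merged = new ArrayList<>();
        for (String token : tokens) {
            if (!merged.isEmpty() && attachesToPrevious(token)) {
                // Append to the last emitted word instead of starting a new one.
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + token);
            } else {
                merged.add(token);
            }
        }
        return merged;
    }
}
```

For example, `CliticMerger.merge(List.of("do", "n't", "go"))` gives `[don't, go]`. One caveat: a closing single quote that comes out as its own `'` token would also be attached to the previous word, so you may want an extra check if your text contains quoted phrases.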