Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String?
Something like:
String to_
Even better, use try-with-resources. That way you don't have to call .close() explicitly, which newer versions of the library require.
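import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public static List<String> tokenizeString(Analyzer analyzer, String string) {
    List<String> tokens = new ArrayList<>();
    try (TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(string))) {
        tokenStream.reset(); // required before the first incrementToken()
        while (tokenStream.incrementToken()) {
            tokens.add(tokenStream.getAttribute(CharTermAttribute.class).toString());
        }
        tokenStream.end(); // finish the stream before it is closed
    } catch (IOException e) {
        throw new RuntimeException(e); // Shouldn't happen with a StringReader...
    }
    return tokens;
}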
And the Tokenizer version:
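import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

try (Tokenizer standardTokenizer = new HMMChineseTokenizer()) {
    standardTokenizer.setReader(new StringReader("我说汉语说得很好")); // "I speak Chinese very well"
    standardTokenizer.reset(); // required before the first incrementToken()
    while (standardTokenizer.incrementToken()) {
        // Print each token; the original line discarded the value and had a stray ')'
        System.out.println(standardTokenizer.getAttribute(CharTermAttribute.class).toString());
    }
} catch (IOException e) {
    throw new RuntimeException(e); // Shouldn't happen with a StringReader...
}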
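If you want to see what try-with-resources does for you without pulling in Lucene, here is a minimal sketch. FakeStream is a hypothetical stand-in I made up for illustration (it is not a Lucene class); it only mimics the reset / incrementToken / close lifecycle and records which lifecycle calls were made.

```java
import java.util.ArrayList;
import java.util.List;

public class TryWithResourcesDemo {

    // Hypothetical stand-in for a TokenStream (NOT part of Lucene):
    // it splits on whitespace and logs its lifecycle calls.
    static class FakeStream implements AutoCloseable {
        private final String[] words;
        private int next = -1; // -1 means reset() has not been called yet
        final List<String> log = new ArrayList<>();

        FakeStream(String text) {
            this.words = text.trim().split("\\s+");
        }

        void reset() {
            next = 0;
            log.add("reset");
        }

        boolean incrementToken() {
            if (next < 0) {
                throw new IllegalStateException("reset() must be called first");
            }
            return next++ < words.length;
        }

        // Returns the token that the last successful incrementToken() advanced to.
        String term() {
            return words[next - 1];
        }

        @Override
        public void close() {
            log.add("close");
        }
    }

    static List<String> tokenize(FakeStream stream) {
        List<String> tokens = new ArrayList<>();
        try (FakeStream s = stream) {
            s.reset();
            while (s.incrementToken()) {
                tokens.add(s.term());
            }
        } // s.close() runs here automatically, even if the loop throws
        return tokens;
    }

    public static void main(String[] args) {
        FakeStream stream = new FakeStream("hello lucene world");
        System.out.println(tokenize(stream)); // [hello, lucene, world]
        System.out.println(stream.log);       // [reset, close]
    }
}
```

The log shows close was called even though the code never calls it explicitly, which is the same guarantee the Lucene snippets above rely on.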