Java Lucene NGramTokenizer

梦毁少年i 2021-01-04 05:31

I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokeni

4 answers
  •  夕颜 (original poster)
    2021-01-04 06:18

    For a recent version of Lucene (4.2.1), the following code works cleanly. Before running it, add two JAR files to the classpath:

    • lucene-core-4.2.1.jar
    • lucene-analyzers-common-4.2.1.jar

    Find these files at http://www.apache.org/dyn/closer.cgi/lucene/java/4.2.1

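    // Lucene 4.2.1
    import java.io.Reader;
    import java.io.StringReader;
    import org.apache.lucene.analysis.ngram.NGramTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Place the statements below in a method that declares "throws IOException".
    Reader reader = new StringReader("This is a test string");
    // Emit character n-grams of length 1 to 3.
    NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
    // The attribute is updated in place with the text of the current token.
    CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);

    gramTokenizer.reset(); // TokenStream contract: call before incrementToken()
    while (gramTokenizer.incrementToken()) {
        String token = charTermAttribute.toString();
        System.out.println(token);
    }
    gramTokenizer.end();
    gramTokenizer.close();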
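
    To see what the minGram and maxGram arguments (1 and 3 above) mean without the Lucene jars, here is a minimal plain-Java sketch of character n-gram generation. The class CharNGrams and the method ngrams are hypothetical helpers for illustration, not part of Lucene, and the order in which they list the grams (shortest first) is an assumption; the real tokenizer may emit them in a different order depending on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
class CharNGrams {

    // Returns every substring of text whose length lies between
    // minGram and maxGram (inclusive), listing shorter grams first.
    // Whitespace is treated like any other character.
    public static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= text.length(); start++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Print one gram per line, mirroring the loop in the answer above.
        for (String gram : ngrams("test", 1, 3)) {
            System.out.println(gram);
        }
    }
}
```

    For the input "test" this yields four 1-grams, three 2-grams and two 3-grams. It is only a way to reason about the expected tokens; for indexing or analysis, the NGramTokenizer loop above is the one to use.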
    
