How to use a Lucene Analyzer to tokenize a String?

Asked by 抹茶落季 on 2020-12-04 16:38

Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String?

Something like:

String to_         


        
4 Answers
  •  Answered by 感情败类 on 2020-12-04 17:34

    Even better, use try-with-resources! That way you don't have to explicitly call .close(), which is required in newer versions of the library.

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public static List<String> tokenizeString(Analyzer analyzer, String string) {
      List<String> tokens = new ArrayList<>();
      try (TokenStream tokenStream = analyzer.tokenStream(null, new StringReader(string))) {
        tokenStream.reset();  // required before the first incrementToken() call
        while (tokenStream.incrementToken()) {
          tokens.add(tokenStream.getAttribute(CharTermAttribute.class).toString());
        }
      } catch (IOException e) {
        throw new RuntimeException(e);  // Shouldn't happen when reading from a StringReader
      }
      return tokens;
    }
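
    For example, calling the helper with Lucene's StandardAnalyzer might look like this (a minimal usage sketch; the no-arg StandardAnalyzer constructor assumed here is available in recent Lucene versions):

    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    try (StandardAnalyzer analyzer = new StandardAnalyzer()) {
      List<String> tokens = tokenizeString(analyzer, "The quick brown fox");
      // Prints the lowercased terms; depending on the Lucene version,
      // English stop words such as "the" may also be removed.
      System.out.println(tokens);
    }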
    

    And the Tokenizer version:

      try (Tokenizer chineseTokenizer = new HMMChineseTokenizer()) {
        chineseTokenizer.setReader(new StringReader("我说汉语说得很好"));
        chineseTokenizer.reset();  // required before the first incrementToken() call
        while (chineseTokenizer.incrementToken()) {
          System.out.println(chineseTokenizer.getAttribute(CharTermAttribute.class).toString());
        }
      } catch (IOException e) {
        throw new RuntimeException(e);  // Shouldn't happen when reading from a StringReader
      }
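
    If you would rather keep using the tokenizeString helper above, you can wrap a raw Tokenizer in a small anonymous Analyzer. This is a sketch, assuming a recent Lucene version where createComponents takes only the field name and where HMMChineseTokenizer ships in the lucene-analyzers-smartcn module:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;

    Analyzer chineseAnalyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // Each call supplies a fresh tokenizer for the analysis chain
        return new TokenStreamComponents(new HMMChineseTokenizer());
      }
    };

    List<String> tokens = tokenizeString(chineseAnalyzer, "我说汉语说得很好");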
    
