Example using WikipediaTokenizer in Lucene

ε祈祈猫儿з 提交于 2019-12-19 11:42:31

问题


I want to use WikipediaTokenizer in lucene project - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html But I never used lucene. I just want to convert a wikipedia string into a list of tokens. But, I see that there are only four methods available in this class, end, incrementToken, reset, reset(reader). Can someone point me to an example to use it.

Thank you.


回答1:


In Lucene 3.0, next() method is removed. Now you should use incrementToken to iterate through the tokens and it returns false when you reach the end of the input stream. To obtain the each token, you should use the methods of the AttributeSource class. Depending on the attributes that you want to obtain (term, type, payload etc), you need to add the class type of the corresponding attribute to your tokenizer using addAttribute method.

Following partial code sample is from the test class of the WikipediaTokenizer which you can find if you download the source code of the Lucene.

...
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

while (tf.incrementToken()) {
  String tokText = termAtt.term();
  //System.out.println("Text: " + tokText + " Type: " + token.type());
  String expectedType = (String) tcm.get(tokText);
  assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
  assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType) == true);
  count++;
  if (typeAtt.type().equals(WikipediaTokenizer.ITALICS)  == true){
    numItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS)  == true){
    numBoldItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY)  == true){
    numCategory++;
  }
  else if (typeAtt.type().equals(WikipediaTokenizer.CITATION)  == true){
    numCitation++;
  }
}
...



回答2:


WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));

Token token = new Token();

token = tf.next(token);

http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html

Regards




回答3:


public class WikipediaTokenizerTest { static Logger logger = Logger.getLogger(WikipediaTokenizerTest.class); protected static final String LINK_PHRASES = "click [[link here again]] click [http://lucene.apache.org here again] [[Category:a b c d]]";

public WikipediaTokenizer testSimple() throws Exception {
    String text = "This is a [[Category:foo]]";
    return new WikipediaTokenizer(new StringReader(text));
}
public static void main(String[] args){
    WikipediaTokenizerTest wtt = new WikipediaTokenizerTest();

    try {
        WikipediaTokenizer x = wtt.testSimple();

        logger.info(x.hasAttributes());

        Token token = new Token();
        int count = 0;
        int numItalics = 0;
        int numBoldItalics = 0;
        int numCategory = 0;
        int numCitation = 0;

        while (x.incrementToken() == true) {
            logger.info("seen something");
        }

    } catch(Exception e){
        logger.error("Exception while tokenizing Wiki Text: " + e.getMessage());
    }


}


来源:https://stackoverflow.com/questions/3924543/example-using-wikipediatokenizer-in-lucene

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!