Java Lucene Stop Words Filter

故事扮演 提交于 2019-12-13 03:43:44

问题


I have about 500 sentences in which I would like to compile a set of ngrams. I am having trouble removing the stop words. I tried adding the lucene StandardFilter and StopFilter but I still have the same problem. Here is my code:

for(String curS: Sentences)
{
          reader = new StringReader(curS);
          tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
          tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
          tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopWords);
          tokenizer = new ShingleFilter(tokenizer, 2, 3);
          charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);

    while(tokenizer.incrementToken())
    {
        curNGram = charTermAttribute.toString().toString();
        nGrams.add(curNGram);                   //store each token into an ArrayList
    }
}

For example, the first phrase I am testing is: "For every person that listens to". In this example curNGram is set to "For" which is a stop word in my list stopWords. Also, in this example "every" is a stop word and so "person" should be the first ngram.

  1. Why are stop words being added to my list when I am using the StopFiler?

All help is appreciated!


回答1:


What you've posted looks okay to me, so I suspect that stopWords isn't providing the information you want to the filter.

Try something like:

//Let's say we read the stop words into an array list (A simple array, or any list implementation should be fine)
List<String> words = new ArrayList();
//Read the file into words.
Set stopWords = StopFilter.makeStopSet(Version.LUCENE_36, words, true);

Assuming the list you of stopwords you generated (the one I've named 'words') looks like you think it does, this should put them into a format usable to the StopFilter.

Were you already generating stopWords like that?



来源:https://stackoverflow.com/questions/13501421/java-lucene-stop-words-filter

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!