Stop words and stemmer in Java

Submitted by 纵饮孤独 on 2019-12-03 04:49:25

Unless you're implementing this for academic reasons, you should consider using the Lucene library; even if you are, it is a good reference. It has classes for tokenization, stop word filtering, stemming and similarity. Here's a quick example using Lucene 3.0 to remove stop words and stem an input string:

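import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public static String removeStopWordsAndStem(String input) throws IOException {
    // Words to drop from the token stream.
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("I");
    stopWords.add("the");

    // Build the analysis chain: tokenize, filter stop words, then stem.
    TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_30, new StringReader(input));
    // The first argument is enablePositionIncrements in the Lucene 3.0 API.
    tokenStream = new StopFilter(true, tokenStream, stopWords);
    tokenStream = new PorterStemFilter(tokenStream);

    // Join the surviving, stemmed terms with single spaces.
    StringBuilder sb = new StringBuilder();
    TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(termAttr.term());
    }
    return sb.toString();
}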

Calling it on your strings like this:

public static void main(String[] args) throws IOException {
    String one = "I decided buy something from the shop.";
    String two = "Nevertheless I decidedly bought something from a shop.";
    System.out.println(removeStopWordsAndStem(one));
    System.out.println(removeStopWordsAndStem(two));
}

Yields this output:

decid bui someth from shop
Nevertheless decidedli bought someth from shop

Yes, you can wrap any stemmer so that you can write something like:

String stemmedString = stemmer.stemAndRemoveStopwords(inputString, stopWordList);

Internally, your stemAndRemoveStopwords would:

  • place all stop words in a hash-based Set (or Map) for fast lookup
  • initialize an empty StringBuilder to hold the output string
  • iterate over all words in the input string, and for each word
    • look it up in the stop word set; if found, continue to the top of the loop
    • otherwise, stem it using your preferred stemmer, and add it to the output string
  • return the output string
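
The steps above could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name StopwordStemmer is made up, the stemmer is passed in as a plain Function<String, String> rather than any specific library type, and words are split on whitespace only, so punctuation stays attached to the words.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical wrapper class; the names are illustrative, not from a library.
class StopwordStemmer {
    // Any stemmer can be plugged in here, e.g. a Porter stemmer implementation.
    private final Function<String, String> stemmer;

    public StopwordStemmer(Function<String, String> stemmer) {
        this.stemmer = stemmer;
    }

    public String stemAndRemoveStopwords(String input, Collection<String> stopWordList) {
        // Copy the stop words into a hash set for fast membership checks.
        Set<String> stopWords = new HashSet<>(stopWordList);
        StringBuilder out = new StringBuilder();

        // Assumption: words are separated by whitespace.
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty() || stopWords.contains(word)) {
                continue; // Skip stop words (case-sensitive match).
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(stemmer.apply(word));
        }
        return out.toString();
    }
}
```

A caller would construct it with whatever stemmer is preferred, for example new StopwordStemmer(myStemmer::stem), where myStemmer stands in for your own stemmer object.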

You don't have to process the whole text at once. Just split it into words, apply your stopword filter and stemming algorithm to each word, then build the string again using a StringBuilder:

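StringBuilder builder = new StringBuilder(text.length());
String[] words = text.split("\\s+");
for (String word : words) {
    // stopwordFilter and stemmer are placeholders for your own implementations;
    // check() is assumed to return true for words that should be kept.
    if (stopwordFilter.check(word)) { // Apply stopword filter.
        if (builder.length() > 0) {
            builder.append(' '); // Separate words with a single space.
        }
        builder.append(stemmer.stem(word)); // Apply stemming algorithm.
    }
}
text = builder.toString();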
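
The snippet above leaves stopwordFilter and stemmer undefined. One possible shape for them is sketched below. Everything here is an assumption made for illustration: the interface names, the SetStopwordFilter class, the choice to make check() return true for words that should be kept, and the case-insensitive matching.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical interfaces matching the calls used in the loop above.
interface StopwordFilter {
    // Assumed contract: returns true if the word should be kept.
    boolean check(String word);
}

interface Stemmer {
    String stem(String word);
}

// A simple filter backed by a hash set, matching case-insensitively.
class SetStopwordFilter implements StopwordFilter {
    private final Set<String> stopWords = new HashSet<>();

    public SetStopwordFilter(String... words) {
        for (String w : words) {
            stopWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public boolean check(String word) {
        // Keep the word only if its lower-cased form is not a stop word.
        return !stopWords.contains(word.toLowerCase(Locale.ROOT));
    }
}
```

Because Stemmer has a single method, an existing stemmer can usually be adapted with a lambda or method reference instead of a new class.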