How to remove stop words in Java?

Question

I want to remove stop words in Java.

First, I read the stop words from a text file and store them in a Set:

Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader br = new BufferedReader(new FileReader("stopwords.txt"));
String words = null;
while ((words = br.readLine()) != null) {
    stopWords.add(words.trim());
}
br.close();

Then I read another text file, and I want to remove the duplicate strings from it.

How can I do that?

Answer 1:

To remove duplicate words from a file, the high-level logic is as follows.

Read the file.
Loop through the file content, one line at a time:
- Tokenize the line on spaces with a string tokenizer.
- Add each token to your set. This ensures that there is only one entry per word.
Close the file.

Now you have a set that contains all the unique words in the file.
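
The steps above could be sketched roughly as follows. This is only a minimal illustration: the class name UniqueWords and the method uniqueWords are made up for this example, and a StringReader stands in for the real file so that the snippet runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper class; the name is not from the original answer.
class UniqueWords {

    // Collects every distinct whitespace-separated token, in first-seen order.
    static Set<String> uniqueWords(BufferedReader reader) {
        Set<String> unique = new LinkedHashSet<>();
        reader.lines().forEach(line -> {
            // Default delimiters are space, tab, newline, carriage return, form feed.
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                unique.add(tokens.nextToken()); // a Set silently ignores repeats
            }
        });
        return unique;
    }

    public static void main(String[] args) throws IOException {
        // For a real file, use: new BufferedReader(new FileReader("input.txt"))
        try (BufferedReader br = new BufferedReader(
                new StringReader("the cat and the dog\nand the bird"))) {
            System.out.println(uniqueWords(br)); // [the, cat, and, dog, bird]
        }
    }
}
```

A LinkedHashSet keeps the words in the order they first appear; a plain HashSet would also work if the order does not matter.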

Answer 2:

Use a Set for the stop words:

Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader SW = new BufferedReader(new FileReader("StopWord.txt"));
for (String line; (line = SW.readLine()) != null;)
    stopWords.add(line.trim());
SW.close();

and an ArrayList for the input text file:

BufferedReader br = new BufferedReader(new FileReader("txt_file.txt"));
// build your ArrayList of words here

// deletStopWord() removes every word that appears in your stop-word set
public ArrayList<String> deletStopWord(Set<String> stopWords, ArrayList<String> arraylist) {
    ArrayList<String> newList = new ArrayList<String>();
    for (String word : arraylist) {
        // keep only the words that are not stop words
        if (!stopWords.contains(word)) {
            newList.add(word);
        }
    }
    return newList;
}

arraylist = deletStopWord(stopWords, arraylist);
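
The answer leaves the step of building the list open. One possible way to do it, shown here only as a sketch, is to split the text on whitespace and then filter the resulting list. The names StopWordFilter, toWords and removeStopWords are invented for this example, and the case-insensitive comparison is an added assumption (it expects the stop words to be stored in lower case).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; none of these names come from the original answer.
class StopWordFilter {

    // Splits text on runs of whitespace and returns the words as a mutable list.
    static List<String> toWords(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) { // splitting an empty string yields one empty token
                words.add(w);
            }
        }
        return words;
    }

    // Returns a copy of words without the stop words.
    // Assumption: stopWords holds lower-case entries, so matching ignores case.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> kept = new ArrayList<>(words);
        kept.removeIf(w -> stopWords.contains(w.toLowerCase(Locale.ROOT)));
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "on"));
        List<String> words = toWords("The cat is on the mat");
        System.out.println(removeStopWords(words, stopWords)); // [cat, mat]
    }
}
```

Because Set.contains is a hash lookup, the filtering cost grows with the number of words in the text rather than with the size of the stop-word list.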

Answer 3:

Using an ArrayList may be easier.

public ArrayList<String> removeDuplicates(ArrayList<String> source) {
    ArrayList<String> newList = new ArrayList<String>();
    for (int i = 0; i < source.size(); i++) {
        String s = source.get(i);
        // keep only the first occurrence of each string
        if (!newList.contains(s)) {
            newList.add(s);
        }
    }
    return newList;
}

Hope this helps.

Answer 4:

If you simply want to remove a certain set of words from the words in a file, you can do it however you want. But if you are dealing with a problem involving natural language processing, you should use a library.

For example, using Lucene for tokenizing will seem more complicated at first, but it will deal with myriad complications that you will overlook, and allow for great flexibility should you change your mind on the specific stopwords, on how you are tokenizing, whether you care about case, etc.

Answer 5:

You should try using StringTokenizer.
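
As a rough illustration of that suggestion, the sketch below tokenizes one line with StringTokenizer and keeps only the tokens that are not stop words. The class name TokenizerStopWords, the method stripStopWords and the delimiter list are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;
import java.util.StringTokenizer;

// Hypothetical helper class; the names are not from the original answer.
class TokenizerStopWords {

    // Assumed delimiters: whitespace plus some common punctuation.
    private static final String DELIMITERS = " \t\n\r\f.,;:!?";

    // Returns the line with stop words removed; kept tokens are joined by one space.
    static String stripStopWords(String line, Set<String> stopWords) {
        StringTokenizer tokens = new StringTokenizer(line, DELIMITERS);
        StringJoiner kept = new StringJoiner(" ");
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "this"));
        System.out.println(stripStopWords("this is a test, a simple test.", stopWords));
        // prints: test simple test
    }
}
```

Note that the Java documentation describes StringTokenizer as a legacy class and recommends String.split or java.util.regex for new code, so this approach is mainly useful when the delimiters are a fixed set of characters.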

Answer 6:

This may be a late reply, but I hope it helps someone. A few days ago I created a small utility library that removes stop/stemmer words from a given text; it is available in the Maven repository and on GitHub.

exude library

Source: https://stackoverflow.com/questions/12469332/how-to-remove-stop-words-in-java

Tags

java

stop-words