Read a .txt file and return a list of words with their frequency in the file

Question

I have this so far but it only prints the .txt file to the screen:

import java.io.*;

public class ReadFile {
    public static void main(String[] args) throws IOException {
        String Wordlist;
        int Frequency;

        File file = new File("file1.txt");
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
        String line = null;

        while( (line = br.readLine()) != null) {
            String [] tokens = line.split("\\s+");
            System.out.println(line);
        }
    }
}

Can anyone help me so that it prints a word list and each word's frequency?

Answer 1:

Do something like this. I'm assuming only commas or periods can occur in the file; otherwise you'll have to remove other punctuation characters as well. I'm using a TreeMap so that the words in the map are stored in their natural alphabetical order.

  // Requires: import java.io.*; import java.util.TreeMap;
  public static TreeMap<String, Integer> generateFrequencyList()
    throws IOException {
    TreeMap<String, Integer> wordsFrequencyMap = new TreeMap<String, Integer>();
    String file = "/tmp/lorem.txt";
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
      String line;
      while ((line = br.readLine()) != null) {
        String[] tokens = line.split("\\s+");
        for (String token : tokens) {
          String word = removePunctuation(token).toLowerCase();
          if (word.isEmpty()) {
            continue; // blank line, or a token with no letters in it
          }
          if (!wordsFrequencyMap.containsKey(word)) {
            wordsFrequencyMap.put(word, 1);
          } else {
            int count = wordsFrequencyMap.get(word);
            wordsFrequencyMap.put(word, count + 1);
          }
        }
      }
    }
    return wordsFrequencyMap;
  }

  // Strips every character that is not an ASCII letter
  private static String removePunctuation(String token) {
    return token.replaceAll("[^a-zA-Z]", "");
  }

The main method for testing is shown below. To get the percentages, first sum all the values in the map to get the total word count, then make a second pass over the map to compute each word's share of that total. By the way, if this is part of a larger work, you could also take a look at the Apache Commons Math library for calculating frequency distributions. If you use its Frequency class, you can keep adding words to it and then query the counts and percentages at the end.

  public static void main(String[] args) {
    try {
      TreeMap<String, Integer> freqMap = generateFrequencyList();

      // First pass: total number of words
      int totalWords = 0;
      for (String key : freqMap.keySet()) {
        totalWords += freqMap.get(key);
      }

      // Second pass: print each word with its count and percentage of the total
      System.out.println("Word\tCount\tPercentage");
      for (String key : freqMap.keySet()) {
        int count = freqMap.get(key);
        System.out.println(key + "\t" + count + "\t" + (count * 100.0 / totalWords));
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

Answer 2:

Does it have to be in Java? This does the job:

sed 's/[^A-Za-z]/\n/g' filename.txt | sort | uniq -c

Basically, turn any non-alphabetic character into a newline, sort the list of items, and let uniq count the occurrences. Just discard the first line of output, which is the number of empty lines. This is fast to run, and even faster to code.

You can adjust the character class to taste, for example to include digits, [A-Za-z0-9], or accented characters for other languages, [A-Za-zàèìòù].
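If it does have to be Java, roughly the same pipeline can be sketched with streams. This is only an illustration, not part of the answer above: the class name WordFrequency and the method countWords are made up for the example, and it assumes the file path is passed as the first command-line argument. Like the sed command, it treats every non-alphabetic character as a separator and does not fold case.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class name, chosen for this sketch only.
public class WordFrequency {

    // Counts words in the given lines. A "word" is a maximal run of ASCII
    // letters, mirroring the [^A-Za-z] separator used in the sed command.
    public static Map<String, Long> countWords(Stream<String> lines) {
        return lines
                .flatMap(line -> Arrays.stream(line.split("[^A-Za-z]+")))
                .filter(word -> !word.isEmpty())   // split can yield a leading empty string
                .collect(Collectors.groupingBy(
                        Function.identity(),        // group equal words together
                        TreeMap::new,               // sorted keys, the role of `sort`
                        Collectors.counting()));    // one count per group, the role of `uniq -c`
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file to read is given as the first argument.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            countWords(lines).forEach((word, count) -> System.out.println(count + " " + word));
        }
    }
}
```

Because empty strings are filtered out, there is no "empty lines" entry to discard here. To merge "The" and "the", one option is to map each word through toLowerCase() before collecting.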

Answer 3:

Create a HashMap

HashMap<String, Integer> occurrences = new HashMap<String, Integer>();

Iterate through the token array of each line

for(String word: tokens) {
  // Do stuff
}

Then, for each word, check whether it has already been read

if(occurrences.containsKey(word))
    occurrences.put(word, occurrences.get(word)+1);
else
    occurrences.put(word, 1);

Full version:

// Requires: import java.io.*; import java.util.HashMap;
File file = new File("file1.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));

// Type arguments must be reference types, so Integer rather than int
HashMap<String, Integer> occurrences = new HashMap<String, Integer>();

String line = null;

while( (line = br.readLine()) != null){
    String [] tokens = line.split("\\s+");

    for(String word: tokens) {
        if(occurrences.containsKey(word))
            occurrences.put(word, occurrences.get(word)+1);
        else
            occurrences.put(word, 1);
    }
}
br.close();

Might be a typo in it, haven't tested it, but this should do the job.
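On Java 8 or later, the containsKey/put pair can be collapsed into a single Map.merge call. The following is a minimal sketch under that assumption; the class name MergeCount and the method count are hypothetical, and the lines are passed in directly rather than read from file1.txt so the example stays self-contained.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name, chosen for this sketch only.
public class MergeCount {

    // Counts each whitespace-separated token in the given lines.
    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> occurrences = new HashMap<>();
        for (String line : lines) {
            // trim() avoids an empty first token when a line starts with whitespace
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // the line was blank
                }
                // Stores 1 if the word is new, otherwise adds 1 to the stored count
                occurrences.merge(word, 1, Integer::sum);
            }
        }
        return occurrences;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = count(Arrays.asList("a b a", "b c"));
        for (Map.Entry<String, Integer> entry : result.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

A HashMap makes no promise about iteration order; swapping in a TreeMap would print the words alphabetically.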

Source: https://stackoverflow.com/questions/27325042/read-a-txt-file-and-return-a-list-of-words-with-their-frequency-in-the-file

Tags

java

filereader