Elegant ways to count the frequency of words in a file

Submitted by 我与影子孤独终老i on 2019-12-27 11:04:56

Question


What are the elegant and effective ways to count the frequency of each "english" word in a file?


Answer 1:


First of all, I define a letter_only std::locale so as to ignore punctuation coming from the stream, and to read only valid "english" letters from the input stream. That way, the stream treats the words "ways", "ways." and "ways!" as the same word "ways", because it ignores punctuation like "." and "!".

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <map>
#include <string>
#include <vector>

struct letter_only: std::ctype<char>
{
    letter_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size, std::ctype_base::space);

        // In ASCII, 'A'..'z' also spans [ \ ] ^ _ ` — mark the two letter ranges separately
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};

Solution 1

int main()
{
    std::map<std::string, int> wordCount;
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // enable reading only letters!
    input.open("filename.txt");
    std::string word;
    while (input >> word)
    {
        ++wordCount[word];
    }
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}

Solution 2

struct Counter
{
    std::map<std::string, int> wordCount;
    void operator()(const std::string & item) { ++wordCount[item]; }
    operator std::map<std::string, int>() { return wordCount; }
};

int main()
{
    std::ifstream input;
    input.imbue(std::locale(std::locale(), new letter_only())); // enable reading only letters!
    input.open("filename.txt");
    std::istream_iterator<std::string> start(input);
    std::istream_iterator<std::string> end;
    std::map<std::string, int> wordCount = std::for_each(start, end, Counter());
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " : " << it->second << std::endl;
    }
}



Answer 2:


Here is a working solution. It should work with real text (including punctuation):

#include <iterator>
#include <iostream>
#include <fstream>
#include <map>
#include <string>
#include <cctype>

std::string getNextToken(std::istream &in)
{
    // use int so EOF is distinguishable, and cast to unsigned char for <cctype>
    int c = in.get();
    std::string ans;
    while (c != EOF && !std::isalpha(static_cast<unsigned char>(c))) // skip non-letter characters
    {
        c = in.get();
    }
    while (c != EOF && std::isalpha(static_cast<unsigned char>(c)))
    {
        ans.push_back(static_cast<char>(std::tolower(c)));
        c = in.get();
    }
    return ans;
}

int main()
{
    std::map<std::string,int> words;
    std::ifstream fin("input.txt");

    std::string s;
    while (!(s = getNextToken(fin)).empty())
        ++words[s];

    for(std::map<std::string,int>::iterator iter = words.begin(); iter!=words.end(); ++iter)
        std::cout<<iter->first<<' '<<iter->second<<std::endl;
}

Edit: the code now calls tolower for every letter.




Answer 3:


My solution is the following one. Firstly, all symbols are converted to spaces. Then, basically the same solution provided here before is used in order to extract words:

#include <fstream>
#include <map>
#include <sstream>
#include <string>

const std::string Symbols = ",;.:-()\t!¡¿?\"[]{}&<>+-*/=#'";
typedef std::map<std::string, unsigned int> WCCollection;

void countWords(const std::string &fileName, WCCollection &wcc)
{
    std::ifstream input( fileName.c_str() );

    if ( input.is_open() ) {
        std::string line;
        std::string word;

        while( std::getline( input, line ) ) {
            // Substitute punctuation symbols with spaces
            // (must be a mutable iterator: we write through it)
            for(std::string::iterator it = line.begin(); it != line.end(); ++it) {
                if ( Symbols.find( *it ) != std::string::npos ) {
                    *it = ' ';
                }
            }

            // Let std::operator>> separate by spaces
            std::istringstream filter( line );
            while( filter >> word ) {
                ++( wcc[word] );
            }
        }
    }
}



Answer 4:


Pseudocode for an algorithm which I believe to be close to what you want:

counts = defaultdict(int)
for line in file:
    for word in line.split():
        if any(x.isalpha() for x in word):
            counts[word.upper()] += 1

freq = sorted(((count, word) for word, count in counts.items()), reverse=True)
for count, word in freq:
    print("%d\t%s" % (count, word))

Case-insensitive comparison is handled naïvely and probably combines words you don't want to combine in an absolutely general sense. Be careful of non-ASCII characters in your implementation of the above. False positives may include "1-800-555-TELL", "0xDEADBEEF", and "42 km", depending on what you want. Missed words include "911 emergency services" (I'd probably want that counted as three words).

In short, natural language parsing is hard: you can probably make do with some approximation depending on your actual use case.




Answer 5:


  1. Decide on exactly what you mean by "an English word". The definition should cover things like whether "able-bodied" is one word or two, how to handle apostrophes ("Don't trust 'em!"), whether capitalization is significant, etc.

  2. Create a set of test cases so you can be sure you get all the decisions in step 1 correct.

  3. Create a tokenizer that reads the next word (as defined in step 1) from the input and returns it in a standard form. Depending on your definition, this might be a simple state machine, a regular expression, or just relying on <istream>'s extraction operators (e.g., std::cin >> word;). Test your tokenizer with all the test cases from step 2.

  4. Choose a data structure for keeping the words and counts. In modern C++, you'd probably end up with something like std::map<std::string, unsigned> or std::unordered_map<std::string, int>.

  5. Write a loop that gets the next word from the tokenizer and increments its count in the histogram until there are no more words in the input.




Answer 6:


Perl is arguably not so elegant, but very effective.
I posted a solution here: Processing huge text files

In a nutshell,

1) If needed, strip punctuation and convert uppercase to lowercase:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file

2) Count the occurrence of each word. Print results sorted first by frequency, and then alphabetically:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq

I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in under 3 minutes.




Answer 7:


One more simple way is to count the number of spaces in the file rather than the words, treating a run of more than one consecutive space as a single separator — assuming words are separated only by single spaces...




Answer 8:


string mostCommon( string filename ) {

    ifstream input( filename );
    string line;
    string mostFreqUsedWord;
    string token;
    map< string, int > wordFreq;

    if ( input.is_open() ) {

        while ( input >> token ) {
            wordFreq[ token ]++;
            if ( wordFreq[ token ] > wordFreq[ mostFreqUsedWord ] )
                mostFreqUsedWord = token;
        }
        input.close();
    } else {
        cout << "Unable to open file." << endl;
    }
    return mostFreqUsedWord;
}


Source: https://stackoverflow.com/questions/4888879/elegant-ways-to-count-the-frequency-of-words-in-a-file
