Count the number of times each word occurs in a file

挽巷 2020-12-12 02:51

Hi, I am writing a program that counts the number of times each word occurs in a file. It then prints a list of words with counts between 800 and 1000, sorted in the order of

4 answers
  • 2020-12-12 03:01

    One solution: define a letter_only ctype facet that classifies every character except English letters as whitespace, and imbue the input stream with a locale that uses it. The stream then skips punctuation when extracting words, so "ways", "ways." and "ways!" are all read as the same word "ways".

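    #include <algorithm>
    #include <locale>
    #include <vector>
    
    // ctype facet that classifies every character except ASCII letters as whitespace
    struct letter_only: std::ctype<char> 
    {
        letter_only(): std::ctype<char>(get_table()) {}
    
        static std::ctype_base::mask const* get_table()
        {
            static std::vector<std::ctype_base::mask> 
                rc(std::ctype<char>::table_size,std::ctype_base::space);
    
            // fill the two letter ranges separately: a single 'A'..'z' range
            // would also mark [ \ ] ^ _ ` as letters
            std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
            std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
            return &rc[0];
        }
    };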
    

    And then use it as:

    #include <algorithm>
    #include <cctype>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>
    
    int main()
    {
         std::map<std::string, int> wordCount;
         std::ifstream input;
    
         // treat everything except English letters as whitespace
         input.imbue(std::locale(std::locale(), new letter_only())); 
    
         input.open("filename.txt");
         std::string word;
         while(input >> word)
         {
             // upper-case in place so "Ways" and "ways" share one counter
             std::transform(word.begin(), 
                            word.end(), 
                            word.begin(),
                            (int(&)(int))std::toupper); //the cast is needed!
             ++wordCount[word];
         }
         for (std::map<std::string, int>::iterator it = wordCount.begin(); 
                                                   it != wordCount.end(); 
                                                   ++it)
         {
               std::cout << "word = "<< it->first 
                         <<" : count = "<< it->second << std::endl;
         }
    }
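
    The effect of the facet can be checked without a file. The following is a minimal sketch, not the answer's own code: letters_only_facet and count_words are hypothetical names, and the helper reads from any std::istream so that a std::istringstream can stand in for the input file.

```cpp
#include <algorithm>
#include <cctype>
#include <istream>
#include <locale>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical facet name: every character is whitespace unless it is an
// ASCII letter.
struct letters_only_facet : std::ctype<char>
{
    letters_only_facet() : std::ctype<char>(get_table()) {}

    static const std::ctype_base::mask* get_table()
    {
        static std::vector<std::ctype_base::mask>
            table(std::ctype<char>::table_size, std::ctype_base::space);
        std::fill(&table['A'], &table['Z' + 1], std::ctype_base::alpha);
        std::fill(&table['a'], &table['z' + 1], std::ctype_base::alpha);
        return &table[0];
    }
};

// Hypothetical helper: counts upper-cased words read from any input stream.
// Note that it changes the locale of the stream it is given.
std::map<std::string, int> count_words(std::istream& in)
{
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(std::locale(), new letters_only_facet()));

    std::map<std::string, int> counts;
    std::string word;
    while (in >> word)
    {
        for (std::string::size_type i = 0; i < word.size(); ++i)
            word[i] = static_cast<char>(
                std::toupper(static_cast<unsigned char>(word[i])));
        ++counts[word];
    }
    return counts;
}
```

    Passing a std::istringstream holding "ways ways. Ways!" to count_words should give a single entry, WAYS, with a count of 3, because the punctuation is skipped as whitespace and the case is folded.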
    
  • 2020-12-12 03:05

    Just for fun, I did a solution in C++0x style, using Boost MultiIndex.

    This style would be quite clumsy without the auto keyword (type inference).

    By maintaining the indexes by word and by frequency at all times, there is no need to remove, partition, nor sort the wordlist: it'll all be there.

    To compile and run:

    g++ --std=c++0x -O3 test.cpp -o test
    curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz |
        tar xzO | sed 's/^[ 0-9:]\+//' > bible.txt
    time ./test
    

    The program (test.cpp):

    #include <boost/foreach.hpp>
    #include <boost/lambda/lambda.hpp>
    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/ordered_index.hpp>
    #include <boost/multi_index/member.hpp>
    #include <fstream>
    #include <iostream>
    #include <string>
    
    using namespace std;
    
    struct entry 
    {
        string word;
        size_t freq;
        void increment() { freq++; }
    };
    
    struct byword {}; // TAG
    struct byfreq {}; // TAG
    
    int main() 
    {
        using ::boost::lambda::_1;
        using namespace ::boost::multi_index;
        multi_index_container<entry, indexed_by< // sequenced<>,
                ordered_unique    <tag<byword>, member<entry,string,&entry::word> >, // alphabetically
                ordered_non_unique<tag<byfreq>, member<entry,size_t,&entry::freq> > // by frequency
                    > > tally;
    
        ifstream inFile("bible.txt");
        string s;
        while (inFile>>s)
        {
            auto& lookup = tally.get<byword>();
            auto it = lookup.find(s);
    
            if (lookup.end() != it)
                // a C++0x lambda, so <boost/bind.hpp> is not needed
                lookup.modify(it, [](entry& e) { e.increment(); });
            else
                lookup.insert({s, 1});
        }
    
        BOOST_FOREACH(auto e, tally.get<byfreq>().range(800 <= _1, _1 <= 1000))
            cout << e.freq << "\t" << e.word << endl;
    
    }
    

    Note how

    • it became just slightly more convenient to define a custom entry type instead of using std::pair
    • for obvious reasons, this is slower than my earlier code, because it maintains the index by frequency during the insertion phase. That is unnecessary, but it makes extracting the [800,1000] range much more efficient:

      tally.get<byfreq>().range(800 <= _1, _1 <= 1000)

    The multi-set of frequencies is already ordered. So the actual speed/memory trade-off might tip in favour of this version, especially when documents are large and contain very few duplicated words (alas, this is a property known not to hold for the corpus text of the Bible, lest someone translate it to neologorrhea).
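
    Why an already-ordered frequency index makes the range cheap to extract can be sketched without Boost. In the following, a plain std::multimap keyed by frequency stands in for the byfreq index; that substitution, and the names by_freq_t and range_by_freq, are assumptions made for illustration and not part of the program above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in for an index ordered by frequency (frequency -> word).
typedef std::multimap<std::size_t, std::string> by_freq_t;

// Hypothetical helper: words with lo <= frequency <= hi, lowest count first.
std::vector<std::string> range_by_freq(const by_freq_t& index,
                                       std::size_t lo, std::size_t hi)
{
    std::vector<std::string> out;
    if (lo > hi)
        return out; // empty range, and avoids walking past the end

    // Two logarithmic lookups locate the boundaries; no sorting is needed
    // because the container keeps its keys ordered at all times.
    by_freq_t::const_iterator first = index.lower_bound(lo); // first key >= lo
    by_freq_t::const_iterator last  = index.upper_bound(hi); // first key >  hi
    for (; first != last; ++first)
        out.push_back(first->second);
    return out;
}
```

    The cost of the query is two tree lookups plus one step per word returned, which is the property the range() call on the byfreq index relies on.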

  • 2020-12-12 03:19

    Heh. I know bluntly showing a solution is not really helping you. However.

    I glanced through your code and saw many unused and confused bits. Here's what I'd do:

    #include <algorithm>
    #include <fstream>
    #include <functional>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <string>
    #include <vector>
    
    using namespace std;
    
    // types
    typedef std::pair<string, size_t> frequency_t;
    typedef std::vector<frequency_t> words_t;
    
    // predicates
    static bool byDescendingFrequency(const frequency_t& a, const frequency_t& b)
    { return a.second > b.second; }
    
    const struct isGTE // greater than or equal
    { 
        size_t inclusive_threshold;
        bool operator()(const frequency_t& record) const 
            { return record.second >= inclusive_threshold; }
    } over1000 = { 1001 }, over800  = { 800 };
    
    int main() 
    {
        words_t words;
        {
            map<string, size_t> tally;
    
            ifstream inFile("bible.txt");
            string s;
            while (inFile >> s)
                tally[s]++;
    
            remove_copy_if(tally.begin(), tally.end(), 
                           back_inserter(words), over1000);
        }
    
        words_t::iterator begin = words.begin(),
                          end = partition(begin, words.end(), over800);
        std::sort(begin, end, &byDescendingFrequency);
    
        for (words_t::const_iterator it=begin; it!=end; it++)
            cout << it->second << "\t" << it->first << endl;
    }
    

    Authorized Version:

    993 because
    981 men
    967 day
    954 over
    953 God,
    910 she
    895 among
    894 these
    886 did
    873 put
    868 thine
    864 hand
    853 great
    847 sons
    846 brought
    845 down
    819 you,
    811 so
    

    Vulgata:

    995 tuum
    993 filius
    993 nec
    966 suum
    949 meum
    930 sum
    919 suis
    907 contra
    902 dicens
    879 tui
    872 quid
    865 Domine
    863 Hierusalem
    859 suam
    839 suo
    835 ipse
    825 omnis
    811 erant
    802 se
    

    Performance is about 1.12s for both files, but only 0.355s after drop-in replacing map<> with boost::unordered_map<>.
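
    The hash-map variant could look roughly like the sketch below. It assumes C++11 and uses std::unordered_map rather than boost::unordered_map, so the timings above do not necessarily carry over; words_in_range is a hypothetical helper name, and it takes the bounds as parameters instead of the fixed over800/over1000 predicates.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::size_t> frequency_t;

// Hypothetical helper: words whose count lies in [lo, hi], most frequent first.
std::vector<frequency_t> words_in_range(std::istream& in,
                                        std::size_t lo, std::size_t hi)
{
    // Hash-based tally: no ordering is maintained while counting.
    std::unordered_map<std::string, std::size_t> tally;
    std::string s;
    while (in >> s)
        ++tally[s];

    std::vector<frequency_t> result;
    for (const auto& kv : tally)
        if (kv.second >= lo && kv.second <= hi)
            result.push_back(frequency_t(kv.first, kv.second));

    // The hash map iterates in unspecified order, so break ties between
    // equal counts alphabetically to keep the output deterministic.
    std::sort(result.begin(), result.end(),
              [](const frequency_t& a, const frequency_t& b) {
                  if (a.second != b.second)
                      return a.second > b.second;
                  return a.first < b.first;
              });
    return result;
}
```

    Only the words that survive the range filter are sorted, so the ordering work stays small even when the tally itself is large.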

  • 2020-12-12 03:21

    A more efficient approach uses a single map<string, int> of occurrences: read the words one by one and increment the counter in m[word]. After all words have been counted, iterate over the map and add each word whose count is in the given range to a multimap<int, string>. Finally, dump the contents of the multimap, which is ordered by number of occurrences; words with equal counts come out alphabetically, because they were inserted in the map's alphabetical order and, since C++11, equal keys keep their insertion order.
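
    The steps above can be sketched as follows. The function name words_by_count and the choice of std::size_t for the counter are assumptions for illustration; the input is any std::istream, so an std::ifstream opened on the file can be passed in.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical helper: count words, then index those in [lo, hi] by count.
std::multimap<std::size_t, std::string>
words_by_count(std::istream& in, std::size_t lo, std::size_t hi)
{
    // Step 1: one counter per distinct word.
    std::map<std::string, std::size_t> m;
    std::string word;
    while (in >> word)
        ++m[word];

    // Step 2: copy only the words in range into a multimap keyed by count.
    std::multimap<std::size_t, std::string> by_count;
    for (std::map<std::string, std::size_t>::const_iterator it = m.begin();
         it != m.end(); ++it)
    {
        if (it->second >= lo && it->second <= hi)
            by_count.insert(std::make_pair(it->second, it->first));
    }
    return by_count;
}
```

    Iterating the result from begin() lists the least frequent words first; iterating from rbegin() gives the most frequent first, with ties then appearing in reverse alphabetical order.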
