Count the number of times each word occurs in a file

挽巷 2020-12-12 02:51

Hi, I am writing a program that counts the number of times each word occurs in a file. It then prints a list of the words with counts between 800 and 1000, sorted in the order of

4 answers
  •  不思量自难忘°
    2020-12-12 03:05

    Just for fun, I did a solution in C++0x style, using Boost MultiIndex.

    This style would be quite clumsy without the auto keyword (type inference).

    By maintaining the indexes by word and by frequency at all times, there is no need to remove, partition, or sort the word list: it will all be there.

    To compile and run:

    g++ --std=c++0x -O3 test.cpp -o test
    curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz |
        tar xzO | sed 's/^[ 0-9:]\+//' > bible.txt
    time ./test
    


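    // Headers required by the code below
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <boost/foreach.hpp>
    #include <boost/lambda/bind.hpp>
    #include <boost/lambda/lambda.hpp>
    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/member.hpp>
    #include <boost/multi_index/ordered_index.hpp>
    
    using namespace std;
    
    struct entry 
    {
        string word;
        size_t freq;
        void increment() { freq++; }
    };
    
    struct byword {}; // TAG
    struct byfreq {}; // TAG
    
    int main() 
    {
        using ::boost::lambda::_1;
        using namespace ::boost::multi_index;
        multi_index_container<entry, indexed_by<
                ordered_unique    <tag<byword>, member<entry, string, &entry::word> >, // alphabetically
                ordered_non_unique<tag<byfreq>, member<entry, size_t, &entry::freq> > // by frequency
                    > > tally;
    
        ifstream inFile("bible.txt");
        string s;
        while (inFile>>s)
        {
            auto& lookup = tally.get<byword>();
            auto it = lookup.find(s);
    
            if (lookup.end() != it)
                // bump the count; modify() also repositions the entry in the byfreq index
                lookup.modify(it, boost::lambda::bind(&entry::increment, _1));
            else
                lookup.insert({s, 1});
        }
    
        // the byfreq index is already sorted, so the range is read off directly
        BOOST_FOREACH(auto e, tally.get<byfreq>().range(800 <= _1, _1 <= 1000))
            cout << e.freq << "\t" << e.word << endl;
    
    }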
    

    Note how

    • it became just slightly more convenient to define a custom entry type instead of using std::pair
    • for obvious reasons, this is slower than my earlier code, because it maintains the index by frequency during the insertion phase. That is unnecessary, but it makes extracting the [800,1000] range much more efficient:

      tally.get<byfreq>().range(800 <= _1, _1 <= 1000)
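
The range call above behaves much like a lower_bound/upper_bound pair on an ordered container. As a rough standard-library illustration only (not Boost code and not the author's code), here is a minimal sketch assuming the frequencies are already held in a std::multimap keyed by count; the helper name in_range is hypothetical:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the (frequency, word) pairs with
// lo <= frequency <= hi, in ascending order of frequency.
std::vector<std::pair<std::size_t, std::string>>
in_range(const std::multimap<std::size_t, std::string>& by_freq,
         std::size_t lo, std::size_t hi)
{
    if (lo > hi)
        return {};                         // empty interval, nothing to report

    // Two logarithmic lookups bound the range; the container is never re-sorted.
    auto first = by_freq.lower_bound(lo);  // first entry with frequency >= lo
    auto last  = by_freq.upper_bound(hi);  // first entry with frequency >  hi
    return std::vector<std::pair<std::size_t, std::string>>(first, last);
}
```

Calling in_range(by_freq, 800, 1000) visits only the entries inside the interval, which is why keeping the frequency index ordered pays off at extraction time.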

    The multi-set of frequencies is already ordered. So the actual speed/memory trade-off might tip in favour of this version, especially when documents are large and contain very few duplicated words (alas, this property is known not to hold for the corpus text of the Bible, lest someone translate it to neologorrhea).
