Hi, I am writing a program that counts the number of times each word occurs in a file. It then prints a list of the words with counts between 800 and 1000, sorted in order of frequency.
Just for fun, I did a solution in C++0x style, using Boost MultiIndex.
This style would be quite clumsy without the auto keyword (type inference).
By maintaining the indices by word and by frequency at all times, there is no need to remove, partition, or sort the word list: it is all there already.
To compile and run:
g++ -std=c++0x -O3 test.cpp -o test
curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz |
tar xzO -f - | sed 's/^[ 0-9:]\+//' > bible.txt
time ./test
#include <iostream>
#include <fstream>
#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/lambda/lambda.hpp>
#include <boost/lambda/bind.hpp>
#include <boost/foreach.hpp>
using namespace std;
struct entry
{
string word;
size_t freq;
void increment() { freq++; }
};
struct byword {}; // TAG
struct byfreq {}; // TAG
int main()
{
using ::boost::lambda::_1;
using namespace ::boost::multi_index;
multi_index_container<entry, indexed_by<
        ordered_unique    <tag<byword>, member<entry, std::string, &entry::word> >, // alphabetically
        ordered_non_unique<tag<byfreq>, member<entry, size_t,      &entry::freq> >  // by frequency
    > > tally;
ifstream inFile("bible.txt");
string s;
while (inFile>>s)
{
auto& lookup = tally.get<byword>();
auto it = lookup.find(s);
if (lookup.end() != it)
lookup.modify(it, boost::lambda::bind(&entry::increment, _1));
else
lookup.insert({s, 1});
}
BOOST_FOREACH(auto e, tally.get<byfreq>().range(800 <= _1, _1 <= 1000))
cout << e.freq << "\t" << e.word << endl;
}
Note how I used a simple entry struct instead of std::pair (for obvious reasons). Unlike my earlier code, this version maintains the index by frequency during the insertion phase. That is unnecessary work while counting, and it makes insertion slower, but it pays off with much more efficient extraction of the [800,1000] range: the multi-set of frequencies behind tally.get<byfreq>() is already ordered. So the actual speed/memory trade-off might tip in favour of this version, especially when documents are large and contain very few duplicated words (alas, this is a property known not to hold for the corpus text of the bible, lest someone translate it to neologorrhea).