Hi, I am writing a program that counts the number of times each word occurs in a file. It then prints a list of the words with counts between 800 and 1000, sorted in order of frequency.
Just for fun, I did a solution in C++0x style, using Boost MultiIndex.
This style would be quite clumsy without the auto keyword (type inference).
By maintaining the indices by word and by frequency at all times, there is no need to remove, partition, or sort the word list: it is all there already.
To compile and run:
g++ -std=c++0x -O3 test.cpp -o test
curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz |
tar xzO -f - | sed 's/^[ 0-9:]\+//' > bible.txt
time ./test
#include <iostream>
#include <fstream>
#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/lambda/lambda.hpp>
#include <boost/lambda/bind.hpp>
#include <boost/foreach.hpp>
using namespace std;
struct entry
{
string word;
size_t freq;
void increment() { freq++; }
};
struct byword {}; // TAG
struct byfreq {}; // TAG
int main()
{
using ::boost::lambda::_1;
using namespace ::boost::multi_index;
multi_index_container<entry, indexed_by<
        ordered_unique    <tag<byword>, member<entry, std::string, &entry::word> >, // alphabetically
        ordered_non_unique<tag<byfreq>, member<entry, size_t,      &entry::freq> >  // by frequency
    > > tally;
ifstream inFile("bible.txt");
string s;
while (inFile>>s)
{
auto& lookup = tally.get<byword>();
auto it = lookup.find(s);
if (lookup.end() != it)
lookup.modify(it, boost::lambda::bind(&entry::increment, _1));
else
lookup.insert({s, 1});
}
BOOST_FOREACH(auto e, tally.get<byfreq>().range(800 <= _1, _1 <= 1000))
cout << e.freq << "\t" << e.word << endl;
}
Note how I used a simple entry struct instead of std::pair (for obvious reasons). Unlike my earlier code, this version maintains the index by frequency during the insertion phase. That is unnecessary work while counting, and it makes insertion slower, but it pays off with much more efficient extraction of the [800,1000] range: the multi-set of frequencies behind tally.get<byfreq>() is already ordered. So the actual speed/memory trade-off might tip in favour of this version, especially when documents are large and contain very few duplicated words (alas, this is a property known not to hold for the corpus text of the bible, lest someone translate it to neologorrhea).