Hi I am writing a program that counts the number of times each word occurs in a file. Then it prints a list of words with counts between 800 and 1000, sorted in the order of
One solution could be this: define a letter_only locale facet so that the stream ignores punctuation and reads only valid English letters from the input. That way, the stream treats "ways", "ways." and "ways!" as the same word "ways", because punctuation such as "." and "!" is classified as whitespace.
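#include <algorithm>
#include <locale>
#include <vector>

struct letter_only: std::ctype<char>
{
    letter_only(): std::ctype<char>(get_table()) {}
    static std::ctype_base::mask const* get_table()
    {
        // every character counts as whitespace by default...
        static std::vector<std::ctype_base::mask>
            rc(std::ctype<char>::table_size,std::ctype_base::space);
        // ...except letters; the two ranges are filled separately so that
        // '[', '\', ']', '^', '_' and '`' (between 'Z' and 'a') stay separators
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};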
And then use it as:
#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <string>

int main()
{
    std::map<std::string, int> wordCount;
    std::ifstream input;
    // read English letters only; everything else acts as a separator
    input.imbue(std::locale(std::locale(), new letter_only()));
    input.open("filename.txt");
    std::string word;
    while(input >> word)
    {
        // start from an empty string for every word; otherwise
        // back_inserter keeps appending to the previous words
        std::string uppercase_word;
        std::transform(word.begin(),
                       word.end(),
                       std::back_inserter(uppercase_word),
                       (int(&)(int))std::toupper); //the cast is needed!
        ++wordCount[uppercase_word];
    }
    for (std::map<std::string, int>::iterator it = wordCount.begin();
         it != wordCount.end();
         ++it)
    {
        std::cout << "word = "<< it->first
                  <<" : count = "<< it->second << std::endl;
    }
}
Just for fun, I did a solution in c++0x style, using Boost MultiIndex.
This style would be quite clumsy without the auto keyword (type inference).
By maintaining the indexes by word and by frequency at all times, there is no need to remove, partition, nor sort the wordlist: it'll all be there.
To compile and run:
g++ --std=c++0x -O3 test.cpp -o test
curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz |
tar xzO | sed 's/^[ 0-9:]\+//' > bible.txt
time ./test
#include <boost/foreach.hpp>
#include <boost/lambda/lambda.hpp>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
struct entry
{
    string word;
    size_t freq;
    void increment() { freq++; }
};

struct byword {}; // TAG
struct byfreq {}; // TAG

int main()
{
    using ::boost::lambda::_1;
    using namespace ::boost::multi_index;

    multi_index_container<entry, indexed_by< // sequenced<>,
        ordered_unique    <tag<byword>, member<entry,string,&entry::word> >, // alphabetically
        ordered_non_unique<tag<byfreq>, member<entry,size_t,&entry::freq> >  // by frequency
        > > tally;

    ifstream inFile("bible.txt");
    string s;
    while (inFile>>s)
    {
        auto& lookup = tally.get<byword>();
        auto it = lookup.find(s);

        if (lookup.end() != it)
            // modify() re-sorts the byfreq index after the count changes;
            // a plain lambda avoids mixing boost::bind with the Boost.Lambda _1
            lookup.modify(it, [](entry& e) { e.increment(); });
        else
            lookup.insert({s, 1});
    }

    BOOST_FOREACH(auto e, tally.get<byfreq>().range(800 <= _1, _1 <= 1000))
        cout << e.freq << "\t" << e.word << endl;
}
Note how this version uses a custom entry type instead of std::pair. Also, for obvious reasons, it is slower than my earlier code: it maintains the index by frequency during the insertion phase. This is unnecessary, but it makes for much more efficient extraction of the [800,1000] range:
tally.get<byfreq>().range(800 <= _1, _1 <= 1000)
The multi-set of frequencies is already ordered. So the actual speed/memory trade-off might tip in favour of this version, especially when documents are large and contain very few duplicated words (alas, this is a property known not to hold for the corpus text of the bible, lest someone translate it to neologorrhea).
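For readers without Boost at hand, the same idea (an already-ordered index makes the range query cheap) can be sketched with a plain std::multimap keyed by frequency. This is only an illustration under that assumption, not the MultiIndex code above; the helper name words_between is made up for the example.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: the words whose frequency lies in [lo, hi],
// lowest frequency first. by_freq is assumed to be keyed by frequency.
std::vector<std::string>
words_between(const std::multimap<std::size_t, std::string>& by_freq,
              std::size_t lo, std::size_t hi)
{
    std::vector<std::string> out;
    if (lo > hi)
        return out; // empty range, nothing to extract

    // lower_bound: first entry with key >= lo
    // upper_bound: first entry with key >  hi
    auto first = by_freq.lower_bound(lo);
    auto last  = by_freq.upper_bound(hi);
    for (auto it = first; it != last; ++it)
        out.push_back(it->second);
    return out;
}
```

Both bounds are found in logarithmic time, so the cost of the query depends mostly on how many words fall inside the range rather than on the size of the whole tally.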
Heh. I know that bluntly showing a solution is not really helping you. However, I glanced through your code and saw many unused and confused bits. Here's what I'd do:
#include <algorithm>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>
using namespace std;
// types
typedef std::pair<string, size_t> frequency_t;
typedef std::vector<frequency_t> words_t;

// predicates
static bool byDescendingFrequency(const frequency_t& a, const frequency_t& b)
{ return a.second > b.second; }

const struct isGTE // greater than or equal
{
    size_t inclusive_threshold;
    bool operator()(const frequency_t& record) const
    { return record.second >= inclusive_threshold; }
} over1000 = { 1001 }, over800 = { 800 };

int main()
{
    words_t words;
    {
        map<string, size_t> tally;
        ifstream inFile("bible.txt");
        string s;
        while (inFile >> s)
            tally[s]++;

        remove_copy_if(tally.begin(), tally.end(),
                       back_inserter(words), over1000);
    }

    words_t::iterator begin = words.begin(),
                      end = partition(begin, words.end(), over800);
    std::sort(begin, end, &byDescendingFrequency);

    for (words_t::const_iterator it=begin; it!=end; it++)
        cout << it->second << "\t" << it->first << endl;
}
Authorized Version:
993 because
981 men
967 day
954 over
953 God,
910 she
895 among
894 these
886 did
873 put
868 thine
864 hand
853 great
847 sons
846 brought
845 down
819 you,
811 so
Vulgata:
995 tuum
993 filius
993 nec
966 suum
949 meum
930 sum
919 suis
907 contra
902 dicens
879 tui
872 quid
865 Domine
863 Hierusalem
859 suam
839 suo
835 ipse
825 omnis
811 erant
802 se
Performance is about 1.12s for both files, but only 0.355s after drop-in replacing map<> with boost::unordered_map<>.
A more efficient approach uses a single map<string, int> of occurrences: read the words one by one and increment the counter in m[word]. After all words have been accounted for, iterate over the map and add the words in the given range to a multimap<int, string>. Finally, dump the contents of the multimap, which will be ordered by number of occurrences and then alphabetically...