Count the number of times each word occurs in a file

Anonymous (unverified), posted 2019-12-03 01:29:01

Question:

Hi, I am writing a program that counts the number of times each word occurs in a file. It then prints a list of the words whose counts are between 800 and 1000, sorted by count. I am stuck on keeping a counter that checks whether each word matches the next one until a new word appears. In main I try to open the file, read it word by word, and call sort inside the while loop to sort the vector. Then, in the for loop, I go through all the words, and if the first word equals the second I do count++. I don't think that is how you keep a counter.

Here is the code:

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    using namespace std;

    vector<string> lines;
    vector<[...]> second;
    set<string> words;
    multiset<string> multiwords;

    void readLines(const char *filename)
    {
        string line;
        ifstream infile;
        infile.open(filename);
        if (!infile)
        {
            cerr [...]

    [...] &v, int size, int value)
    {
        int from = 0;
        int to = size - 1;
        while (from [...]

    [...]
        vector<string> words;
        string x;
        ifstream inFile;
        int count = 0;

        inFile.open("bible.txt");
        if (!inFile)
        {
            cout [...]
        while (inFile >> x){
            sort(words.begin(), words.end());
        }

        for(int i = 0;i [...]

Answer 1:

One solution could be this: define a letter_only locale facet so that the stream ignores punctuation and reads only valid English letters. That way, the stream treats "ways", "ways." and "ways!" as the same word, "ways", because it skips punctuation such as "." and "!".

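    #include <algorithm>
    #include <locale>
    #include <vector>

    // ctype facet that treats everything except A-Z and a-z as whitespace
    struct letter_only: std::ctype<char>
    {
        letter_only(): std::ctype<char>(get_table()) {}

        static std::ctype_base::mask const* get_table()
        {
            static std::vector<std::ctype_base::mask>
                rc(std::ctype<char>::table_size, std::ctype_base::space);

            // two ranges: a single 'A'..'z' fill would also mark [ \ ] ^ _ ` as letters
            std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
            std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
            return &rc[0];
        }
    };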

And then use it as:

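    #include <algorithm>
    #include <cctype>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <string>

    int main()
    {
        std::map<std::string, int> wordCount;
        std::ifstream input;

        // read English letters only
        input.imbue(std::locale(std::locale(), new letter_only()));

        input.open("filename.txt");
        std::string word;
        while (input >> word)
        {
            // start from an empty string for every word; otherwise
            // back_inserter keeps appending to the previous result
            std::string uppercase_word;
            std::transform(word.begin(),
                           word.end(),
                           std::back_inserter(uppercase_word),
                           (int(&)(int))std::toupper); // the cast is needed!
            ++wordCount[uppercase_word];
        }
        for (std::map<std::string, int>::iterator it = wordCount.begin();
                                                  it != wordCount.end();
                                                  ++it)
        {
            std::cout << it->first << " : " << it->second << '\n';
        }
    }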


Answer 2:

Heh. I know that bluntly showing a solution is not really helping you. However...

I glanced through your code and saw many unused and confused bits. Here's what I'd do:

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using namespace std;

    // types
    typedef std::pair<std::string, size_t> frequency_t;
    typedef std::vector<frequency_t> words_t;

    // predicates
    static bool byDescendingFrequency(const frequency_t& a, const frequency_t& b)
    { return a.second > b.second; }

    const struct isGTE // greater than or equal
    {
        size_t inclusive_threshold;
        bool operator()(const frequency_t& record) const
            { return record.second >= inclusive_threshold; }
    } over1000 = { 1001 }, over800  = { 800 };

    int main()
    {
        words_t words;
        {
            map<std::string, size_t> tally;

            ifstream inFile("bible.txt");
            string s;
            while (inFile >> s)
                tally[s]++;

            remove_copy_if(tally.begin(), tally.end(),
                           back_inserter(words), over1000);
        }

        words_t::iterator begin = words.begin(),
                          end = partition(begin, words.end(), over800);
        std::sort(begin, end, &byDescendingFrequency);

        for (words_t::const_iterator it=begin; it!=end; it++)
            cout << it->second << " " << it->first << endl;
    }

Authorized Version:

    993 because
    981 men
    967 day
    954 over
    953 God,
    910 she
    895 among
    894 these
    886 did
    873 put
    868 thine
    864 hand
    853 great
    847 sons
    846 brought
    845 down
    819 you,
    811 so

Vulgata:

    995 tuum
    993 filius
    993 nec
    966 suum
    949 meum
    930 sum
    919 suis
    907 contra
    902 dicens
    879 tui
    872 quid
    865 Domine
    863 Hierusalem
    859 suam
    839 suo
    835 ipse
    825 omnis
    811 erant
    802 se

Performance is about 1.12s for both files, but only 0.355s after replacing map with boost::unordered_map as a drop-in.



Answer 3:

A more efficient approach uses a single map of occurrences: read the words one by one and increment the counter in m[word]. After all words have been counted, iterate over the map and add the words whose counts fall in the given range to a multimap. Finally, dump the contents of the multimap, which will be ordered by number of occurrences and then alphabetically.



Answer 4:

Just for fun, I did a solution in C++0x style, using Boost MultiIndex.

This style would be quite clumsy without the auto keyword (type inference).

By maintaining the indexes by word and by frequency at all times, there is no need to remove, partition, nor sort the wordlist: it'll all be there.

To compile and run:

    g++ --std=c++0x -O3 test.cpp -o test
    curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz |
        tar xzO | sed 's/^[ 0-9:]\+//' > bible.txt
    time ./test


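    #include <boost/bind.hpp>
    #include <boost/foreach.hpp>
    #include <boost/lambda/lambda.hpp>
    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/member.hpp>
    #include <boost/multi_index/ordered_index.hpp>
    #include <fstream>
    #include <iostream>

    using namespace std;

    struct entry
    {
        string word;
        size_t freq;
        void increment() { freq++; }
    };

    struct byword {}; // TAG
    struct byfreq {}; // TAG

    int main()
    {
        using ::boost::lambda::_1;
        using namespace ::boost::multi_index;
        multi_index_container<entry, indexed_by<
                ordered_unique    <tag<byword>, member<entry,string,&entry::word> >, // alphabetically
                ordered_non_unique<tag<byfreq>, member<entry,size_t,&entry::freq> > // by frequency
                    > > tally;

        ifstream inFile("bible.txt");
        string s;
        while (inFile>>s)
        {
            auto& lookup = tally.get<byword>();
            auto it = lookup.find(s);

            if (lookup.end() != it)
                lookup.modify(it, boost::bind(&entry::increment, _1));
            else
                lookup.insert({s, 1});
        }

        BOOST_FOREACH(auto e, tally.get<byfreq>().range(800 <= _1, _1 <= 1000))
            cout << e.freq << " " << e.word << endl;
    }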

Note how

  • it became just slightly more convenient to define a custom entry type instead of using std::pair
  • for obvious reasons, this is slower than my earlier code: it maintains the index by frequency during the insertion phase. This is unnecessary, but it makes for much more efficient extraction of the [800,1000] range:

    tally.get<byfreq>().range(800 <= _1, _1 <= 1000)

The multiset of frequencies is already ordered. So the actual speed/memory trade-off might tip in favour of this version, especially when documents are large and contain very few duplicated words (alas, this is a property known not to hold for the corpus text of the bible, lest someone translate it to neologorrhea).


