Question:
Hi, I am writing a program that counts the number of times each word occurs in a file. It should then print a list of the words with counts between 800 and 1000, sorted by count. I am stuck on keeping a counter that checks whether the current word matches the next one until a new word appears. In main I open the file, read it word by word, and call sort in the while loop to sort the vector. Then, in the for loop, I go through all the words and do count++ if the first word equals the second. I don't think that is how you keep a counter.
Here is the code:
#include #include #include #include #include #include using namespace std; vector lines; vector second; set words; multiset multiwords; void readLines(const char *filename) { string line; ifstream infile; infile.open(filename); if (!infile) { cerr &v, int size, int value) { int from = 0; int to = size - 1; while (from words; string x; ifstream inFile; int count = 0; inFile.open("bible.txt"); if (!inFile) { cout > x){ sort(words.begin(), words.end()); } for(int i = 0;i
Answer 1:
One solution could be this: define a letter_only locale facet so that the stream ignores punctuation and reads only valid "english" letters from the input. That way, the stream treats the words "ways", "ways." and "ways!" as the same word "ways", because punctuation such as "." and "!" is skipped like whitespace.
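#include <algorithm>
#include <locale>
#include <vector>

struct letter_only: std::ctype<char>
{
    letter_only(): std::ctype<char>(get_table()) {}

    static std::ctype_base::mask const* get_table()
    {
        // everything is whitespace by default, so punctuation and digits separate words
        static std::vector<std::ctype_base::mask> rc(std::ctype<char>::table_size, std::ctype_base::space);
        // mark A-Z and a-z separately; one fill from 'A' to 'z' would also mark [ \ ] ^ _ `
        std::fill(&rc['A'], &rc['Z'+1], std::ctype_base::alpha);
        std::fill(&rc['a'], &rc['z'+1], std::ctype_base::alpha);
        return &rc[0];
    }
};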
And then use it as:
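#include <algorithm>
#include <cctype>
#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <map>
#include <string>

int main()
{
    std::map<std::string, int> wordCount;
    std::ifstream input;
    // enable reading english letters only!
    input.imbue(std::locale(std::locale(), new letter_only()));
    input.open("filename.txt");
    std::string word;
    while (input >> word)
    {
        // start from an empty string for every word; otherwise back_inserter keeps appending
        std::string uppercase_word;
        std::transform(word.begin(), word.end(), std::back_inserter(uppercase_word), (int(&)(int))std::toupper); // the cast is needed!
        ++wordCount[uppercase_word];
    }
    for (std::map<std::string, int>::iterator it = wordCount.begin(); it != wordCount.end(); ++it)
    {
        std::cout << it->first << " " << it->second << std::endl;
    }
}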
Answer 2:
Heh. I know that bluntly showing a solution does not really help you. However, I glanced through your code and saw many unused and confused bits. Here's what I'd do:
#include #include #include #include #include #include
Authorized Version:
993 because
981 men
967 day
954 over
953 God,
910 she
895 among
894 these
886 did
873 put
868 thine
864 hand
853 great
847 sons
846 brought
845 down
819 you,
811 so
Vulgata:
995 tuum
993 filius
993 nec
966 suum
949 meum
930 sum
919 suis
907 contra
902 dicens
879 tui
872 quid
865 Domine
863 Hierusalem
859 suam
839 suo
835 ipse
825 omnis
811 erant
802 se
Performance is about 1.12s for both files, but only 0.355s after replacing map with boost::unordered_map as a drop-in.
Answer 3:
A more efficient approach uses a single map of occurrences: read the words one by one and increment the counter in m[ word ]. After all words have been accounted for, iterate over the map and add the words in the given range to a multimap. Finally, dump the contents of the multimap, which will be ordered by number of occurrences and then alphabetically.
Answer 4:
Just for fun, I did a solution in C++0x style, using Boost MultiIndex.
This style would be quite clumsy without the auto keyword (type inference).
By maintaining the indexes by word and by frequency at all times, there is no need to remove, partition, nor sort the wordlist: it'll all be there.
To compile and run:
g++ --std=c++0x -O3 test.cpp -o test
curl ftp://ftp.funet.fi/pub/doc/bible/texts/english/av.tar.gz | tar xzO | sed 's/^[ 0-9:]\+//' > bible.txt
time ./test
#include #include #include #include #include #include #include #include using namespace std; struct entry { string word; size_t freq; void increment() { freq++; } }; struct byword {}; // TAG struct byfreq {}; // TAG int main() { using ::boost::lambda::_1; using namespace ::boost::multi_index; multi_index_container, ordered_unique , member >, // alphabetically ordered_non_unique, member > // by frequency > > tally; ifstream inFile("bible.txt"); string s; while (inFile>>s) { auto& lookup = tally.get(); auto it = lookup.find(s); if (lookup.end() != it) lookup.modify(it, boost::bind(&entry::increment, _1)); else lookup.insert({s, 1}); } BOOST_FOREACH(auto e, tally.get().range(800
Note how
- it became just slightly more convenient to define a custom entry type instead of using std::pair
- (for obvious reasons) this is slower than my earlier code: it maintains the index by frequency during the insertion phase. This is unnecessary, but it makes for much more efficient extraction of the [800,1000] range:
tally.get().range(800
The multi-set of frequencies is already ordered. So the actual speed/memory trade-off might tip in favour of this version, especially when documents are large and contain very few duplicated words (alas, this property is known not to hold for the corpus text of the bible, lest someone translate it to neologorrhea).
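Why an already-ordered frequency index makes the range extraction cheap can be illustrated without Boost. The sketch below is not the MultiIndex code from this answer: it assumes a plain std::multimap keyed by frequency as a stand-in for the byfreq index, and extract_range is a hypothetical helper name.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, std::string> freq_word;

// Hypothetical helper: by_freq is ordered by frequency, so the entries with
// a frequency in [lo, hi] form one contiguous run. Two logarithmic lookups
// find its ends, and no sorting or partitioning is needed.
std::vector<freq_word> extract_range(
        const std::multimap<std::size_t, std::string>& by_freq,
        std::size_t lo, std::size_t hi) {
    if (lo > hi) {
        return std::vector<freq_word>();  // empty interval
    }
    auto first = by_freq.lower_bound(lo);  // first entry with frequency >= lo
    auto last = by_freq.upper_bound(hi);   // first entry with frequency > hi
    return std::vector<freq_word>(first, last);
}
```

Unlike the MultiIndex container, a std::multimap keyed by frequency cannot be updated in place while counting, so this stand-in only covers the extraction step.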