Question
I am aware that this has been asked on the forum a couple of times, but I did not find any 'TAGGED' answer which could be considered the most appropriate solution - so asking again:
We are given a very large text from a book, all of which cannot fit into memory. We are required to find the top 10 most frequently occurring words in the text. What would be the optimal (time and space) way to do this?
My thought:
Divide the file into k-sized chunks (such that each chunk can be stored in memory). Now, perform an external sort on each of the chunks. Once we have the N/k sorted files on disk (assuming N is the total size of the text from the book) - I am not sure how I should continue so that I can obtain the top 10 elements from the sorted files.
Also, if there is a different line of thought, please suggest.
Answer 1:
Edit: There are problems with this algorithm, specifically that recursively merging lists makes this a polynomial-runtime algorithm. But I'll leave it here as an example of a flawed algorithm.
You cannot discard any words from your chunks because there may be one word that exists 100 times in only one chunk, and another that exists one time in each of 100 different chunks.
But you can still work with chunks, in a way similar to a MapReduce algorithm. You map each chunk to a word list (including count), then you reduce by recursively merging the word lists into one.
In the map step, map each word to a count for each chunk. Sort alphabetically, not by count, and store the lists to disk. Now you can merge the lists pairwise linearly without keeping more than two words in memory:
1. Let A and B be the list files to merge, and R the result file
2. Read one line with word+count from A, call the word `a`
3. Read one line with word+count from B, call the word `b`
4. Compare the words alphabetically:
   - If `a` = `b`:
     - Sum their counts
     - Write the word and new count to R
     - Go to 2
   - If `a` > `b`:
     - Write `b` including its count to R
     - Read a new line `b` from B
     - Go to 4
   - If `a` < `b`:
     - Write `a` including its count to R
     - Read a new line `a` from A
     - Go to 4
Continue to do this pairwise merge until all files are merged into a single list. Then you can scan the result list once and keep the ten most frequent words.
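As an illustration (not part of the original answer), here is a minimal Python sketch of the map step and one pairwise merge, assuming each chunk's list is stored as tab-separated `word<TAB>count` lines sorted alphabetically; the file layout and function names are my own assumptions:

```python
import heapq
from collections import Counter

def map_chunk(words, out_path):
    """Map step: count the words of one in-memory chunk and write them to disk
    as 'word<TAB>count' lines, sorted alphabetically (not by count)."""
    counts = Counter(words)
    with open(out_path, "w") as f:
        for word in sorted(counts):
            f.write(f"{word}\t{counts[word]}\n")

def merge_pair(path_a, path_b, path_r):
    """Reduce step: merge two alphabetically sorted list files into one,
    keeping no more than one line from each input in memory."""
    def read(f):
        line = f.readline()
        if not line:
            return None, 0
        word, count = line.rstrip("\n").split("\t")
        return word, int(count)

    with open(path_a) as fa, open(path_b) as fb, open(path_r, "w") as fr:
        a, ca = read(fa)
        b, cb = read(fb)
        while a is not None and b is not None:
            if a == b:                      # same word: sum the counts
                fr.write(f"{a}\t{ca + cb}\n")
                a, ca = read(fa)
                b, cb = read(fb)
            elif a > b:                     # b comes first alphabetically
                fr.write(f"{b}\t{cb}\n")
                b, cb = read(fb)
            else:                           # a comes first alphabetically
                fr.write(f"{a}\t{ca}\n")
                a, ca = read(fa)
        while a is not None:                # drain whatever is left of A
            fr.write(f"{a}\t{ca}\n")
            a, ca = read(fa)
        while b is not None:                # drain whatever is left of B
            fr.write(f"{b}\t{cb}\n")
            b, cb = read(fb)

def top_ten(merged_path):
    """Final pass: scan the fully merged list once, keeping the 10 largest counts."""
    with open(merged_path) as f:
        pairs = ((int(c), w) for w, c in (line.rstrip("\n").split("\t") for line in f))
        return heapq.nlargest(10, pairs)
```

Each call to `merge_pair` holds only the current line from each input in memory, matching the steps above; repeating it pairwise until one file remains leaves a single list, and `top_ten` is the one final scan.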
Answer 2:
This is a classic problem in the field of streaming algorithms. There's clearly no way to do this that works in certain degenerate cases; you'll need to settle for a bunch of elements that are approximately (in a well-defined sense) the top k words in your stream. I don't know any classic references, but a quick Google brought me to this. It seems to have a nice survey on various techniques for doing streaming top-K. You might check the references therein for other ideas.
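For concreteness, one well-known technique in this family (my own choice of example, not necessarily one from the linked survey) is the Misra-Gries summary: it keeps at most k-1 counters, and any word occurring more than N/k times in a stream of N words is guaranteed to survive, so a second exact-counting pass over those survivors recovers the heavy hitters. A minimal sketch:

```python
def misra_gries(stream, k):
    """Misra-Gries summary: keep at most k-1 candidate words with approximate counts.
    Any word whose true frequency exceeds N/k is guaranteed to remain in the dict;
    a second pass over the data gives exact counts for the survivors."""
    counters = {}
    for word in stream:
        if word in counters:
            counters[word] += 1
        elif len(counters) < k - 1:
            counters[word] = 1
        else:
            # decrement every counter; drop the ones that reach zero
            for w in list(counters):
                counters[w] -= 1
                if counters[w] == 0:
                    del counters[w]
    return counters
```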
One other idea (and one that doesn't fly in the streaming model) is just to randomly sample as many words as will fit into memory, sort-and-uniq them, and do another pass over the file counting hits of the words in your sample. Then you can easily find the top k.
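A rough sketch of that sampling idea, assuming the text is a plain file of whitespace-separated words (the reservoir-sampling choice and helper names are my own assumptions):

```python
import heapq
import random

def sample_words(path, sample_size):
    """First pass: reservoir-sample up to sample_size words from the file."""
    sample = []
    i = 0
    with open(path) as f:
        for line in f:
            for word in line.split():
                if i < sample_size:
                    sample.append(word)
                else:
                    j = random.randrange(i + 1)
                    if j < sample_size:
                        sample[j] = word
                i += 1
    return set(sample)  # "sort-and-uniq": only the distinct sampled words matter

def top_k_from_sample(path, candidates, k=10):
    """Second pass: count exact occurrences of the sampled words only."""
    counts = dict.fromkeys(candidates, 0)
    with open(path) as f:
        for line in f:
            for word in line.split():
                if word in counts:
                    counts[word] += 1
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
```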
Source: https://stackoverflow.com/questions/17541983/find-the-10-most-frequently-used-words-in-a-large-book