This is apparently an interview question (I found it in a collection of interview questions), but even if it's not, it's pretty cool.
We are told to do this efficiently.
You could make a time/space tradeoff and go O(n^2) for time and O(1) for (memory) space: for each word, count its occurrences with a linear pass over the data. If the count is above the lowest of the top 10 found so far, keep the word and its count; otherwise ignore it. Since the top-10 list has constant size, the extra space stays O(1).
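A minimal sketch of this tradeoff (the function name and `k` parameter are illustrative, not from the question):

```python
def top_k_constant_space(words, k=10):
    # O(n^2) time, O(1) extra space: re-scan the whole list for each
    # distinct word, keeping only a constant-size top-k list.
    top = []  # at most k (count, word) pairs
    for i, word in enumerate(words):
        # Count each distinct word only at its first occurrence.
        if any(words[j] == word for j in range(i)):
            continue
        count = sum(1 for w in words if w == word)
        top.append((count, word))
        top.sort(reverse=True)
        del top[k:]  # discard everything below the top k
    return top
```

Each of the n words triggers at most one full O(n) counting scan, and the only storage beyond the input is the k-entry list.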
Alternatively, cycle through the list of words in a single O(n) pass, storing each in a dictionary (using Python) with the number of times it occurs as the value, then pick the 10 largest counts.
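In Python the histogram pass is essentially free via `collections.Counter`; a possible sketch (the function name is mine):

```python
from collections import Counter
import heapq

def top_k_words(text, k=10):
    # One linear pass builds the histogram: O(n) time, O(d) space
    # for d distinct words.
    counts = Counter(text.split())
    # heapq.nlargest is O(d log k) -- cheaper than fully sorting
    # the histogram when k is small.
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
```

`Counter.most_common(k)` would do the same job; `nlargest` just makes the selection step explicit.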
If the word list will not fit in memory, you can split the file until it does. Generate a histogram of each part (either sequentially or in parallel) and merge the results. The details may be a bit fiddly if you want guaranteed correctness for all inputs (merging partial top-10 lists alone is not safe; you need to merge the full histograms), but this should not compromise the O(n) effort, or the O(n/k) time for k parallel tasks.
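The merge step can be sketched like this, assuming the file has already been split into chunks that each fit in memory (chunk histograms could also be built in parallel, e.g. with `multiprocessing`; this sequential version just shows the merge):

```python
from collections import Counter

def top_k_chunked(chunks, k=10):
    # Merge full per-chunk histograms, not per-chunk top-k lists:
    # a word just below the cutoff in every chunk can still be in
    # the global top k.
    total = Counter()
    for chunk in chunks:
        total.update(Counter(chunk.split()))
    return total.most_common(k)
```

The merged `Counter` still needs room for all distinct words; if even that is too large, the usual next step is to partition by hash of the word so each partition's histogram fits.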