This is apparently an interview question (I found it in a collection of interview questions), but even if it's not, it's a pretty cool problem.
We are told to do this efficiently.
I think this is a typical application of counting: since the sum of each word's occurrence count equals the total number of words, a hash table of counts does the job in time proportional to the number of words.
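A minimal sketch of this idea in Python (collections.Counter is exactly such a hash table of counts, and most_common does the top-k selection with a heap):

```python
from collections import Counter

def top_words(path, k=10):
    """Count word frequencies with a hash table and return the k most common."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    # most_common(k) selects the top k in O(n log k) via a heap,
    # so the whole pass stays roughly linear in the number of words.
    return counts.most_common(k)
```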
Optimizing for my own time (assuming one word per line; otherwise break the input into one word per line first, e.g. with tr):
sort file | uniq -c | sort -nr | head -10
Possibly followed by awk '{print $2}' to eliminate the counts.
Not the most efficient CPU-wise, and UGLY, but it took only 2 minutes to bang out:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a}} keys %h) {print "$h{$w}\t$w"}}' file | head
- Loop over each line with -n
- Split each line into words in @F with -a
- Each word $_ increments the hash %h
- Once the END of the file has been reached, sort the hash keys by frequency
- Print the frequency $h{$w} and the word $w
- Use head to stop at 10 lines
Using the text of this web page as input:
121 the
77 a
48 in
46 to
44 of
39 at
33 is
30 vote
29 and
25 you
I benchmarked this solution vs the top-rated shell solution (Ben Jackson's) on a 3.3 GB text file with 580,000,000 words.
Perl 5.22 completed in 171 seconds, while the shell solution completed in 474 seconds.
Step 1: If the file is too large to sort in memory, split it into chunks that can each be sorted in memory.
Step 2: For each sorted chunk, compute sorted (word, occurrence_count) pairs; at this point you can discard the chunks, because you only need the sorted pairs.
Step 3: Merge the sorted pairs across chunks, summing counts for equal words, and always keep only the top ten appearances.
Example:
Step 1:
a b a ab abb a a b b c c ab ab
split into :
chunk 1: a b a ab
chunk 2: abb a a b b
chunk 3: c c ab ab
Step 2:
chunk 1: a2, b1, ab1
chunk 2: a2, b2, abb1
chunk 3: c2, ab2
Step 3 (merge the chunks and keep the top ten appearances):
a4 b3 ab3 c2 abb1
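A sketch of the three steps in Python. For simplicity the chunks here are slices of an in-memory word list; in a real external-memory setting each chunk's pairs would live in a file, and heapq.merge would stream them:

```python
import heapq
from itertools import groupby
from collections import Counter

def top_k_chunked(words, chunk_size, k=10):
    """Count words per chunk, then merge the sorted (word, count) pairs."""
    # Steps 1 + 2: count each chunk and emit its pairs sorted by word.
    chunk_pairs = []
    for i in range(0, len(words), chunk_size):
        counts = Counter(words[i:i + chunk_size])
        chunk_pairs.append(sorted(counts.items()))
    # Step 3: merge the sorted streams, sum counts for equal words,
    # and keep only the k most frequent.
    merged = heapq.merge(*chunk_pairs)
    totals = ((word, sum(c for _, c in grp))
              for word, grp in groupby(merged, key=lambda p: p[0]))
    return heapq.nlargest(k, totals, key=lambda p: p[1])

words = "a b a ab abb a a b b c c ab ab".split()
print(top_k_chunked(words, 4, 5))
# [('a', 4), ('ab', 3), ('b', 3), ('c', 2), ('abb', 1)]
```

Only the (word, count) pair streams and a k-sized heap need to be live at once, which is the whole point of the chunked approach.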
Depending on the size of the input data, it may or may not be a good idea to keep a hash map. Say, for instance, our hash map is too big to fit into main memory. That can cause a very high number of memory transfers, because most hash map implementations need random access and are not cache-friendly.
In such cases sorting the input data would be a better solution.
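A sketch of the sort-based variant in Python: after sorting, equal words are adjacent, so one sequential pass over runs suffices, with a k-sized min-heap to keep the winners (the sorted() call stands in for an external sort on large inputs):

```python
import heapq

def top_k_by_sorting(words, k=10):
    """Sort the words, then count equal runs in one sequential pass,
    keeping only the k largest counts in a min-heap."""
    words = sorted(words)      # sequential access only from here on
    heap = []                  # min-heap of (count, word), size <= k
    run_word, run_count = words[0], 0
    for w in words + [None]:   # None sentinel flushes the final run
        if w == run_word:
            run_count += 1
        else:
            heapq.heappush(heap, (run_count, run_word))
            if len(heap) > k:
                heapq.heappop(heap)  # evict the current smallest count
            run_word, run_count = w, 1
    return sorted(heap, reverse=True)
```

Memory use is O(k) beyond the sort itself, and every access pattern is sequential, which is exactly what the cache argument above asks for.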
I think the trie data structure is a good choice.
In the trie, each node records a count: the frequency of the word spelled by the characters on the path from the root to that node.
The time complexity to build the trie is O(Ln) ~ O(n), where L is the number of characters in the longest word, which we can treat as a constant. To find the top 10 words, we can traverse the trie, which also costs O(n). So it takes O(n) overall to solve this problem.
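A minimal trie sketch in Python (dict-based children for brevity; a tuned version might use arrays indexed by character):

```python
import heapq

class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # frequency of the word ending at this node

def top_k_trie(words, k=10):
    """Insert every word into a trie, then traverse it to collect counts."""
    root = TrieNode()
    for word in words:          # O(L) per word => O(Ln) to build
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    # Depth-first traversal, rebuilding each word from its path.
    pairs, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            pairs.append((prefix, node.count))
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return heapq.nlargest(k, pairs, key=lambda p: p[1])
```

Compared with a flat hash table, the trie also deduplicates shared prefixes, which can save memory when the vocabulary has many words with common stems.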