This is apparently an interview question (I found it in a collection of interview questions), but even if it's not, it's pretty cool.
We are told to do this efficiently.
Not the most efficient CPU-wise, and UGLY, but it took only 2 minutes to bang out:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a}} keys %h) {print "$h{$w}\t$w"}}' file | head
- Loop over each line with -n
- Split each line into @F words with -a
- Each $_ word increments hash %h
- Once the END of file has been reached, sort the hash by the frequency
- Print the frequency $h{$w} and the word $w
- Use bash head to stop at 10 lines
Using the text of this web page as input:
121 the
77 a
48 in
46 to
44 of
39 at
33 is
30 vote
29 and
25 you
I benchmarked this solution against the top-rated shell solution (ben jackson) on a 3.3 GB text file containing 580,000,000 words.
Perl 5.22 completed in 171 seconds, while the shell solution took 474 seconds.
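For reference, the shell solution in question is a classic sort/uniq pipeline along these lines (my reconstruction, not necessarily the exact command from that answer; the sample file is for illustration):

```shell
# Build a small sample input (stand-in for the real file).
printf 'the cat sat on the mat\nthe cat ran\n' > sample.txt

# Break the input into one word per line, group identical words with
# sort, count each group with uniq -c, then rank counts descending;
# head keeps the top 10.
tr -s '[:space:]' '\n' < sample.txt | sort | uniq -c | sort -rn | head
```

The two sort passes (one to group words, one to rank counts) are why this approach does more work than the single hash pass in the Perl version.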