Given a file, find the ten most frequently occurring words as efficiently as possible

予麋鹿 2020-12-12 13:26

This is apparently an interview question (I found it in a collection of interview questions), but even if it's not, it's pretty cool.

We are told to do this efficiently.

15 Answers
  •  时光取名叫无心
    2020-12-12 13:57

    Not the most efficient CPU-wise, and UGLY, but it took only 2 minutes to bang out:

    perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a}} keys %h) {print "$h{$w}\t$w"}}' file | head

    Loop over each line with -n
    Autosplit each line into the @F array of words with -a
    Increment hash %h for each word $_
    Once the END of input is reached,
    sort the hash keys by descending frequency,
    and print the frequency $h{$w} and the word $w
    Pipe through head to stop after 10 lines
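    The same algorithm translates almost line for line into awk. This is a hypothetical equivalent, not from the thread; it assumes words are whitespace-separated, just as Perl's -a autosplit does:

```shell
# Count every whitespace-separated word into the h array, then emit
# "count<TAB>word" and let sort/head pick the ten most frequent.
awk '{ for (i = 1; i <= NF; i++) h[$i]++ }
     END { for (w in h) print h[w] "\t" w }' file |
  sort -rn | head
```

    Like the Perl version, it makes a single O(n) counting pass over the input and only sorts the distinct words, not every word occurrence.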

    Using the text of this web page as input:

    121     the
    77      a
    48      in
    46      to
    44      of
    39      at
    33      is
    30      vote
    29      and
    25      you
    

    I benchmarked this solution against the top-rated shell solution (Ben Jackson's) on a 3.3 GB text file containing 580,000,000 words.
    Perl 5.22 completed in 171 seconds, while the shell solution took 474 seconds.
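    The thread does not reproduce the competing shell solution, but the classic pipeline for this task probably looks something along these lines (a sketch, not the benchmarked code):

```shell
# Split the input into one word per line, group identical words,
# count each group, sort by count, and keep the top ten.
tr -s '[:space:]' '\n' < file |  # one word per line
  sort |                         # bring identical words together
  uniq -c |                      # prefix each word with its count
  sort -rn |                     # highest count first
  head                           # top ten
```

    This would explain the timing gap: the first sort must order all 580 million word occurrences, an O(n log n) external sort, whereas the Perl hash counts them in a single O(n) pass and only sorts the distinct words.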
