Given a file, find the ten most frequently occurring words as efficiently as possible

予麋鹿 2020-12-12 13:26

This is apparently an interview question (found it in a collection of interview questions), but even if it's not, it's pretty cool.

We are told to do this efficiently.

15 answers
  • 2020-12-12 13:51

    I think this is a typical application of counting sort: since the sum of the occurrence counts over all words equals the total number of words, no frequency can exceed that total, so a counting sort over the frequencies is linear. A hash table to count the words, followed by a counting sort on the counts, should do the job in time proportional to the number of words (a sketch follows below).
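
    Not part of the original answer, but a minimal Python sketch of this idea, assuming whitespace-separated words (function and variable names are mine): one hash-table pass counts the words, and a bucket-per-frequency pass plays the role of the counting sort.

    from collections import Counter

    def top_ten_words(path, k=10):
        # Hash-table pass: count every word in one scan of the file.
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts.update(line.split())
        if not counts:
            return []

        # Counting-sort pass: bucket words by frequency.  No frequency can
        # exceed the total number of words, so building and scanning the
        # buckets is linear in the word count.
        max_freq = max(counts.values())
        buckets = [[] for _ in range(max_freq + 1)]
        for word, freq in counts.items():
            buckets[freq].append(word)

        # Walk the buckets from the highest frequency down, stop at k words.
        top = []
        for freq in range(max_freq, 0, -1):
            for word in buckets[freq]:
                top.append((word, freq))
                if len(top) == k:
                    return top
        return top

    Both passes touch each word a bounded number of times, which matches the claim of time proportional to the number of words.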

  • 2020-12-12 13:56

    Optimizing for my own time:

    sort file | uniq -c | sort -nr | head -10
    

    Possibly followed by awk '{print $2}' to eliminate the counts.

  • 2020-12-12 13:57

    Not the most efficient CPU-wise, and UGLY, but it took only 2 minutes to bang out:

    perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a}} keys %h) {print "$h{$w}\t$w"}}' file | head

    Loop over each line with -n
    Split each line into @F words with -a
    Each word in @F increments its count in hash %h
    Once the END of the file has been reached,
    sort the hash keys by descending frequency
    Print the frequency $h{$w} and the word $w
    Pipe through head to stop at 10 lines

    Using the text of this web page as input:

    121     the
    77      a
    48      in
    46      to
    44      of
    39      at
    33      is
    30      vote
    29      and
    25      you
    

    I benchmarked this solution vs the top-rated shell solution (ben jackson) on a 3.3GB text file with 580,000,000 words.
    Perl 5.22 completed in 171 seconds, while the shell solution completed in 474 seconds.

  • 2020-12-12 13:58

    Step 1: If the file is very large and can't be sorted in memory, split it into chunks that can each be sorted in memory.

    Step 2: For each sorted chunk, compute sorted (word, occurrence_count) pairs; at this point you can discard the chunks, because you only need the sorted pairs.

    Step 3: Merge the pairs from all chunks, summing the counts for each word, and always keep only the top ten appearances (a sketch of this merge follows the example below).

    Example:

    Step 1:

    a b a ab abb a a b b c c ab ab

    split into :

    chunk 1: a b a ab
    chunk 2: abb a a b b
    chunk 3: c c ab ab

    Step 2:

    chunk 1: a2, b1, ab1
    chunk 2: a2, b2, abb1
    chunk 3: c2, ab2

    Step 3 (merge the chunks and keep the top ten appearances):

    a4 b3 ab3 c2 abb1
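
    Not part of the original answer, but a minimal Python sketch of steps 1-3, with an assumed chunk size and my own function names; a real external version would write each chunk's pairs to disk instead of keeping them in memory.

    import heapq
    from collections import Counter
    from itertools import islice

    CHUNK_WORDS = 1_000_000  # assumed chunk size; tune to available memory

    def words(path):
        with open(path) as f:
            for line in f:
                yield from line.split()

    def chunk_pairs(path):
        # Steps 1-2: count each chunk on its own and emit its (word, count)
        # pairs sorted by word.
        it = words(path)
        while True:
            chunk = list(islice(it, CHUNK_WORDS))
            if not chunk:
                break
            yield sorted(Counter(chunk).items())

    def top_ten(path, k=10):
        # Step 3: merge the sorted pair streams; equal words come out
        # adjacent, so only one running total is live at a time, and a
        # size-k min-heap keeps the current best words.
        best = []                              # heap of (count, word)
        current_word, current_count = None, 0

        def flush():
            if current_word is None:
                return
            heapq.heappush(best, (current_count, current_word))
            if len(best) > k:
                heapq.heappop(best)

        for word, count in heapq.merge(*chunk_pairs(path)):
            if word == current_word:
                current_count += count
            else:
                flush()
                current_word, current_count = word, count
        flush()
        return sorted(best, reverse=True)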

  • 2020-12-12 13:59

    Depending on the size of the input data, it may or may not be a good idea to keep a HashMap. If, for instance, the hash map is too big to fit into main memory, it can cause a very high number of memory transfers, because most hash-map implementations need random access and are not cache-friendly.

    In such cases, sorting the input data first is the better solution (see the sketch below).
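
    A hypothetical illustration of the sort-first alternative (not from the answer), assuming the words have already been sorted externally into a file with one word per line: a single sequential pass is then enough, with no random access at all.

    import heapq
    from itertools import groupby

    def top_ten_from_sorted(path, k=10):
        # Single pass over a file of already externally sorted words,
        # one word per line.  Identical words are adjacent, so groupby
        # collapses each run into one count, and nlargest keeps only the
        # k biggest (count, word) pairs -- no hash table needed.
        with open(path) as f:
            runs = ((sum(1 for _ in group), word)
                    for word, group in groupby(f, key=str.strip))
            return heapq.nlargest(k, runs)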

  • 2020-12-12 14:00

    I think a trie is a good choice of data structure.

    In the trie, each node stores a count: the frequency of the word spelled out by the characters on the path from the root to that node.

    The time complexity to build the trie is O(Ln) ~ O(n), where L is the number of characters in the longest word, which we can treat as a constant. To find the top 10 words, we can traverse the trie, which also costs O(n), since the number of nodes is at most proportional to the input size. So the whole problem is solved in O(n) (a sketch follows below).
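
    Not from the original answer, but a minimal Python sketch of the trie idea (class and function names are mine): one pass inserts every word and bumps the count at its final node, and a traversal then collects the counts, keeping the ten largest in a small heap.

    import heapq

    class TrieNode:
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.count = 0       # occurrences of the word ending at this node

    def top_ten_with_trie(path, k=10):
        root = TrieNode()

        # Pass 1: insert every word, O(L) per word.
        with open(path) as f:
            for line in f:
                for word in line.split():
                    node = root
                    for ch in word:
                        node = node.children.setdefault(ch, TrieNode())
                    node.count += 1

        # Pass 2: visit every node, keeping the k largest counts in a min-heap.
        best = []                    # heap of (count, word)
        stack = [(root, "")]
        while stack:
            node, prefix = stack.pop()
            if node.count:
                heapq.heappush(best, (node.count, prefix))
                if len(best) > k:
                    heapq.heappop(best)
            for ch, child in node.children.items():
                stack.append((child, prefix + ch))
        return sorted(best, reverse=True)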
