Given a file, find the ten most frequently occurring words as efficiently as possible

予麋鹿 2020-12-12 13:26

This is apparently an interview question (I found it in a collection of interview questions), but even if it's not, it's pretty cool.

We are told to do this efficien

15 answers
  •  无人及你
    2020-12-12 14:11

    A complete solution would be something like this:

    1. Do an external sort: O(N log N).
    2. Count the word frequencies in the sorted file: O(N).
    3. (An alternative would be to use a Trie, as @Summer_More_More_Tea suggests, to count the frequencies, if you can afford that amount of memory.) O(k*N) // for the first two steps
    4. Use a min-heap:
      • Put the first n elements on the heap.
      • For every remaining word, add it to the heap and delete the new minimum of the heap.
      • In the end the heap will contain the n most common words: O(|words|*log(n)).
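The min-heap step above could be sketched roughly as follows. This is only an illustration: it swaps the external sort for in-memory counting with `collections.Counter`, which assumes the vocabulary fits in memory, and the names `iter_words` and `top_n_words` as well as the tokenizing regex are made up for the example.

```python
import heapq
import re
from collections import Counter


def iter_words(lines):
    """Yield lowercase words from an iterable of text lines (assumed tokenizer)."""
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            yield word


def top_n_words(lines, n=10):
    """Return the n most frequent words as (word, count), most frequent first."""
    # Assumption: counting in memory stands in for steps 1-2 (external sort + scan).
    counts = Counter(iter_words(lines))

    heap = []  # min-heap of (count, word); heap[0] is the weakest candidate
    for word, count in counts.items():
        if len(heap) < n:
            # Fill the heap with the first n distinct words.
            heapq.heappush(heap, (count, word))
        else:
            # Add the new word, then drop whatever is now the minimum.
            heapq.heappushpop(heap, (count, word))

    # Ties on count are broken by the word itself because tuples compare in order.
    return [(word, count) for count, word in sorted(heap, reverse=True)]
```

Since an open file is an iterable of lines, something like `top_n_words(open(path), 10)` would stream the file, where `path` is whatever file you want to analyse. The heap never holds more than n entries, so this step costs O(|words|*log(n)) over the distinct words.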

    With the Trie the cost would be O(k*N), where k is presumably the maximum word length, because the total number of words is generally bigger than the size of the vocabulary. Finally, since k is small for most Western languages, you could assume linear complexity.
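For the Trie variant, a minimal sketch of the counting step might look like this. It assumes k means the length of a word in characters, so each insertion touches at most k nodes; `TrieNode`, `trie_insert` and `trie_items` are hypothetical names chosen for the example, not anything from the original answer.

```python
class TrieNode:
    """One node per character; count > 0 marks the end of a stored word."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # maps a character to the child TrieNode
        self.count = 0      # how many times the word ending here was inserted


def trie_insert(root, word):
    """Insert one occurrence of word: O(k) for a word of length k."""
    node = root
    for ch in word:
        child = node.children.get(ch)
        if child is None:
            child = node.children[ch] = TrieNode()
        node = child
    node.count += 1


def trie_items(root):
    """Yield (word, count) for every stored word using an iterative DFS."""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            yield prefix, node.count
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
```

Inserting all N words costs O(k*N) in total, and the pairs produced by `trie_items` can then be fed to a size-n min-heap exactly as in step 4.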
