This is apparently an interview question (I found it in a collection of interview questions), but even if it's not, it's a pretty cool problem.
We are told to do this efficiently.
I think this is a typical application of counting: since the sum of each word's occurrence count equals the total number of words, a hash table of counts does the job in time proportional to the number of words.
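A minimal sketch of this idea in Python (collections.Counter is exactly such a hash table of counts, and most_common does the top-k selection with a heap):

```python
from collections import Counter

def top_words(path, k=10):
    """Count word frequencies with a hash table and return the k most common."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    # most_common(k) selects the top k in O(n log k) via a heap,
    # so the whole pass stays roughly linear in the number of words.
    return counts.most_common(k)
```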
Optimizing for my own time (assuming one word per line; otherwise break the input into one word per line first, e.g. with tr):
sort file | uniq -c | sort -nr | head -10
Possibly followed by awk '{print $2}' to eliminate the counts.
Not the most efficient CPU-wise, and UGLY, but it took only 2 minutes to bang out:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a}} keys %h) {print "$h{$w}\t$w"}}' file | head
- Loop over each line with -n
- Split each line into words in @F with -a
- Each word $_ increments the hash %h
- Once the END of the file has been reached, sort the hash keys by frequency
- Print the frequency $h{$w} and the word $w
- Use head to stop at 10 lines
Using the text of this web page as input:
121 the
77 a
48 in
46 to
44 of
39 at
33 is
30 vote
29 and
25 you
I benchmarked this solution vs the top-rated shell solution (Ben Jackson's) on a 3.3 GB text file with 580,000,000 words.
Perl 5.22 completed in 171 seconds, while the shell solution completed in 474 seconds.
Step 1: If the file is too large to sort in memory, split it into chunks that can each be sorted in memory.
Step 2: For each sorted chunk, compute sorted (word, occurrence_count) pairs; at this point you can discard the chunks, because you only need the sorted pairs.
Step 3: Merge the sorted pairs across chunks, summing counts for equal words, and always keep only the top ten appearances.
Example:
Step 1:
a b a ab abb a a b b c c ab ab
split into :
chunk 1: a b a ab
chunk 2: abb a a b b
chunk 3: c c ab ab
Step 2:
chunk 1: a2, b1, ab1
chunk 2: a2, b2, abb1
chunk 3: c2, ab2
Step 3 (merge the chunks and keep the top ten appearances):
a4 b3 ab3 c2 abb1
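A sketch of the three steps in Python. For simplicity the chunks here are slices of an in-memory word list; in a real external-memory setting each chunk's pairs would live in a file, and heapq.merge would stream them:

```python
import heapq
from itertools import groupby
from collections import Counter

def top_k_chunked(words, chunk_size, k=10):
    """Count words per chunk, then merge the sorted (word, count) pairs."""
    # Steps 1 + 2: count each chunk and emit its pairs sorted by word.
    chunk_pairs = []
    for i in range(0, len(words), chunk_size):
        counts = Counter(words[i:i + chunk_size])
        chunk_pairs.append(sorted(counts.items()))
    # Step 3: merge the sorted streams, sum counts for equal words,
    # and keep only the k most frequent.
    merged = heapq.merge(*chunk_pairs)
    totals = ((word, sum(c for _, c in grp))
              for word, grp in groupby(merged, key=lambda p: p[0]))
    return heapq.nlargest(k, totals, key=lambda p: p[1])

words = "a b a ab abb a a b b c c ab ab".split()
print(top_k_chunked(words, 4, 5))
# [('a', 4), ('ab', 3), ('b', 3), ('c', 2), ('abb', 1)]
```

Only the (word, count) pair streams and a k-sized heap need to be live at once, which is the whole point of the chunked approach.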
Depending on the size of the input data, it may or may not be a good idea to keep a hash map. Say, for instance, our hash map is too big to fit into main memory. That can cause a very high number of memory transfers, because most hash map implementations need random access and are not cache-friendly.
In such cases sorting the input data would be a better solution.
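A sketch of the sort-based variant in Python: after sorting, equal words are adjacent, so one sequential pass over runs suffices, with a k-sized min-heap to keep the winners (the sorted() call stands in for an external sort on large inputs):

```python
import heapq

def top_k_by_sorting(words, k=10):
    """Sort the words, then count equal runs in one sequential pass,
    keeping only the k largest counts in a min-heap."""
    words = sorted(words)      # sequential access only from here on
    heap = []                  # min-heap of (count, word), size <= k
    run_word, run_count = words[0], 0
    for w in words + [None]:   # None sentinel flushes the final run
        if w == run_word:
            run_count += 1
        else:
            heapq.heappush(heap, (run_count, run_word))
            if len(heap) > k:
                heapq.heappop(heap)  # evict the current smallest count
            run_word, run_count = w, 1
    return sorted(heap, reverse=True)
```

Memory use is O(k) beyond the sort itself, and every access pattern is sequential, which is exactly what the cache argument above asks for.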
I think the trie data structure is a good choice.
In the trie, each node records a count: the frequency of the word spelled by the characters on the path from the root to that node.
The time complexity to build the trie is O(Ln) ~ O(n), where L is the number of characters in the longest word, which we can treat as a constant. To find the top 10 words, we can traverse the trie, which also costs O(n). So it takes O(n) overall to solve this problem.
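A minimal trie sketch in Python (dict-based children for brevity; a tuned version might use arrays indexed by character):

```python
import heapq

class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # frequency of the word ending at this node

def top_k_trie(words, k=10):
    """Insert every word into a trie, then traverse it to collect counts."""
    root = TrieNode()
    for word in words:          # O(L) per word => O(Ln) to build
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1
    # Depth-first traversal, rebuilding each word from its path.
    pairs, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.count:
            pairs.append((prefix, node.count))
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return heapq.nlargest(k, pairs, key=lambda p: p[1])
```

Compared with a flat hash table, the trie also deduplicates shared prefixes, which can save memory when the vocabulary has many words with common stems.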