Given a file, find the ten most frequently occurring words as efficiently as possible

予麋鹿 2020-12-12 13:26

This is apparently an interview question (found it in a collection of interview questions), but even if it's not, it's pretty cool.

We are told to do this efficiently.

15 answers
  • 2020-12-12 14:00

    Let's say we assign a random prime number to each of the 26 letters. Then we scan the file. Whenever we find a word, we calculate its hash value (a formula based on the positions and the prime values of the letters making up the word). If we find this value in the hash table, then we know for sure that we are not encountering it for the first time, and we increment its count, maintaining an array of the top 10 alongside. But if we encounter a new hash, then we store the file pointer for that hash value and initialize its count to 1.
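
    A minimal sketch of this scheme in C#, assuming the hash combines each letter's prime with its position in the word; the prime table, the positional formula, and the "input.txt" path are illustrative assumptions, and hash collisions are ignored for simplicity:

        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Linq;

        class PrimeHashCount
        {
            // One arbitrary prime per letter 'a'..'z' (an illustrative choice).
            static readonly long[] Primes =
            {
                2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
                43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
            };

            // Positional hash: fold in each letter's prime, weighted by its position.
            static long Hash(string word)
            {
                long h = 0;
                for (int i = 0; i < word.Length; i++)
                    h = h * 131 + Primes[word[i] - 'a'] * (i + 1);
                return h;
            }

            static void Main()
            {
                var counts = new Dictionary<long, (string Word, int Count)>();
                var text = File.ReadAllText("input.txt").ToLowerInvariant();
                foreach (var raw in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                {
                    var word = new string(raw.Where(char.IsLetter).ToArray());
                    if (word.Length == 0) continue;
                    long h = Hash(word);
                    counts[h] = counts.TryGetValue(h, out var e)
                        ? (e.Word, e.Count + 1)   // seen before: increment its count
                        : (word, 1);              // first encounter: initialize to 1
                }

                // Report the ten most frequent words.
                foreach (var e in counts.Values.OrderByDescending(e => e.Count).Take(10))
                    Console.WriteLine($"{e.Word}: {e.Count}");
            }
        }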

  • 2020-12-12 14:07

    I did this in C# (a sample):

    int wordFrequency = 10;
    string words = "hello how r u u u u  u  u u  u  u u u  u u u u  u u u ? hello there u u u u ! great to c u there. hello .hello hello hello hello hello .hello hello hello hello hello hello ";

    // Group identical words, then keep those occurring at least wordFrequency times.
    var result = (from word in words.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                  group word by word into g
                  select new { Word = g.Key, Occurrence = g.Count() })
                 .Where(i => i.Occurrence >= wordFrequency)
                 .ToList();
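
    Note that this filters by a frequency threshold rather than returning the ten most frequent words. A variation for the actual top ten, reusing the same words string, might be:

        var topTen = words.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                          .GroupBy(w => w)
                          .OrderByDescending(g => g.Count())
                          .Take(10)
                          .Select(g => new { Word = g.Key, Occurrence = g.Count() })
                          .ToList();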
    
  • 2020-12-12 14:07

    A Radix tree or one of its variants will generally let you save storage space by collapsing common prefixes.
    Building it takes O(nk), where k is the maximum length of a string in the set.
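
    A minimal sketch of the counting idea with a plain (uncompressed) trie in C#; a radix tree would additionally collapse single-child chains, and the class names here are illustrative:

        using System.Collections.Generic;
        using System.Linq;

        class TrieNode
        {
            public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
            public int Count;       // occurrences of the word ending at this node
            public string Word;     // set only on word-ending nodes
        }

        class TrieCounter
        {
            readonly TrieNode root = new TrieNode();

            public void Add(string word)
            {
                var node = root;
                foreach (char c in word)
                {
                    if (!node.Children.TryGetValue(c, out var next))
                        node.Children[c] = next = new TrieNode();
                    node = next;
                }
                node.Word = word;
                node.Count++;       // one more occurrence of this word
            }

            // Collect every counted word and return the n most frequent.
            public IEnumerable<(string Word, int Count)> Top(int n)
            {
                var all = new List<(string, int)>();
                var stack = new Stack<TrieNode>();
                stack.Push(root);
                while (stack.Count > 0)
                {
                    var node = stack.Pop();
                    if (node.Count > 0) all.Add((node.Word, node.Count));
                    foreach (var child in node.Children.Values) stack.Push(child);
                }
                return all.OrderByDescending(e => e.Item2).Take(n);
            }
        }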

  • 2020-12-12 14:08
        // Brute-force word count: for each word, scan the whole list and
        // compare; this is O(m^2) in the number of words.
        // h holds the file's text (e.g. read with File.ReadAllText).
        string[] stringList = h.Split(" ".ToCharArray(),
                                      StringSplitOptions.RemoveEmptyEntries);
        int m = stringList.Length;
        var counts = new int[m];
        for (int j = 0; j < m; j++)
        {
            int c = 0;
            for (int k = 0; k < m; k++)
            {
                if (string.Compare(stringList[j], stringList[k]) == 0)
                {
                    c = c + 1;
                }
            }
            counts[j] = c;   // occurrences of stringList[j]
        }
    
  • 2020-12-12 14:11

    A complete solution would be something like this:

    1. Do an external sort: O(N log N)
    2. Count the word frequencies in the file: O(N)
    3. (An alternative to the first two steps would be using a Trie, as @Summer_More_More_Tea suggests, to count the frequencies, if you can afford that amount of memory: O(k*N) for the two steps combined)
    4. Use a min-heap (see the sketch below):
      • Put the first n elements on the heap
      • For every remaining word, add it to the heap and then delete the new minimum, keeping the heap at n elements
      • In the end the heap will contain the n most common words: O(|words|*log(n))

    With the Trie the cost would be O(k*N), because the total number of words is generally bigger than the size of the vocabulary. Finally, since k is small for most Western languages, you can assume linear complexity.
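
    A minimal sketch of the counting pass plus the min-heap (steps 2 and 4) in C#, assuming the counts fit in memory and using .NET 6's PriorityQueue; "input.txt" is a placeholder path:

        using System;
        using System.Collections.Generic;
        using System.IO;

        class TopTenWithHeap
        {
            static void Main()
            {
                // Step 2: count word frequencies in one pass over the file: O(N).
                var counts = new Dictionary<string, int>();
                var text = File.ReadAllText("input.txt").ToLowerInvariant();
                foreach (var word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                    counts[word] = counts.TryGetValue(word, out var c) ? c + 1 : 1;

                // Step 4: a min-heap of size 10 keyed on frequency: O(|vocabulary| * log 10).
                var heap = new PriorityQueue<string, int>();
                foreach (var (word, count) in counts)
                {
                    heap.Enqueue(word, count);
                    if (heap.Count > 10)
                        heap.Dequeue();   // evict the current minimum frequency
                }

                while (heap.Count > 0)
                {
                    var word = heap.Dequeue();   // least frequent of the top ten first
                    Console.WriteLine($"{word}: {counts[word]}");
                }
            }
        }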

  • 2020-12-12 14:11

    This page says building a hash and sorting the values is best, and I'm inclined to agree: http://www.allinterview.com/showanswers/56657.html

    Here is a Bash implementation that does something similar, I think: http://www.commandlinefu.com/commands/view/5994/computes-the-most-frequent-used-words-of-a-text-file
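
    A compact hash-then-sort sketch of that approach in C# (the "input.txt" path is a placeholder; requires System, System.Collections.Generic, System.IO, and System.Linq):

        // Build a hash of counts, then sort the values and take the top ten.
        var counts = new Dictionary<string, int>();
        foreach (var w in File.ReadAllText("input.txt").Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            counts[w] = counts.TryGetValue(w, out var c) ? c + 1 : 1;

        foreach (var kv in counts.OrderByDescending(kv => kv.Value).Take(10))
            Console.WriteLine($"{kv.Key}: {kv.Value}");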
