Given a file, find the ten most frequently occurring words as efficiently as possible

予麋鹿 2020-12-12 13:26

This is apparently an interview question (found it in a collection of interview questions), but even if it's not, it's pretty cool.

We are told to do this efficiently.

15 answers
  • 2020-12-12 14:00

    Let's say we assign a random prime number to each of the 26 letters. Then we scan the file. Whenever we find a word, we calculate its hash value (a formula based on the positions and the prime values of the letters making up the word). If we find this value in the hash table, then we know for sure that we are not encountering it for the first time, and we increment its count, maintaining an array of the top 10 alongside. But if we encounter a new hash, then we store the file pointer for that hash value and initialize its count to 1.
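
    A minimal sketch of this scheme in C#, assuming the hash combines each letter's prime with its position in the word; the prime table, the positional formula, and the "input.txt" path are illustrative assumptions, and hash collisions are ignored for simplicity:

        using System;
        using System.Collections.Generic;
        using System.IO;
        using System.Linq;

        class PrimeHashCount
        {
            // One arbitrary prime per letter 'a'..'z' (an illustrative choice).
            static readonly long[] Primes =
            {
                2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
                43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
            };

            // Positional hash: fold in each letter's prime, weighted by its position.
            static long Hash(string word)
            {
                long h = 0;
                for (int i = 0; i < word.Length; i++)
                    h = h * 131 + Primes[word[i] - 'a'] * (i + 1);
                return h;
            }

            static void Main()
            {
                var counts = new Dictionary<long, (string Word, int Count)>();
                var text = File.ReadAllText("input.txt").ToLowerInvariant();
                foreach (var raw in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                {
                    var word = new string(raw.Where(char.IsLetter).ToArray());
                    if (word.Length == 0) continue;
                    long h = Hash(word);
                    counts[h] = counts.TryGetValue(h, out var e)
                        ? (e.Word, e.Count + 1)   // seen before: increment its count
                        : (word, 1);              // first encounter: initialize to 1
                }

                // Report the ten most frequent words.
                foreach (var e in counts.Values.OrderByDescending(e => e.Count).Take(10))
                    Console.WriteLine($"{e.Word}: {e.Count}");
            }
        }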

  • 2020-12-12 14:07

    I did this in C# (a sample):

    int wordFrequency = 10;
    string words = "hello how r u u u u  u  u u  u  u u u  u u u u  u u u ? hello there u u u u ! great to c u there. hello .hello hello hello hello hello .hello hello hello hello hello hello ";

    // Group identical words, then keep those occurring at least wordFrequency times.
    var result = (from word in words.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                  group word by word into g
                  select new { Word = g.Key, Occurrence = g.Count() })
                 .Where(i => i.Occurrence >= wordFrequency)
                 .ToList();
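
    Note that this filters by a frequency threshold rather than returning the ten most frequent words. A variation for the actual top ten, reusing the same words string, might be:

        var topTen = words.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                          .GroupBy(w => w)
                          .OrderByDescending(g => g.Count())
                          .Take(10)
                          .Select(g => new { Word = g.Key, Occurrence = g.Count() })
                          .ToList();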
    
  • 2020-12-12 14:07

    A Radix tree or one of its variants will generally let you save storage space by collapsing common prefixes.
    Building it takes O(nk), where k is the maximum length of a string in the set.
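
    A minimal sketch of the counting idea with a plain (uncompressed) trie in C#; a radix tree would additionally collapse single-child chains, and the class names here are illustrative:

        using System.Collections.Generic;
        using System.Linq;

        class TrieNode
        {
            public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
            public int Count;       // occurrences of the word ending at this node
            public string Word;     // set only on word-ending nodes
        }

        class TrieCounter
        {
            readonly TrieNode root = new TrieNode();

            public void Add(string word)
            {
                var node = root;
                foreach (char c in word)
                {
                    if (!node.Children.TryGetValue(c, out var next))
                        node.Children[c] = next = new TrieNode();
                    node = next;
                }
                node.Word = word;
                node.Count++;       // one more occurrence of this word
            }

            // Collect every counted word and return the n most frequent.
            public IEnumerable<(string Word, int Count)> Top(int n)
            {
                var all = new List<(string, int)>();
                var stack = new Stack<TrieNode>();
                stack.Push(root);
                while (stack.Count > 0)
                {
                    var node = stack.Pop();
                    if (node.Count > 0) all.Add((node.Word, node.Count));
                    foreach (var child in node.Children.Values) stack.Push(child);
                }
                return all.OrderByDescending(e => e.Item2).Take(n);
            }
        }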

  • 2020-12-12 14:08
        // Brute-force word count: for each word, scan the whole list and
        // compare; this is O(m^2) in the number of words.
        // h holds the file's text (e.g. read with File.ReadAllText).
        string[] stringList = h.Split(" ".ToCharArray(),
                                      StringSplitOptions.RemoveEmptyEntries);
        int m = stringList.Length;
        var counts = new int[m];
        for (int j = 0; j < m; j++)
        {
            int c = 0;
            for (int k = 0; k < m; k++)
            {
                if (string.Compare(stringList[j], stringList[k]) == 0)
                {
                    c = c + 1;
                }
            }
            counts[j] = c;   // occurrences of stringList[j]
        }
    
  • 2020-12-12 14:11

    A complete solution would be something like this:

    1. Do an external sort: O(N log N)
    2. Count the word frequencies in the file: O(N)
    3. (An alternative to the first two steps would be using a Trie, as @Summer_More_More_Tea suggests, to count the frequencies, if you can afford that amount of memory: O(k*N) for the two steps combined)
    4. Use a min-heap (see the sketch below):
      • Put the first n elements on the heap
      • For every remaining word, add it to the heap and then delete the new minimum, keeping the heap at n elements
      • In the end the heap will contain the n most common words: O(|words|*log(n))

    With the Trie the cost would be O(k*N), because the total number of words is generally bigger than the size of the vocabulary. Finally, since k is small for most Western languages, you can assume linear complexity.
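
    A minimal sketch of the counting pass plus the min-heap (steps 2 and 4) in C#, assuming the counts fit in memory and using .NET 6's PriorityQueue; "input.txt" is a placeholder path:

        using System;
        using System.Collections.Generic;
        using System.IO;

        class TopTenWithHeap
        {
            static void Main()
            {
                // Step 2: count word frequencies in one pass over the file: O(N).
                var counts = new Dictionary<string, int>();
                var text = File.ReadAllText("input.txt").ToLowerInvariant();
                foreach (var word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                    counts[word] = counts.TryGetValue(word, out var c) ? c + 1 : 1;

                // Step 4: a min-heap of size 10 keyed on frequency: O(|vocabulary| * log 10).
                var heap = new PriorityQueue<string, int>();
                foreach (var (word, count) in counts)
                {
                    heap.Enqueue(word, count);
                    if (heap.Count > 10)
                        heap.Dequeue();   // evict the current minimum frequency
                }

                while (heap.Count > 0)
                {
                    var word = heap.Dequeue();   // least frequent of the top ten first
                    Console.WriteLine($"{word}: {counts[word]}");
                }
            }
        }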

  • 2020-12-12 14:11

    This page says building a hash and sorting the values is best, and I'm inclined to agree: http://www.allinterview.com/showanswers/56657.html

    Here is a Bash implementation that does something similar, I think: http://www.commandlinefu.com/commands/view/5994/computes-the-most-frequent-used-words-of-a-text-file
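
    A compact hash-then-sort sketch of that approach in C# (the "input.txt" path is a placeholder; requires System, System.Collections.Generic, System.IO, and System.Linq):

        // Build a hash of counts, then sort the values and take the top ten.
        var counts = new Dictionary<string, int>();
        foreach (var w in File.ReadAllText("input.txt").Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            counts[w] = counts.TryGetValue(w, out var c) ? c + 1 : 1;

        foreach (var kv in counts.OrderByDescending(kv => kv.Value).Take(10))
            Console.WriteLine($"{kv.Key}: {kv.Value}");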
