This is apparently an interview question (found it in a collection of interview questions), but even if it's not, it's pretty cool.
We are told to do this efficien
Let's say we assign a random prime number to each of the 26 letters of the alphabet. Then we scan the file. Whenever we find a word, we calculate its hash value (a formula based on the position and the value of the letters making up the word). If we find this value in the hash table, we have almost certainly seen the word before (two different words can still collide, so the stored word should be compared as well), and we increment the count stored for that key. We also maintain an array of at most 10 entries for the most frequent words seen so far. But if we encounter a new hash, we store the file pointer for that hash value and initialize its count to 0.
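The counting step above can be sketched roughly as follows. This is only a minimal sketch under some assumptions: it relies on the built-in string hashing of `Dictionary<string, int>` instead of the prime-per-letter hash, it takes the words as an already split sequence instead of scanning a file, and the names `HashCounting` and `TopWords` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class HashCounting
{
    // Counts words with a Dictionary, then keeps only the `limit` most frequent
    // entries in a small list ordered by descending count.
    public static List<KeyValuePair<string, int>> TopWords(IEnumerable<string> words, int limit)
    {
        var counts = new Dictionary<string, int>();
        foreach (var word in words)
        {
            counts.TryGetValue(word, out int seen); // seen stays 0 for a new word
            counts[word] = seen + 1;
        }

        var top = new List<KeyValuePair<string, int>>();
        foreach (var entry in counts)
        {
            // Find the position that keeps `top` ordered by descending count.
            int pos = top.Count;
            while (pos > 0 && top[pos - 1].Value < entry.Value) pos--;
            if (pos < limit)
            {
                top.Insert(pos, entry);
                // Drop the smallest entry once the list grows past the limit.
                if (top.Count > limit) top.RemoveAt(top.Count - 1);
            }
        }
        return top;
    }
}
```

Because the list never holds more than `limit` entries, picking the top 10 costs a small constant amount of work per distinct word, on top of the single pass that does the counting.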
I have done it in C# like this (a sample):
// Requires: using System; using System.Linq;
int wordFrequency = 10; // minimum number of occurrences to report
string words = "hello how r u u u u u u u u u u u u u u u u u u ? hello there u u u u ! great to c u there. hello .hello hello hello hello hello .hello hello hello hello hello hello ";
// Group identical words, count each group, and keep the words seen at least wordFrequency times.
var result = (from word in words.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries)
              group word by word into g
              select new { Word = g.Key, Occurance = g.Count() }).ToList().FindAll(i => i.Occurance >= wordFrequency);
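The sample above keeps every word that occurs at least 10 times, which is not quite the same as the 10 most frequent words. A possible variation, assuming the same space-separated input, orders the groups by count and takes the first few. The names `LinqTopWords` and `Top` are hypothetical.

```csharp
using System;
using System.Linq;

static class LinqTopWords
{
    // Returns the `limit` most frequent words, most frequent first.
    public static (string Word, int Count)[] Top(string text, int limit)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                   .GroupBy(word => word)                        // one group per distinct word
                   .Select(g => (Word: g.Key, Count: g.Count())) // word plus its frequency
                   .OrderByDescending(entry => entry.Count)      // sorts all distinct words
                   .Take(limit)
                   .ToArray();
    }
}
```

For example, `LinqTopWords.Top(words, 10)` would give the ten most frequent tokens of the sample string. Punctuation is not stripped here, so "hello" and ".hello" still count as different words, just as in the sample.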
A radix tree or one of its variations will generally allow you to save storage space by collapsing common sequences.
Building it will take O(nk), where n is the number of strings and k is "the maximum length of all strings in the set".
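To illustrate the O(nk) build cost, here is a rough sketch of word counting with a plain trie. Note the assumption: this is an uncompressed trie with one node per character, not a true radix tree, so it does not collapse common sequences. The class names `TrieNode` and `WordTrie` are made up for the example.

```csharp
using System.Collections.Generic;

sealed class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public int Count; // how many times a word ending at this node was added
}

sealed class WordTrie
{
    private readonly TrieNode root = new TrieNode();

    // Walks (and creates) one node per character, so adding a word of
    // length k costs O(k). Returns the updated count for the word.
    public int Add(string word)
    {
        var node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out TrieNode next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        return ++node.Count;
    }

    // Returns how often the word was added, or 0 if it was never seen.
    public int CountOf(string word)
    {
        var node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node)) return 0;
        }
        return node.Count;
    }
}
```

Adding n words of length at most k touches at most n*k nodes, which matches the bound above. A radix tree would additionally merge chains of single-child nodes into one edge to save space.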
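// Note: h (the input text) and i are not defined in this snippet.
int k = 0;
int n = i;
int j;
string[] stringList = h.Split(" ".ToCharArray(),
    StringSplitOptions.RemoveEmptyEntries);
int m = stringList.Count();
// Brute force: compare every word with every other word, O(m^2) comparisons.
for (j = 0; j < m; j++)
{
    int c = 0; // number of occurrences of stringList[j]
    for (k = 0; k < m; k++)
    {
        if (string.Compare(stringList[j], stringList[k]) == 0)
        {
            c = c + 1;
        }
    }
}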
A complete solution would be something like this:
With the trie the cost would be O(k*N), where N is the total number of words and k the maximum word length, because the total number of words is generally larger than the size of the vocabulary. Finally, since k is small for most Western languages, you could assume linear complexity.
This page says building a hash and sorting the values is best, and I'm inclined to agree: http://www.allinterview.com/showanswers/56657.html
Here is a Bash implementation that does something similar, I think: http://www.commandlinefu.com/commands/view/5994/computes-the-most-frequent-used-words-of-a-text-file