Word Frequency Statistics in C (not C++)

Question

Given a string consisting of words separated by a single space, print the words in descending order of the number of times they appear in the string.

For example, an input string of “ab bc bc” would generate the following output:

bc : 2
ab : 1

The problem would be easy to solve with C++ data structures such as a map. But if it has to be solved in plain old C, it looks much harder.

What kind of data structures and algorithms shall I use here? Please be as detailed as possible. I am weak in DS and Algo. :-(

Answer 1:

Here's a sample of how I'd do it. The linear search in findWord() could be optimized, and the number of allocations could be reduced by allocating words in blocks instead of one at a time. A linked list would also work here. The sample never frees the memory it allocates, but it should hopefully get you going.

    #include <assert.h>
    #include <ctype.h>   /* isspace */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>  /* memcpy, strcmp, strcpy, strlen, strncpy */

    #define MAXWORDLEN 128

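    /* Returns a pointer to the first whitespace character, or to the terminating '\0'. */
    const char* findWhitespace(const char* text)
    {
        /* isspace() needs a value representable as unsigned char */
        while (*text && !isspace((unsigned char)*text))
            text++;
        return text;
    }

    /* Returns a pointer to the first non-whitespace character, or to the terminating '\0'. */
    const char* findNonWhitespace(const char* text)
    {
        while (*text && isspace((unsigned char)*text))
            text++;
        return text;
    }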

    typedef struct tagWord
    {
        char word[MAXWORDLEN + 1];
        int count;
    } Word;

    typedef struct tagWordList
    {
        Word* words;
        int count;
    } WordList;

    WordList* createWordList(unsigned int count);

    void extendWordList(WordList* wordList, const int count)
    {
        Word* newWords = (Word*)malloc(sizeof(Word) * (wordList->count + count));
        if (wordList->words != NULL) {
            memcpy(newWords, wordList->words, sizeof(Word)* wordList->count);
            free(wordList->words);
        }
        for (int i = wordList->count; i < wordList->count + count; i++) {
            newWords[i].word[0] = '\0';
            newWords[i].count = 0;
        }
        wordList->words = newWords;
        wordList->count += count;
    }

    void addWord(WordList* wordList, const char* word)
    {
        assert(strlen(word) <= MAXWORDLEN);
        extendWordList(wordList, 1);
        Word* wordNode = &wordList->words[wordList->count - 1];
        strcpy(wordNode->word, word);
        wordNode->count++;  
    }

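    /* Linear search; returns NULL if the word is not in the list yet. */
    Word* findWord(WordList* wordList, const char* word)
    {
        for(int i = 0; i < wordList->count; i++) {
            /* strcmp is case-sensitive. stricmp is not standard C; for
               case-insensitive matching use strcasecmp (POSIX) or _stricmp (MSVC). */
            if (strcmp(word, wordList->words[i].word) == 0) {
                return &wordList->words[i];
            }
        }
        return NULL;
    }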

    void updateWordList(WordList* wordList, const char* word)
    {
        Word* foundWord = findWord(wordList, word);
        if (foundWord == NULL) {
            addWord(wordList, word);
        } else {
            foundWord->count++;
        }
    }

    WordList* createWordList(unsigned int count)
    {
        WordList* wordList = (WordList*)malloc(sizeof(WordList));
        if (count > 0) {
            wordList->words = (Word*)malloc(sizeof(Word) * count);
            for(unsigned int i = 0; i < count; i++) {
                wordList->words[i].count = 0;
                wordList->words[i].word[0] = '\0';
            }
        }
        else {
            wordList->words = NULL;
        }
        wordList->count = count;    
        return wordList;
    }

    void printWords(WordList* wordList)
    {
        for (int i = 0; i < wordList->count; i++) {
            printf("%s: %d\n", wordList->words[i].word, wordList->words[i].count);
        }
    }

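    /* qsort comparator: highest count first, ties broken alphabetically. */
    int compareWord(const void* vword1, const void* vword2)
    {
        const Word* word1 = (const Word*)vword1;
        const Word* word2 = (const Word*)vword2;
        if (word1->count != word2->count)
            return (word1->count < word2->count) ? 1 : -1;
        return strcmp(word1->word, word2->word);
    }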

    void sortWordList(WordList* wordList)
    {
        qsort(wordList->words, wordList->count, sizeof(Word), compareWord);
    }

    void countWords(const char* text)
    {
        WordList   *wordList = createWordList(0);
        const char *beg = findNonWhitespace(text);
        const char *end;
        char       word[MAXWORDLEN + 1];   /* +1 for the terminating '\0' */

        while (*beg) {
            /* end points at whitespace or at the final '\0', so the last
               word is counted even without trailing whitespace */
            end = findWhitespace(beg);
            assert(end - beg <= MAXWORDLEN);
            strncpy(word, beg, (size_t)(end - beg));
            word[end - beg] = '\0';
            updateWordList(wordList, word);
            beg = findNonWhitespace(end);
        }

        sortWordList(wordList);
        printWords(wordList);
    }

    int main(void)
    {
        const char* text = "abc 123 abc 456 def 789 \tyup this \r\ncan work yup 456 it can";
        countWords(text);
        return 0;
    }
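The remark above about allocating blocks of words instead of one at a time could look roughly like the sketch below. It is not part of the answer's code: `GrowList`, `growListAppend`, `growListFree`, and the starting capacity of 16 are hypothetical choices made for illustration. The idea is to track a capacity separately from the number of used slots and double it with `realloc` only when the array is full.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAXWORDLEN 128

typedef struct {
    char word[MAXWORDLEN + 1];
    int count;
} Word;

/* Hypothetical growable list: capacity is tracked separately from count. */
typedef struct {
    Word *words;
    size_t count;     /* slots in use */
    size_t capacity;  /* slots allocated */
} GrowList;

/* Appends a new word with count 1. Returns 1 on success, 0 on failure. */
int growListAppend(GrowList *list, const char *word)
{
    assert(list != NULL && word != NULL);
    if (strlen(word) > MAXWORDLEN)
        return 0;
    if (list->count == list->capacity) {
        /* Double the capacity (start at 16, an arbitrary choice). */
        size_t newCap = list->capacity ? list->capacity * 2 : 16;
        Word *p = realloc(list->words, newCap * sizeof *p);
        if (p == NULL)
            return 0;           /* the old block is still valid */
        list->words = p;
        list->capacity = newCap;
    }
    strcpy(list->words[list->count].word, word);
    list->words[list->count].count = 1;
    list->count++;
    return 1;
}

/* Releases the array and resets the list to its empty state. */
void growListFree(GrowList *list)
{
    free(list->words);
    list->words = NULL;
    list->count = list->capacity = 0;
}
```

A list starts out as `GrowList list = { NULL, 0, 0 };`. With doubling, appending n words triggers only about log2(n) reallocations rather than n, and `growListFree` covers the deallocation the sample leaves out.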

Answer 2:

One data structure you could use is a simple binary tree that contains words you could compare using strcmp. (I will ignore case issues for now).

To keep lookups fast in the worst case, you will want the tree to stay balanced as it grows. For this, look up AVL trees, 2-3 trees, or red-black trees on Wikipedia or elsewhere.

I will not give too much more detail except that to create a binary tree struct, each node would have a left and right sub-node which could be null, and for a leaf node, both sub-nodes are null. To make it simpler use an "intrusive" node that has the value and two sub-nodes. Something like:

    struct Node
    {
      char * value;
      size_t frequency;
      struct Node * left;
      struct Node * right;
    };

Since this is C, you obviously need to do all the memory management yourself.

You will have a function that recurses down the tree, comparing and going left or right as appropriate. If the word is found, it just increments the frequency. If not, your function should be able to determine where to insert the node, and then comes your insertion and rebalancing logic. Of course the new node will contain the word with a frequency of 1.

At the end you will need a way to recurse through your tree printing the results. In your case this can be a recursive function.
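As a rough illustration of the steps described so far, here is a minimal sketch that reuses the `struct Node` layout from this answer. The function names (`bstCount`, `bstFlatten`, `byFrequencyDesc`, `bstFree`) are made up for this example, and the rebalancing step is deliberately left out, so this is a plain unbalanced binary search tree. Because an in-order walk yields the words alphabetically, the sketch collects the nodes into an array and sorts that by frequency.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct Node {
    char *value;
    size_t frequency;
    struct Node *left;
    struct Node *right;
};

/* Inserts word with frequency 1, or increments it if already present.
   Returns the subtree root; *distinct counts the nodes created.
   A word is silently skipped if allocation fails. No rebalancing. */
struct Node *bstCount(struct Node *root, const char *word, size_t *distinct)
{
    assert(word != NULL && distinct != NULL);
    if (root == NULL) {
        struct Node *n = malloc(sizeof *n);
        if (n == NULL)
            return NULL;
        n->value = malloc(strlen(word) + 1);
        if (n->value == NULL) {
            free(n);
            return NULL;
        }
        strcpy(n->value, word);
        n->frequency = 1;
        n->left = n->right = NULL;
        ++*distinct;
        return n;
    }
    int cmp = strcmp(word, root->value);
    if (cmp == 0)
        root->frequency++;
    else if (cmp < 0)
        root->left = bstCount(root->left, word, distinct);
    else
        root->right = bstCount(root->right, word, distinct);
    return root;
}

/* In-order walk: stores node pointers in out[] starting at index i and
   returns the next free index. out[] must have room for every node. */
size_t bstFlatten(const struct Node *root, const struct Node **out, size_t i)
{
    if (root == NULL)
        return i;
    i = bstFlatten(root->left, out, i);
    out[i++] = root;
    return bstFlatten(root->right, out, i);
}

/* qsort comparator over node pointers: highest frequency first,
   ties broken alphabetically. */
int byFrequencyDesc(const void *a, const void *b)
{
    const struct Node *x = *(const struct Node *const *)a;
    const struct Node *y = *(const struct Node *const *)b;
    if (x->frequency != y->frequency)
        return (x->frequency < y->frequency) ? 1 : -1;
    return strcmp(x->value, y->value);
}

/* Post-order walk that frees every node and its string. */
void bstFree(struct Node *root)
{
    if (root == NULL)
        return;
    bstFree(root->left);
    bstFree(root->right);
    free(root->value);
    free(root);
}
```

To use it, call `bstCount` once per word, allocate an array of `distinct` node pointers, fill it with `bstFlatten`, run `qsort(array, distinct, sizeof array[0], byFrequencyDesc)`, and print each node's `value` and `frequency` in array order.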

Note by the way that an alternative data structure would be some kind of hash-table.
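For comparison, a hash table for this job might be sketched as follows. This is only one possible design, assumed for illustration: separate chaining, a fixed bucket count (`NBUCKETS`, an arbitrary value), and the well-known djb2 string hash. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* arbitrary fixed size chosen for this sketch */

struct Entry {
    char *word;
    size_t count;
    struct Entry *next;   /* next entry in the same bucket */
};

struct Table {
    struct Entry *buckets[NBUCKETS];
};

/* djb2 string hash, reduced to a bucket index. */
static size_t hashWord(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Increments the count for word, adding it if needed.
   Returns the new count, or 0 if allocation fails. */
size_t tableCount(struct Table *t, const char *word)
{
    assert(t != NULL && word != NULL);
    size_t b = hashWord(word);
    for (struct Entry *e = t->buckets[b]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return ++e->count;

    struct Entry *e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = t->buckets[b];   /* push onto the front of the chain */
    t->buckets[b] = e;
    return 1;
}

/* Returns the stored count for word, or 0 if it was never added. */
size_t tableLookup(const struct Table *t, const char *word)
{
    assert(t != NULL && word != NULL);
    for (const struct Entry *e = t->buckets[hashWord(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->count;
    return 0;
}

/* Frees every entry and leaves all buckets empty. */
void tableClear(struct Table *t)
{
    for (size_t b = 0; b < NBUCKETS; b++) {
        struct Entry *e = t->buckets[b];
        while (e != NULL) {
            struct Entry *next = e->next;
            free(e->word);
            free(e);
            e = next;
        }
        t->buckets[b] = NULL;
    }
}
```

A table must start with every bucket set to NULL, for example `struct Table t = {{0}};`. As with the tree, the entries come out in no useful order, so they would still need to be copied into an array and sorted by count before printing.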

If you are looking for the most efficient solution and have a lot of memory at hand, you would use a data structure in which you branch on each letter as you encounter it (this structure is called a trie). So the "a" gives you all the words beginning with a, then you move to the second letter, which is the "b", and so on. It is rather complicated to implement for someone who doesn't know data structures, so I would advise you to go with the simple binary tree.

Note that a traversal prints the words in alphabetical order, not in descending order of frequency, so you would have to sort the results first. (A C++ map would not give you that order either.)
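A letter-by-letter structure of this kind could be sketched as below. The layout is an assumption made for illustration: each node holds one child pointer per possible byte value (256 of them), which is simple but memory-hungry, in line with the "lot of memory" caveat. The names `TrieNode`, `trieNew`, `trieCount`, `trieLookup`, and `trieFree` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define FANOUT 256   /* one child slot per possible byte value */

struct TrieNode {
    struct TrieNode *child[FANOUT];
    size_t count;    /* times the word ending at this node was seen */
};

/* Allocates an empty node, or returns NULL if allocation fails. */
struct TrieNode *trieNew(void)
{
    struct TrieNode *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    for (int i = 0; i < FANOUT; i++)
        n->child[i] = NULL;
    n->count = 0;
    return n;
}

/* Walks or creates one node per character, then increments the count at
   the final node. Returns the new count, or 0 if allocation fails. */
size_t trieCount(struct TrieNode *root, const char *word)
{
    assert(root != NULL && word != NULL);
    struct TrieNode *n = root;
    for (; *word; word++) {
        unsigned char c = (unsigned char)*word;
        if (n->child[c] == NULL) {
            n->child[c] = trieNew();
            if (n->child[c] == NULL)
                return 0;
        }
        n = n->child[c];
    }
    return ++n->count;
}

/* Returns how often word was counted, or 0 if it was never added. */
size_t trieLookup(const struct TrieNode *root, const char *word)
{
    assert(root != NULL && word != NULL);
    const struct TrieNode *n = root;
    for (; *word; word++) {
        n = n->child[(unsigned char)*word];
        if (n == NULL)
            return 0;
    }
    return n->count;
}

/* Frees a node and everything below it. */
void trieFree(struct TrieNode *n)
{
    if (n == NULL)
        return;
    for (int i = 0; i < FANOUT; i++)
        trieFree(n->child[i]);
    free(n);
}
```

Lookup and insertion cost is proportional to the length of the word and does not depend on how many words are stored. Restricting the alphabet (for example to 26 lowercase letters) would shrink each node considerably.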

Answer 3:

I would use a ternary search tree for this. The following article, in which Jon Bentley and Robert Sedgewick introduce the data structure, has an example in C.

http://www.cs.princeton.edu/~rs/strings/
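To give a feel for the structure, here is a small sketch of a ternary search tree adapted to word counting. It is not the code from the article: the node layout with a `count` field and the names `TstNode`, `tstCount`, `tstLookup`, and `tstFree` are assumptions made for this example. Each node holds one character plus three children, for characters that are lower, equal (continue with the next character), or higher.

```c
#include <assert.h>
#include <stdlib.h>

struct TstNode {
    char split;               /* character stored at this node */
    size_t count;             /* > 0 if a word ends at this node */
    struct TstNode *lo, *eq, *hi;
};

/* Adds the non-empty string s, or increments its count if present.
   Returns the subtree root. A word is skipped if allocation fails. */
struct TstNode *tstCount(struct TstNode *p, const char *s)
{
    assert(s != NULL && *s != '\0');
    if (p == NULL) {
        p = malloc(sizeof *p);
        if (p == NULL)
            return NULL;
        p->split = *s;
        p->count = 0;
        p->lo = p->eq = p->hi = NULL;
    }
    if (*s < p->split)
        p->lo = tstCount(p->lo, s);
    else if (*s > p->split)
        p->hi = tstCount(p->hi, s);
    else if (s[1] != '\0')
        p->eq = tstCount(p->eq, s + 1);   /* matched; go to next character */
    else
        p->count++;                       /* last character of the word */
    return p;
}

/* Returns how often s was counted, or 0 if it was never added. */
size_t tstLookup(const struct TstNode *p, const char *s)
{
    assert(s != NULL);
    if (*s == '\0')
        return 0;
    while (p != NULL) {
        if (*s < p->split)
            p = p->lo;
        else if (*s > p->split)
            p = p->hi;
        else if (s[1] != '\0') {
            p = p->eq;
            s++;
        } else
            return p->count;
    }
    return 0;
}

/* Frees the whole tree. */
void tstFree(struct TstNode *p)
{
    if (p == NULL)
        return;
    tstFree(p->lo);
    tstFree(p->eq);
    tstFree(p->hi);
    free(p);
}
```

Start with `struct TstNode *root = NULL;` and call `root = tstCount(root, word);` for each word. Compared with a trie that has one child slot per character value, each node here carries only three pointers, at the cost of a few extra comparisons per character.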

Source: https://stackoverflow.com/questions/8730887/word-frequency-statistics-in-c-not-c

Tags

algorithm

data-structures

word-frequency