Is strtok broken? Or just tricky?

Posted by 半世苍凉 on 2019-11-28 01:46:54

Yes, strtok is hopelessly broken, even in a simple single-threaded program. I will demonstrate the failure with some sample code.

Let us begin with a simple text-analyzer function that uses strtok to gather statistics about sentences of text. By the end of this example, the combined code will exhibit undefined behavior.

In this example, a sentence is a set of words separated by spaces, commas, semi-colons, and periods.

#include <stdio.h>    // printf, getchar
#include <stdlib.h>   // free
#include <string.h>   // strtok, strlen, strdup (strdup is a POSIX function)

#define MAX(a, b) ((a) > (b) ? (a) : (b))

// Example:
//     int words, longest;
//     GetSentenceStats("There were a king with a large jaw and a queen with a plain face, on the throne of England.", &words, &longest);
// will report there are 20 words, and the longest word has 7 characters ("England").
void GetSentenceStats(const char* sentence, int* pWordCount, int* pMaxWordLen)
{
    const char* delims = " ,;.";     // In a sentence, words are separated by spaces, commas, semicolons, or periods.
    char* input = strdup(sentence);  // Make a local copy of the sentence, to be modified without affecting the caller.

    *pWordCount = 0;                 // Initialize the outputs to zero.
    *pMaxWordLen = 0;
    if (input == NULL) return;       // strdup could not allocate the copy.

    char* word = strtok(input, delims);
    while(word)
    {
        (*pWordCount)++;
        *pMaxWordLen = MAX(*pMaxWordLen, (int)strlen(word));
        word = strtok(NULL, delims);
    }
    free(input);
}

This simple function works. There are no bugs so far.


Now let us augment our library with a function that gathers stats on paragraphs of text.
A paragraph is a set of sentences separated by exclamation marks, question marks, and periods.

It will return the number of sentences in the paragraph and the number of words in the longest sentence.
Perhaps most importantly, it will use the earlier function GetSentenceStats to help.

void GetParagraphStats(const char* paragraph, int* pSentenceCount, int* pMaxWords)
{
    const char* delims = ".!?";       // Sentences in a paragraph are separated by periods, question marks, and exclamation marks.
    char* input = strdup(paragraph);  // Make a local copy of the paragraph, to be modified without affecting the caller.

    *pSentenceCount = 0;
    *pMaxWords = 0;
    if (input == NULL) return;        // strdup could not allocate the copy.
    char* sentence = strtok(input, delims);
    while(sentence)
    {
        (*pSentenceCount)++;

        int wordCount;
        int longestWord;
        GetSentenceStats(sentence, &wordCount, &longestWord);
        *pMaxWords = MAX(*pMaxWords, wordCount);
        sentence = strtok(NULL, delims);    // Broken: strtok's saved position was overwritten by GetSentenceStats and now refers to freed memory.
    }
    free(input);
}

This function also looks very simple and straightforward.
But it does not work, as demonstrated by this sample program.

int main(void)
{
    int cnt;
    int len;

    // First demonstrate that the SentenceStats function works properly:
    const char *sentence = "There were a king with a large jaw and a queen with a plain face, on the throne of England.";
    GetSentenceStats(sentence, &cnt, &len);
    printf("Word Count: %d\nLongest Word: %d\n", cnt, len);
    // Correct Answer:
    // Word Count: 20
    // Longest Word: 7   ("England")


    printf("\n\nAt this point, expected output is 20/7.\nEverything is working fine\n\n");

    char paragraph[] =  "It was the best of times!"   // Literary purists will note I have changed Dickens's original text to make a better example
                        "It was the worst of times?"
                        "It was the age of wisdom."
                        "It was the age of foolishness."
                        "We were all going direct to Heaven!";
    int sentenceCount;
    int maxWords;
    GetParagraphStats(paragraph, &sentenceCount, &maxWords);
    printf("Sentence Count: %d\nLongest Sentence: %d\n", sentenceCount, maxWords);
    // Correct Answer:
    // Sentence Count: 5
    // Longest Sentence: 7  ("We were all going direct to Heaven")

    printf("\n\nAt the end, expected output is 5/7.\nBut Actual Output is Undefined Behavior! Strtok is hopelessly broken\n");
    getchar();   // Wait for input before exiting (portable stand-in for the MSVC-only _getch).
    return 0;
}

All calls to strtok are entirely correct, and are on separate data.
But the result is Undefined Behavior!

Why does this happen?
strtok keeps its position in a single hidden internal pointer that all callers share. When GetParagraphStats is called, it begins a strtok loop to get sentences. On the first sentence it calls GetSentenceStats. GetSentenceStats also begins a strtok loop, overwriting the state established by GetParagraphStats, and then frees its buffer. When GetSentenceStats returns, the caller (GetParagraphStats) calls strtok(NULL) again to get the next sentence.

But strtok treats this as a call to continue the most recent operation, and continues tokenizing memory that has now been freed! The result is the dreaded Undefined Behavior.
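
The shared state can be observed without touching freed memory. The following is a minimal sketch and not part of the original program: inner_tokenize and outer_survives_nested_strtok are hypothetical names, and both buffers stay alive for the whole function so that the behavior is well defined.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical callee: runs its own strtok loop on an unrelated buffer,
   the way GetSentenceStats does. */
static void inner_tokenize(char *other)
{
    for (char *t = strtok(other, " "); t != NULL; t = strtok(NULL, " "))
        ;  /* consume every token */
}

/* Returns 1 if the outer loop still receives its second token ("b") after
   a nested strtok loop has run, and 0 if the outer position was lost. */
int outer_survives_nested_strtok(void)
{
    char outer[] = "a,b,c";
    char inner[] = "x y z";

    char *first = strtok(outer, ",");  /* "a"; strtok now remembers a position in outer */
    assert(first != NULL);

    inner_tokenize(inner);             /* replaces that position with one in inner */

    char *second = strtok(NULL, ",");  /* continues in inner, which is already exhausted */
    return second != NULL && strcmp(second, "b") == 0;
}
```

Calling outer_survives_nested_strtok() should return 0: the outer loop never sees "b", because the only saved position now belongs to the inner buffer. In the program above the inner buffer has additionally been freed, which is what turns a wrong answer into undefined behavior.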

When is it safe to use strtok?
Even in a single-threaded environment, strtok can only be used safely when the programmer/architect is sure of two conditions:

  • The function using strtok must never call any function that may also use strtok.
    If it calls a subroutine that also uses strtok, its own use of strtok may be interrupted.

  • The function using strtok must never be called by any function that may also use strtok.
    If this function is ever called by another routine using strtok, then this function will interrupt the caller's use of strtok.

In a multi-threaded environment, safe use of strtok is even harder to guarantee: the programmer needs to be sure that there is only one use of strtok in progress on the current thread and, because the C standard does not require strtok to avoid data races with other calls to strtok, that no other thread is using it at the same time.
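
Where it is available, the reentrant strtok_r from POSIX sidesteps both conditions, because each loop keeps its position in a caller-supplied pointer instead of hidden shared state. The following is a sketch under that assumption, not the original author's code: GetSentenceStatsR and GetParagraphStatsR are hypothetical names for reworked versions of the two functions above. (Microsoft's C runtime offers a similar three-argument strtok_s.)

```c
#define _POSIX_C_SOURCE 200809L  /* asks POSIX headers to declare strtok_r and strdup */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reentrant counterpart of GetSentenceStats. */
void GetSentenceStatsR(const char *sentence, int *pWordCount, int *pMaxWordLen)
{
    assert(sentence != NULL && pWordCount != NULL && pMaxWordLen != NULL);

    const char *delims = " ,;.";
    char *input = strdup(sentence);
    char *save = NULL;  /* this loop's private tokenizer position */

    *pWordCount = 0;
    *pMaxWordLen = 0;
    if (input == NULL)
        return;         /* allocation failed; report zero words */

    for (char *word = strtok_r(input, delims, &save); word != NULL;
         word = strtok_r(NULL, delims, &save))
    {
        int len = (int)strlen(word);
        (*pWordCount)++;
        if (len > *pMaxWordLen)
            *pMaxWordLen = len;
    }
    free(input);
}

/* Hypothetical reentrant counterpart of GetParagraphStats. */
void GetParagraphStatsR(const char *paragraph, int *pSentenceCount, int *pMaxWords)
{
    assert(paragraph != NULL && pSentenceCount != NULL && pMaxWords != NULL);

    const char *delims = ".!?";
    char *input = strdup(paragraph);
    char *save = NULL;  /* independent of the pointer used by GetSentenceStatsR */

    *pSentenceCount = 0;
    *pMaxWords = 0;
    if (input == NULL)
        return;

    for (char *sentence = strtok_r(input, delims, &save); sentence != NULL;
         sentence = strtok_r(NULL, delims, &save))
    {
        int wordCount;
        int longestWord;
        (*pSentenceCount)++;
        GetSentenceStatsR(sentence, &wordCount, &longestWord);  /* nested loop is now harmless */
        if (wordCount > *pMaxWords)
            *pMaxWords = wordCount;
    }
    free(input);
}
```

With the five-sentence paragraph from the sample program, GetParagraphStatsR should report 5 sentences and a longest sentence of 7 words, which is the answer the strtok version was expected to give.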
