Is strtok hopelessly broken?
On many StackOverflow questions about text-parsing in C, someone will suggest using strtok, and one common reply is that strtok should never be used, that it is hopelessly broken.
Some posters have claimed that strtok's problems are limited to multi-threading issues, and that it is safe in a single-threaded environment.
What is the right answer?
Does it work?
Is it hopelessly broken?
Can you back up your answer with examples?
Yes, strtok is hopelessly broken, even in a simple single-threaded program, and I will demonstrate this failure with some sample code:
Let us begin with a simple text-analyzer function that uses strtok to gather statistics about sentences of text.
This code will eventually lead to undefined behavior, although the function is correct when used on its own.
In this example, a sentence is a set of words separated by spaces, commas, semicolons, and periods.
    #include <stdio.h>
    #include <stdlib.h>   // free
    #include <string.h>   // strtok, strlen; strdup is POSIX (and C23)

    #define MAX(a, b) (((a) > (b)) ? (a) : (b))

    // Example:
    //     int words, longest;
    //     GetSentenceStats("There were a king with a large jaw and a queen with a plain face, on the throne of England.", &words, &longest);
    // will report there are 20 words, and the longest word has 7 characters ("England").
    void GetSentenceStats(const char* sentence, int* pWordCount, int* pMaxWordLen)
    {
        const char* delims = " ,;.";     // In a sentence, words are separated by spaces, commas, semicolons or periods.
        char* input = strdup(sentence);  // Make a local copy of the sentence, to be modified without affecting the caller.

        *pWordCount = 0;                 // Initialize the outputs to zero.
        *pMaxWordLen = 0;

        char* word = strtok(input, delims);
        while (word)
        {
            (*pWordCount)++;
            *pMaxWordLen = MAX(*pMaxWordLen, (int)strlen(word));
            word = strtok(NULL, delims);
        }
        free(input);
    }
This simple function works. There are no bugs so far.
Now let us augment our library with a function that gathers stats on paragraphs of text.
A paragraph is a set of sentences separated by exclamation marks, question marks, and periods.
It will return the number of sentences in the paragraph, and the number of words in the longest sentence.
And perhaps most importantly, it will use the earlier function GetSentenceStats to help.
    void GetParagraphStats(const char* paragraph, int* pSentenceCount, int* pMaxWords)
    {
        const char* delims = ".!?";       // Sentences in a paragraph are separated by period, question mark, and exclamation mark.
        char* input = strdup(paragraph);  // Make a local copy of the paragraph, to be modified without affecting the caller.

        *pSentenceCount = 0;
        *pMaxWords = 0;

        char* sentence = strtok(input, delims);
        while (sentence)
        {
            (*pSentenceCount)++;

            int wordCount;
            int longestWord;
            GetSentenceStats(sentence, &wordCount, &longestWord);  // Runs its own strtok loop on a different buffer.
            *pMaxWords = MAX(*pMaxWords, wordCount);

            sentence = strtok(NULL, delims);  // Undefined behavior: strtok's saved state now refers to the buffer freed in GetSentenceStats.
        }
        free(input);
    }
This function also looks very simple and straightforward.
But it does not work, as demonstrated by this sample program.
    int main(void)
    {
        int cnt;
        int len;

        // First demonstrate that the SentenceStats function works properly:
        const char* sentence = "There were a king with a large jaw and a queen with a plain face, on the throne of England.";
        GetSentenceStats(sentence, &cnt, &len);
        printf("Word Count: %d\nLongest Word: %d\n", cnt, len);
        // Correct Answer:
        // Word Count: 20
        // Longest Word: 7 ("England")

        printf("\n\nAt this point, expected output is 20/7.\nEverything is working fine\n\n");

        char paragraph[] = "It was the best of times!"  // Literary purists will note I have changed Dickens's original text to make a better example.
                           "It was the worst of times?"
                           "It was the age of wisdom."
                           "It was the age of foolishness."
                           "We were all going direct to Heaven!";
        int sentenceCount;
        int maxWords;
        GetParagraphStats(paragraph, &sentenceCount, &maxWords);
        printf("Sentence Count: %d\nLongest Sentence: %d\n", sentenceCount, maxWords);
        // Correct Answer:
        // Sentence Count: 5
        // Longest Sentence: 7 ("We were all going direct to Heaven")

        printf("\n\nAt the end, expected output is 5/7.\nBut Actual Output is Undefined Behavior! Strtok is hopelessly broken\n");
        return 0;
    }
All calls to strtok are entirely correct, and are on separate data.
But the result is Undefined Behavior!
Why does this happen?
When GetParagraphStats is called, it begins a strtok loop to get sentences.
On the first sentence it calls GetSentenceStats. GetSentenceStats also begins a strtok loop, overwriting all state established by GetParagraphStats.
When GetSentenceStats returns, the caller (GetParagraphStats) calls strtok(NULL, delims) again to get the next sentence.
But strtok treats this as a call to continue the most recent operation, the one started inside GetSentenceStats, and continues tokenizing memory that has now been freed!
The result is the dreaded Undefined Behavior.
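The same clobbering can be reproduced without freeing anything, which gives a well-defined but wrong result that is easier to observe. The sketch below is only an illustration: count_words and count_fields are hypothetical helpers, not part of the code above, and the inner buffer is static so that it stays alive after the inner loop finishes.

```c
#include <string.h>

/* Hypothetical helper: counts space-separated words with its own strtok loop. */
static int count_words(const char *text)
{
    static char copy[128];              /* static, so it outlives this call */
    int n = 0;

    strncpy(copy, text, sizeof copy - 1);
    copy[sizeof copy - 1] = '\0';

    for (char *w = strtok(copy, " "); w != NULL; w = strtok(NULL, " "))
        n++;
    return n;                           /* strtok's hidden state is now exhausted */
}

/* Hypothetical helper: counts ';'-separated fields with an outer strtok loop.
   When `nested` is non-zero, each field is also passed to count_words. */
static int count_fields(char *text, int nested)
{
    int n = 0;

    for (char *f = strtok(text, ";"); f != NULL; f = strtok(NULL, ";")) {
        n++;
        if (nested)
            count_words(f);             /* restarts strtok on a different buffer */
    }
    return n;
}
```

Called on a writable copy of "a b;c d;e f", count_fields reports 3 fields with nested set to 0, but only 1 field with nested set to 1: after the inner loop runs to completion, the outer strtok(NULL, ";") continues from the inner buffer's exhausted state and returns NULL.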
When is it safe to use strtok?
Even in a single-threaded environment, strtok can only be used safely when the programmer/architect is sure of two conditions:
1. The function using strtok must never call any function that may also use strtok. If it calls a subroutine that also uses strtok, its own use of strtok may be interrupted.
2. The function using strtok must never be called by any function that may also use strtok. If this function is ever called by another routine using strtok, then this function will interrupt the caller's use of strtok.
In a multi-threaded environment, using strtok safely is harder still, because the programmer needs to be sure that there is only one use of strtok on the current thread, and also that no other thread is using strtok at the same time.
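All of these restrictions come from strtok keeping its position in hidden, shared state. As a rough sketch of the alternative, the tokenizer below keeps that position in a variable owned by the caller, so nested loops cannot disturb each other. The names next_token, count_words and count_sentences are hypothetical and written only for this illustration; POSIX provides strtok_r, which takes a similar caller-owned save pointer.

```c
#include <string.h>

/* Hypothetical helper: behaves like strtok, but the scan position lives in
   *state, which the caller owns. Set *state to the buffer before the first call. */
static char *next_token(char **state, const char *delims)
{
    char *s = *state;

    if (s == NULL)
        return NULL;                    /* input already exhausted */

    s += strspn(s, delims);             /* skip leading delimiters */
    if (*s == '\0') {
        *state = NULL;
        return NULL;
    }

    char *end = s + strcspn(s, delims); /* find the end of this token */
    if (*end != '\0') {
        *end = '\0';                    /* terminate the token in place */
        *state = end + 1;
    } else {
        *state = NULL;                  /* token ran to the end of the string */
    }
    return s;
}

/* Counts words in a sentence; modifies the sentence in place. */
static int count_words(char *sentence)
{
    int n = 0;
    char *state = sentence;

    while (next_token(&state, " ,;") != NULL)
        n++;
    return n;
}

/* Counts sentences in a writable paragraph and stores the largest
   per-sentence word count in *max_words. */
static int count_sentences(char *paragraph, int *max_words)
{
    int n = 0;
    char *state = paragraph;            /* outer state, untouched by count_words */
    char *s;

    *max_words = 0;
    while ((s = next_token(&state, ".!?")) != NULL) {
        int w = count_words(s);         /* inner loop uses its own state */
        if (w > *max_words)
            *max_words = w;
        n++;
    }
    return n;
}
```

Run on a writable copy of the five-sentence paragraph from the sample program, count_sentences returns 5 and reports 7 as the largest word count. Each loop advances only its own state variable, which is why the nested call is harmless here.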
Source: https://stackoverflow.com/questions/28588170/is-strtok-broken-or-just-tricky