Is reading into uninitialized memory space ALWAYS ill advised?

Question

I am recreating the entire standard C library and I'm working on an implementation for strlen that I would like to be the basis of all my other str functions.

My current implementation is as follows:

int     ft_strlen(char const *str)
{
    int length;

    length = 0;
    while (str[length] != '\0' || str[length + 1] == '\0')
        length++;

    return length;
}

My question is that when I pass a str like:

char str[6] = "hi!";

As expected, the memory reads:

['h']['i']['!']['\0']['\0']['\0']['\0']

If you look at my implementation, you can expect that I would get a return of 6 - as opposed to 3 (my previous approach) so that I can check strlen potentially including extra allocated memory.

The catch here is that I will have to read outside of initialized memory by 1 byte to fail my last loop condition at final null terminator - which is the behavior I WANT. However this is generally considered bad practice and by some an automatic error.

Is reading outside of your initialized value a bad idea even when you are very specifically intending to read into a junk value (to ensure it DOES NOT contain '\0')?

If so, why?

I understand that:

"buffer overruns are a favorite avenue for attacking secure programs"

Still, I can't see the problem if I'm simply trying to ensure I've hit the end of initialized values...

Also, I realize this problem can be avoided - I have already sidestepped with a value set to 1 and then only reading initialized values - that's not the point, this is more of a fundamental question about C, runtime behavior and best practices ;)

[EDITS:]

Comment to previous post:

OK. Fair enough - but as to the question "Is it always a bad idea (danger from intentional manipulation or runtime stability) to read after initialized values" - do you have an answer? Please read the accepted answer for an example of the nature of the question. I really don't need this code fixed, nor do I need a better understanding of data types, POSIX specs or common standards. My question is related to WHY such standards may exist - why it may be important to never read past initialized memory (if such reasons exist)? What is the potential fallout of reading past initialized values IN GENERAL?

Please all - I'm trying to better understand aspects of how systems operate and I have a VERY SPECIFIC question.

Answer 1:

Reading uninitialized memory can return data previously stored there. If your program processes sensitive data (such as passwords or cryptographic keys) and you disclose the uninitialized data to some party (expecting that it is valid), you might reveal confidential information.

Furthermore, if you read beyond the end of an array, the memory might not be mapped, and you will get a segmentation fault and a crash.

The compiler can also assume that your code is correct and will not read uninitialized memory, and make optimization decisions based on that, so even reading uninitialized memory can have arbitrary side effects.
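To make the first point concrete, here is a minimal sketch of how stale data survives in a reused buffer. The helper names (reuse_without_clearing, stale_data_present) are made up for this illustration. Every read stays inside one fully initialized array, so the sketch itself is well defined; it only shows why bytes past the logical end of a string can still hold whatever was stored there before.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies a short message into a buffer that is
 * being reused, without clearing what was stored there before. */
void reuse_without_clearing(char *buf, const char *msg)
{
    strcpy(buf, msg); /* overwrites only strlen(msg) + 1 bytes */
}

/* Hypothetical helper: returns 1 if the bytes after the string's
 * terminator still contain `needle`, 0 otherwise.
 * `cap` is the full size of the buffer, so no read leaves the array. */
int stale_data_present(const char *buf, size_t cap, const char *needle)
{
    size_t used = strlen(buf) + 1;
    size_t n = strlen(needle);

    for (size_t i = used; i + n <= cap; i++)
        if (memcmp(buf + i, needle, n) == 0)
            return 1;
    return 0;
}
```

If a 32-byte buffer first holds "password=hunter2" and is then reused for "ok", the tail of the old contents is still there, and any code that hands out the whole buffer instead of just the string discloses it.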

Answer 2:

ft_strlen() can read beyond the array the string resides in. This is often undefined behavior (UB).

Even with conditions that do not read into "un-owned" memory, the result is not 6 or a value that depends on array length.

#include <stdio.h>
#include <string.h>

/* ft_strlen() is the implementation from the question. */

int main(void) {

  struct xx {
    char str_pre[6];
    char str[6];
    char str_post[6];
    char str_postpost[6];
  } x = { "", "Hi!", "", "x" };
  printf("%d\n", ft_strlen(x.str));  /* --> 11, loop was stopped by "x" */

  char str[6] = "1234y";
  strcpy(str, "Hi!");
  printf("%d\n", ft_strlen(str));  /* --> 3, loop was stopped by "y" */

  return 0;
}

ft_strlen() is not reliable code to determine array size nor string length.

Is it always a bad idea to read after initialized values?

Clarity:

char str[6] = "hi!"; initializes all 6 elements of str. In C, there is no partial initialization - it is all or nothing.

Assignment can be partial.

char str[6];        // str uninitialized
strcpy(str, "Hi!"); // Only first 4 `char` assigned.

Reading after some initialized values implies reading into another object or, worse, outside the code's accessible memory. Attempting such an access is undefined behavior (UB) and is bad.
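The difference between initialization and assignment can be checked without leaving the array. A small sketch, assuming a helper name (trailing_zero_bytes) invented for this example; it takes the array size from the caller and never reads past it:

```c
#include <stddef.h>

/* Hypothetical helper: counts how many bytes at the end of a buffer
 * of known size are '\0'. All reads stay within buf[0..size-1]. */
size_t trailing_zero_bytes(const char *buf, size_t size)
{
    size_t count = 0;

    while (count < size && buf[size - 1 - count] == '\0')
        count++;
    return count;
}
```

For char str[6] = "hi!"; this reports 3, because the initializer zero-fills the remainder. For an array initialized with "1234y" and then overwritten with strcpy(str, "Hi!"), it reports 1, because the 'y' left over from the earlier contents is still in place.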

My question is related to WHY such standards may exist - why it may be important to never read past initialized memory.

This is really a core question about the design of C. C is a compromise. It is a language designed to work on many different platforms. To achieve that, it must be adaptable to all sorts of memory architectures. If C were to specify the result of "read after initialized values", then every implementation would need 1) seg-faulting, 2) bounds checking, or 3) some other software/hardware to implement that detection. This might make C more robust at error detection, but it would also enlarge or slow the emitted code. In other words, C trusts that the programmer is doing the right thing and does not try to catch such errors. An implementation might detect the issue; it might not. It is UB. C is coding on a tightrope without a net.

What is the potential fallout of reading past initialized values IN GENERAL (?)

C does not specify the result of attempting to do such a read so there is no general result of this UB. Common results, which may vary each time the code is run, include:

A zero is read.
A consistent garbage value is read.
An inconsistent garbage value is read.
A trap value is read. (Never applies to unsigned char though.)
Seg-fault or other stoppage of code.
Code invokes an executive handler (one step in a typical hacker exploit).
Code ventures off and does something else.

Answer 3:

Reading uninitialized memory is IMHO just a symptom here, so instead let's focus on your idea and on why it is wrong:

char str[6] = "hi!";
strlen(str); // evaluates to 3

This is what the C standard mandates and it's what everyone would expect. An implementation returning 6 here is just wrong. This has its reason in the way C handles arrays and strings:

Leaving VLAs (variable length arrays) aside here because they're just a special case with somewhat similar rules, the size of an array is fixed. In your code above, sizeof(str) is 6, and this is a compile-time constant. This size is only known where the array is in scope.

According to the specification of C, the identifier of an array evaluates to a pointer to its first element, except when used with sizeof, _Alignof or &. As one consequence, it's impossible to pass an array to a function; what you actually pass is the pointer. If you write a function to accept an array type, this type is adjusted to be a pointer type instead. ("Adjusted" is the wording of the C standard; it's commonly said that the array decays to a pointer.)

This specification allows C to treat an array as nothing more than a contiguous sequence of objects of the same type -- there is no metadata (like e.g. the length) stored with it.

So, if you're passing "arrays" around, therefore just having pointers to their first elements, how do you know the size of the array? There are two possibilities:

pass the size in a separate parameter of type size_t.
have a sentinel value at the end of your array.
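As a rough sketch of those two possibilities, using an int array and a sentinel value of -1 that is purely an assumption of this example (the function names are made up as well):

```c
#include <stddef.h>

/* Option 1: the caller passes the element count explicitly. */
long sum_with_size(const int *values, size_t count)
{
    long total = 0;

    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* Option 2: the array ends with a sentinel. SENTINEL is an assumed
 * convention for this example; it must never occur as real data. */
#define SENTINEL (-1)

long sum_until_sentinel(const int *values)
{
    long total = 0;

    while (*values != SENTINEL)
        total += *values++;
    return total;
}
```

Neither function can find out how large the underlying array really is. The first trusts the count it is given; the second trusts that the sentinel is present.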

Now, talking about strings in C: A string isn't a first-class citizen in C, it doesn't have its own type. It's defined as a sequence of char, ending with '\0'. Therefore you can store a string in a char[] and when you're working with strings, you don't need to pass lengths, because the sentinel value is already defined: every string ends with '\0'. But this also means whatever might come after a first '\0' is not part of the string.
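With that definition, a string-length function only needs the sentinel. A minimal sketch of what a conforming ft_strlen could look like (the name is taken from the question; returning size_t mirrors the standard strlen):

```c
#include <stddef.h>

/* Counts the characters up to, but not including, the first '\0'.
 * It stops at the terminator and never inspects the byte after it. */
size_t ft_strlen(const char *str)
{
    size_t length = 0;

    while (str[length] != '\0')
        length++;
    return length;
}
```

For char str[6] = "hi!"; this returns 3, regardless of how large the array holding the string is.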

So, with your idea, you mix up two things. You somehow want to have a function that returns the size of your array, something that isn't possible in general. You're using your array to store a string that's smaller than the array. Still, a function called strlen() is supposed to return the length of the string, which is an entirely different thing than the size of the array you use to hold your string.
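If the real goal is to know how much room is left in the array, one way to get there without reading past anything is to let the caller, who still sees the array, pass its size along. A sketch with hypothetical function names:

```c
#include <stddef.h>
#include <string.h>

/* Inside a function the parameter is only a pointer, so the array
 * size is gone; the string length is all that can be recovered. */
size_t length_seen_by_callee(const char *s)
{
    return strlen(s);
}

/* The caller passes sizeof(array) as array_size. The result is the
 * number of further characters that fit before the array is full.
 * Assumes s holds a terminated string no longer than array_size - 1. */
size_t free_space(const char *s, size_t array_size)
{
    return array_size - (strlen(s) + 1);
}
```

With char str[6] = "hi!"; the caller can evaluate sizeof str (6) and call free_space(str, sizeof str), which yields 2: two more characters fit in front of the terminator.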

You could even write something like this:

char foo[3] = "hi!";

This would initialize foo from the string constant "hi!", but foo would not contain a string, because it doesn't have the '\0' terminator. It would still be a valid char[]. But of course, you can't write a function finding out its size.

Summary: The size of an array is something completely different from the length of a string. You're mixing up both; the mistaken assumption that the size of an array could be determined inside a function leads to code with UB, and of course, this is potentially dangerous code that could crash or worse (be exploited).

Answer 4:

Have you heard of the "buffer overflow problem"? When you read outside the buffer, i.e. into uninitialized memory, malicious code may be hidden on the stack, and reading it could lead to that code being executed. More info here: https://en.wikipedia.org/wiki/Buffer_overflow

That is why it is very, very bad to read outside the initialized memory. Most compilers guard against this, though, either by not allowing it or by giving you a warning to protect the stack.

Answer 5:

It appears you want to keep track of allocated and used string memory. There is nothing wrong with that (although it's contrary to C's standard library approach). What is wrong, however, is trying to build this on a foundation that relies on UB. There are easier ways to shoot yourself in the foot.

Done right, you should rather follow a path that relies on clean code. One possible approach could be:

struct string_t
{
    int length;
    char strdata[];   /* flexible array member; storage is allocated in str_alloc() */
};

Then you would have to provide a suitable set of functions to deal with your own string type like

#include <stdlib.h>

/* Allocates room for `length` characters plus the terminating '\0'. */
struct string_t *str_alloc(int length)
{
    struct string_t *s;

    s = malloc(sizeof(struct string_t) + length + 1);

    if (s)
        s->length = length;

    return s;
}

void str_free(struct string_t *s)
{
    free(s);
}

Might be a good exercise to go through the implementation of this with more functions like str_cat(), str_cpy() and more. This will probably also show you why the standard library does things just the way it does.
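As a starting point for that exercise, here is one possible sketch of str_cpy() and str_cat(). Their signatures and return convention (0 on success, -1 if the text does not fit) are assumptions of this example, not something the answer prescribes. The struct is repeated with a flexible array member so the block compiles on its own, and length is treated as the capacity in characters.

```c
#include <stdlib.h>
#include <string.h>

struct string_t
{
    int  length;     /* capacity in characters, excluding the '\0' */
    char strdata[];  /* flexible array member (C99) */
};

struct string_t *str_alloc(int length)
{
    struct string_t *s;

    if (length < 0)
        return NULL;
    s = malloc(sizeof(struct string_t) + (size_t)length + 1);
    if (s)
    {
        s->length = length;
        s->strdata[0] = '\0';  /* start out as an empty string */
    }
    return s;
}

/* Hypothetical: copies src into dst if it fits. */
int str_cpy(struct string_t *dst, const char *src)
{
    size_t n = strlen(src);

    if (n > (size_t)dst->length)
        return -1;
    memcpy(dst->strdata, src, n + 1);  /* includes the '\0' */
    return 0;
}

/* Hypothetical: appends src to dst if the result fits. */
int str_cat(struct string_t *dst, const char *src)
{
    size_t used = strlen(dst->strdata);
    size_t n = strlen(src);

    if (n > (size_t)dst->length - used)
        return -1;
    memcpy(dst->strdata + used, src, n + 1);
    return 0;
}
```

Because the capacity travels with the data, both functions can refuse an operation that would not fit, and neither ever needs to look past the end of the allocation.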

Answer 6:

-- Big final last edit --

So the correct "not an answer to my question" answer to my question fell into my lap today...

It turns out I am not the first person who ever thought it would be useful to be able to count available, allocated, and initialized (zero/null term/other) memory values.

The correct way to handle this situation is to bookend memory allocations for specific uses with the ASCII char 'us' (decimal: 31).

'us' is the unit separator - its purpose is to define a use-specific unit. The original IBM manual states: "its specific meaning has to be specified for each application". In our case, to signal the end of available safe write space in an array.

So my mem block should have read:

['h']['i']['!']['\0']['\0']['\0']['\0']['us']

Thus eliminating the need to EVER read outside of memory.

You're welcome, person this answer is for C:

Source: https://stackoverflow.com/questions/45159019/is-reading-into-uninitialized-memory-space-always-ill-advised

Tags

posix

standards

c-standard-library

data-handling