In multi-threaded embedded software (written in C or C++), a thread must be given enough stack space in order to allow it to complete its operations without overflowing. Co
Static (off-line) stack checking is not as difficult as it seems. I have implemented it for our embedded IDE (RapidiTTy) — it currently works for ARM7 (NXP LPC2xxx), Cortex-M3 (STM32 and NXP LPC17xx), x86 and our in-house MIPS ISA compatible FPGA soft-core.
Essentially we use a simple parse of the executable code to determine the stack usage of each function. Most significant stack allocation is done at the start of each function; just be sure to see how it alters with different optimisation levels and, if applicable, ARM/Thumb instruction sets, etc. Remember also that tasks usually have their own stacks, and ISRs often (but not always) share a separate stack area!
Once you have the usage of each function, it's fairly easy to build up a call-tree from the parse and calculate the maximum usage for every function. Our IDE generates schedulers (effective thin RTOSes) for you, so we know exactly which functions are being designated as 'tasks' and which are ISRs, so we can tell the worst-case usage for each stack area.
Of course, these figures are almost always over the actual maximum. Think of a function like sprintf that can use a lot of stack space, but varies enormously depending on the format string and parameters that you provide. For these situations you can also use dynamic analysis — fill the stack with a known value in your startup, then run in the debugger for a while, pause and see how much of each stack is still filled with your value (high watermark style testing).
Neither approach is perfect, but combining both will give you a fairly good picture of what the real-world usage is going to be like.