Why can't 64-bit Windows unwind user-kernel-user exceptions?

Question

Why can't 64-bit Windows unwind the stack during an exception, if the stack crosses the kernel boundary - when 32-bit Windows can?

The context of this entire question comes from:

The case of the disappearing OnLoad exception – user-mode callback exceptions in x64

Background

In 32-bit Windows, if i throw an exception in my user-mode code that was called back from kernel-mode code, which was itself called from my user-mode code, e.g.:

User mode                     Kernel Mode
------------------            -------------------
CreateWindow(...);   ------>  NtCreateWindow(...)
                                   |
WindowProc   <---------------------+

the Structured Exception Handling (SEH) in Windows can unwind the stack, unwinding back through kernel mode, back into my user code, where i can handle the exception and i see a valid stack trace.

But not in 64-bit Windows

64-bit editions of Windows cannot do this:

For complicated reasons, we cannot propagate the exception back on 64-bit operating systems (amd64 and IA64). This has been the case ever since the first 64-bit release of Server 2003. On x86, this isn’t the case – the exception gets propagated through the kernel boundary and would end up walking the frames back

And since there's no way to walk back a reliable stack trace in this case, they had to make a decision: let you see the nonsensical exception, or hide it altogether:

The kernel architects at the time decided to take the conservative AppCompat-friendly approach – hide the exception, and hope for the best.

The article goes on to say that this is how all 64-bit Windows operating systems behaved:

Windows XP 64-bit
Windows Server 2003 64-bit
Windows Vista 64-bit
Windows Server 2008 64-bit

But starting with Windows 7 (and Windows Server 2008 R2), the architects changed their minds - sort of. For 64-bit applications only (not 32-bit applications), they would (by default) stop suppressing these user-kernel-user exceptions. So, by default, on:

Windows 7 64-bit
Windows Server 2008 R2

all 64-bit applications will see these exceptions, where they never used to see them.

In Windows 7, when a native x64 application crashes in this fashion, the Program Compatibility Assistant is notified. If the application doesn’t have a Windows 7 Manifest, we show a dialog telling you that PCA has applied an Application Compatibility shim. What does this mean? This means, that the next time you run your application, Windows will emulate the Server 2003 behavior and make the exception disappear. Keep in mind, that PCA doesn’t exist on Server 2008 R2, so this advice doesn’t apply.

So the question

The question is why is 64-bit Windows unable to unwind a stack back through a kernel transition, while 32-bit editions of Windows can?

The only hint is:

For complicated reasons, we cannot propagate the exception back on 64-bit operating systems (amd64 and IA64).

The hint is it's complicated.

i may not understand the explanation, as i'm not an operating system developer - but i'd like a shot at knowing why.

Update: Hotfix to stop suppressing 32-bit apps

Microsoft has released a hotfix that lets 32-bit applications also stop having these exceptions suppressed:

KB976038: Exceptions that are thrown from an application that runs in a 64-bit version of Windows are ignored

An exception that is thrown in a callback routine runs in the user mode.

In this scenario, this exception does not cause the application to crash. Instead, the application enters into an inconsistent state. Then, the application throws a different exception and crashes.

A user mode callback function is typically an application-defined function that is called by a kernel mode component. Examples of user mode callback functions are Windows procedures and hook procedures. These functions are called by Windows to process Windows messages or to process Windows hook events.

The hotfix then lets you stop Windows from eating the exceptions globally:

HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options
DisableUserModeCallbackFilter: DWORD = 1

or per-application:

HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\Notepad.exe
DisableUserModeCallbackFilter: DWORD = 1
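
As a minimal sketch, the two registry settings above can be turned into `reg add` command lines. The helper name `reg_add_command` is made up for this illustration; the key path and value name are the ones quoted from the KB article. The snippet only builds and prints the commands, it does not run them or touch the registry.

```python
# Key path and value name as quoted from KB976038 above.
IFEO_KEY = r"HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options"
VALUE_NAME = "DisableUserModeCallbackFilter"


def reg_add_command(image_name=None):
    """Return the argument list for `reg add` (hypothetical helper, nothing is executed).

    image_name=None targets the global key; otherwise the per-application subkey.
    """
    key = IFEO_KEY if image_name is None else IFEO_KEY + "\\" + image_name
    return ["reg", "add", key, "/v", VALUE_NAME, "/t", "REG_DWORD", "/d", "1", "/f"]


def to_command_line(args):
    """Join arguments for display, quoting any argument that contains a space."""
    return " ".join('"' + a + '"' if " " in a else a for a in args)


if __name__ == "__main__":
    print(to_command_line(reg_add_command()))               # global setting
    print(to_command_line(reg_add_command("Notepad.exe")))  # per-application setting
```

The printed commands would need to be run from an elevated command prompt on the target machine.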

The behavior was also documented on XP and Server 2003 in KB973460:

Exceptions that are thrown from a 64-bit application that is running in the 64-bit editions of Windows Server 2003 or of Windows XP Professional are silently ignored

A hint

i found another hint when investigating using xperf to capture stack traces on 64-bit Windows:

Stack Walking in Xperf

Disable Paging Executive

In order for tracing to work on 64-bit Windows you need to set the DisablePagingExecutive registry key. This tells the operating system not to page kernel mode drivers and system code to disk, which is a prerequisite for getting 64-bit call stacks using xperf, because 64-bit stack walking depends on metadata in the executable images, and in some situations the xperf stack walk code is not allowed to touch paged out pages. Running the following command from an elevated command prompt will set this registry key for you.
 REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
After setting this registry key you will need to reboot your system before you can record call stacks. Having this flag set means that the Windows kernel locks more pages into RAM, so this will probably consume about 10 MB of additional physical memory.

This gives the impression that in 64-bit Windows (and only in 64-bit Windows), you are not allowed to walk kernel stacks because there might be pages out on disk.

Answer 1:

I'm the developer who wrote this Hotfix a loooooooong time ago as well as the blog post. The main reason is that the full register file isn't always captured when you transition into kernel space, for performance reasons.

If you make a normal syscall, the x64 Application Binary Interface (ABI) only requires you to preserve the non-volatile registers (similar to making a normal function call). However, correctly unwinding the exception requires you to have all the registers, so it's not possible. Basically, this was a choice between perf in a critical scenario (i.e. a scenario that potentially happens thousands of times per second) vs. 100% correctly handling a pathological scenario (a crash).
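
To illustrate why a missing register matters, here is a toy Python model. It is not Windows code and not how the kernel or the real unwinder is implemented: `UnwindInfo`, `virtual_unwind`, `walk`, the frame names and the frame sizes are all invented for the sketch. The only idea it borrows is that x64 unwinding is driven by per-function metadata applied to a register context, so a step that depends on a register nobody captured cannot be completed.

```python
from dataclasses import dataclass


@dataclass
class UnwindInfo:
    """Toy stand-in for one function's unwind metadata (hypothetical layout)."""
    name: str
    frame_register: str  # register the frame is addressed from, e.g. "rsp" or "rbp"
    frame_size: int      # bytes from the frame base up to the return-address slot


def virtual_unwind(context, info):
    """Derive the caller's rsp from the callee's register context.

    Raises ValueError if the register this frame depends on was never captured.
    """
    base = context.get(info.frame_register)
    if base is None:
        raise ValueError(f"cannot unwind {info.name}: {info.frame_register} was not captured")
    caller = dict(context)
    caller["rsp"] = base + info.frame_size + 8  # drop the frame, then the 8-byte return address
    return caller


def walk(context, frames):
    """Unwind through every frame in order; return the names successfully unwound."""
    unwound = []
    for info in frames:
        context = virtual_unwind(context, info)
        unwound.append(info.name)
    return unwound


FRAMES = [
    UnwindInfo("WindowProc", "rsp", 0x40),
    UnwindInfo("callback_stub", "rbp", 0x20),  # hypothetical frame-pointer-based frame
]

if __name__ == "__main__":
    # Full context: both frames unwind.
    print(walk({"rsp": 0x1000, "rbp": 0x2000}, FRAMES))
    # Partial context (rbp was not saved at the transition): the walk fails.
    try:
        walk({"rsp": 0x1000, "rbp": None}, FRAMES)
    except ValueError as err:
        print(err)
```

With the full context both frames unwind; with `rbp` missing the walk stops at the second frame, which is the kind of failure a partially captured register file would cause.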

Bonus Reading

Overview of x64 Calling Conventions
x86 Software Conventions - Register Usage

Answer 2:

A very good question.

I can give a hint of why "propagating" an exception across kernel-user boundary is somewhat problematic.

Citation from your question:

Why can't 64-bit Windows unwind the stack during an exception, if the stack crosses the kernel boundary - when 32-bit Windows can?

The reason is very simple: there's no such thing as a "stack that crosses the kernel boundary". Calling a kernel-mode function is by no means comparable to a standard function call, and it has nothing to do with the call stack. As you probably know, kernel-mode memory is simply inaccessible from user mode.

Invoking a kernel-mode function (a.k.a. a syscall) is implemented by triggering a software interrupt (or a similar mechanism). User-mode code puts some values into registers (which identify the requested kernel-mode service) and executes a CPU instruction (such as sysenter) that switches the CPU into kernel mode and passes control to the OS.

Kernel-mode code then handles the requested syscall. It runs on a separate kernel-mode stack (which has nothing to do with the user-mode stack). After the request has been handled, control returns to user-mode code. Depending on the specific syscall, the user-mode return address may be the one that invoked the kernel-mode transition, or it may be a different address.

Sometimes you call a kernel-mode function that, "in the middle", needs to invoke user-mode code. It may look like a call stack consisting of user-kernel-user code, but it's just an emulation. In such a case the kernel-mode code transfers control to user-mode code that wraps your user-mode function. This wrapper calls your function and, immediately upon its return, triggers a kernel-mode transition.

Now, if the user-mode code "invoked from kernel mode" raises an exception, this is what should happen:

1. The wrapper user-mode code handles the SEH exception (i.e. it stops the propagation, but doesn't unwind the stack yet).
2. It passes control to kernel mode (the OS), as in the normal program flow.
3. Kernel-mode code responds appropriately and finishes the requested service. The processing may differ depending on whether there was a user-mode exception.
4. Upon return to user mode, the kernel-mode code may indicate whether there was a nested exception. If there was, the stack is not restored to its original state (since no unwinding has happened yet).
5. User-mode code checks whether there was such an exception. If so, the call stack is forged to include the nested user-mode call, and the exception propagates.

So an exception that crosses the kernel-user boundary is an emulation. There's no such thing natively.
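
The flow above can be sketched as a small Python simulation. This is only a model of the control flow, under the assumption that plain function calls stand in for the mode transitions; all names (`callback_wrapper`, `kernel_service`, `syscall_stub`, `UserCallbackError`) are invented for the sketch. Unlike the real mechanism, Python unwinds its own stack as soon as the `except` clause runs, so the "no unwinding yet" part of the first step is not reproduced.

```python
class UserCallbackError(Exception):
    """Stands in for an SEH exception raised inside a user-mode callback."""


def callback_wrapper(user_callback):
    """User-mode wrapper: run the callback, capture any exception, report back."""
    try:
        return {"result": user_callback(), "exception": None}
    except Exception as exc:  # stop the propagation at the wrapper
        return {"result": None, "exception": exc}


def kernel_service(user_callback, log):
    """'Kernel' side: call out to user mode, then finish the service either way."""
    log.append("kernel: enter service")
    outcome = callback_wrapper(user_callback)  # kernel -> user -> kernel round trip
    log.append("kernel: finish service")
    return outcome  # tells the caller whether a nested exception occurred


def syscall_stub(user_callback, log):
    """Original user-mode caller: re-raise the captured exception, if any."""
    outcome = kernel_service(user_callback, log)
    if outcome["exception"] is not None:
        log.append("user: re-raising nested exception")
        raise outcome["exception"]
    return outcome["result"]


if __name__ == "__main__":
    def failing_window_proc():
        raise UserCallbackError("thrown inside the callback")

    log = []
    try:
        syscall_stub(failing_window_proc, log)
    except UserCallbackError as err:
        log.append(f"user: caught '{err}'")
    print("\n".join(log))
```

The log shows the "kernel" finishing its service before the exception reappears in the original caller, which is the emulation described above.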

Source: https://stackoverflow.com/questions/11376795/why-cant-64-bit-windows-unwind-user-kernel-user-exceptions

Tags: windows, 64-bit, windows64, structured-exception, windows-appcompat-platform