It seems to be a recurring problem that many Intel processors (up until Skylake, unless I'm wrong) exhibit poor performance when mixing AVX-256 instructions with legacy SSE instructions.
To keep the relevant content on-site, I've extracted the key paragraphs from the link Michael provided in the comments. All credit goes to him.
The link points to a very similar question Agner Fog asked on Intel's forum.
[Fog in response to Intel's answer] If I understand you right, you decided that it is necessary to have two versions of all 128-bit instructions in order to avoid destroying the upper part of the YMM registers in case an interrupt calls a device driver using legacy XMM instructions.
Intel was concerned that making legacy SSE instructions zero the upper part of the YMM registers would mean that ISRs using those instructions would suddenly clobber the new YMM state. Without support for saving the new YMM context, this would have made the use of AVX impossible under any circumstances.
However, Fog was not completely satisfied and pointed out that simply recompiling a driver with an AVX-aware compiler (so that VEX instructions were used) would produce the same outcome.
Intel replied that their goal was to avoid forcing existing software to be rewritten.
There is no way we could compel the industry to rewrite/fix all of their existing drivers (for example to use XSAVE) and no way to guarantee they would have done so successfully. Consider for example the pain the industry is still going through on the transition from 32 to 64-bit operating systems! The feedback we have from OS vendors also precluded adding overhead to the ISR servicing to add the state management overhead on every interrupt. We didn't want to inflict either of these costs on portions of the industry that don't even typically use wide vectors.
With two versions of the instructions, AVX support in drivers can be achieved the same way it has been for FPU/SSE:
The example given is similar to the current scenario where a ring-0 driver (ISR) vendor attempts to use floating-point state, or accidentally links it in some library, in OSs that do not automatically manage that context at Ring-0. This is a well known source of bugs and I can suggest only the following:
On those OSs, driver developers are discouraged from using floating-point or AVX
Driver developers should be encouraged to disable hardware features during driver validation (i.e. AVX state can be disabled by drivers in Ring-0 through XSETBV)
Background: the decision was made early to make KeSaveFloatingPointState do nothing on Windows x64 and to allow XMM registers to be used without extra save/restore calls even in drivers. Obviously these drivers would not be aware of AVX or the YMM registers.