Why wasn't MASKMOVDQU extended to 256-bit and 512-bit stores?
Question: The MASKMOVDQU¹ instruction is special among x86 store instructions because, in principle, it lets you store individual bytes within a cache line without first loading the entire cache line all the way to the core so that the written bytes can be merged with the existing bytes that are not overwritten. It would seem to work using the same mechanism as an NT (non-temporal) store: pushing the cache line down without first doing an RFO (read for ownership). Per the Intel software developer's manual (emphasis mine): The MASKMOVQ instruction can be