Packing two DWORDs into a QWORD to save store bandwidth
Question

Imagine a load-store loop like the following, which loads DWORDs from non-contiguous locations and stores them contiguously:

    top:
        mov eax, DWORD [rsi]
        mov DWORD [rdi], eax
        mov eax, DWORD [rdx]
        mov DWORD [rdi + 4], eax
        ; unroll the above a few times
        ; increment rdi and rsi somehow
        cmp ...
        jne top

On modern Intel and AMD hardware, when running in-cache, such a loop will usually bottleneck on stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one