I'm trying to understand _mm256_permute2x128_si256. Are all 256 bits of register a read into the case first, and then all 256 bits of register b read into the case after? Or i