Transpose for 8 registers of 16-bit elements on SSE2/SSSE3
Question

(I'm new to SSE/asm, so apologies if this is obvious or redundant.)

Is there a better way to transpose 8 SSE registers containing 16-bit values than performing 24 unpck[lh]ps and 8/16+ shuffles, using 8 extra registers? (Note: only instructions up to SSSE3 are available. The target is Intel Merom, which lacks the BLEND* instructions from SSE4.)

Say you have registers v[0-7] and use t0-t7 as auxiliary registers. In pseudo-intrinsics code:

/* Phase 1: process lower parts of the registers */
/* Level 1: work first part of the vectors */
/*
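For comparison, here is a minimal sketch of one commonly used alternative, not the approach from the question: an 8x8 transpose of 16-bit elements built only from SSE2 integer unpacks at three widths (16-bit, 32-bit, 64-bit). The function name `transpose8x8_epi16` and the in-place array interface are assumptions made for illustration. How the temporaries map onto physical XMM registers is left to the compiler.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: transposes an 8x8 matrix of 16-bit elements held in
 * v[0..7], one row per register, in place. In the comments, "rc" means the
 * element at row r, column c of the input. */
static void transpose8x8_epi16(__m128i v[8])
{
    /* Stage 1: interleave 16-bit elements of adjacent row pairs. */
    __m128i a0 = _mm_unpacklo_epi16(v[0], v[1]); /* 00 10 01 11 02 12 03 13 */
    __m128i a1 = _mm_unpackhi_epi16(v[0], v[1]); /* 04 14 05 15 06 16 07 17 */
    __m128i a2 = _mm_unpacklo_epi16(v[2], v[3]); /* 20 30 21 31 22 32 23 33 */
    __m128i a3 = _mm_unpackhi_epi16(v[2], v[3]); /* 24 34 25 35 26 36 27 37 */
    __m128i a4 = _mm_unpacklo_epi16(v[4], v[5]); /* 40 50 41 51 42 52 43 53 */
    __m128i a5 = _mm_unpackhi_epi16(v[4], v[5]); /* 44 54 45 55 46 56 47 57 */
    __m128i a6 = _mm_unpacklo_epi16(v[6], v[7]); /* 60 70 61 71 62 72 63 73 */
    __m128i a7 = _mm_unpackhi_epi16(v[6], v[7]); /* 64 74 65 75 66 76 67 77 */

    /* Stage 2: interleave 32-bit pairs, giving half-columns of 4 rows. */
    __m128i b0 = _mm_unpacklo_epi32(a0, a2); /* 00 10 20 30 01 11 21 31 */
    __m128i b1 = _mm_unpackhi_epi32(a0, a2); /* 02 12 22 32 03 13 23 33 */
    __m128i b2 = _mm_unpacklo_epi32(a1, a3); /* 04 14 24 34 05 15 25 35 */
    __m128i b3 = _mm_unpackhi_epi32(a1, a3); /* 06 16 26 36 07 17 27 37 */
    __m128i b4 = _mm_unpacklo_epi32(a4, a6); /* 40 50 60 70 41 51 61 71 */
    __m128i b5 = _mm_unpackhi_epi32(a4, a6); /* 42 52 62 72 43 53 63 73 */
    __m128i b6 = _mm_unpacklo_epi32(a5, a7); /* 44 54 64 74 45 55 65 75 */
    __m128i b7 = _mm_unpackhi_epi32(a5, a7); /* 46 56 66 76 47 57 67 77 */

    /* Stage 3: join 64-bit halves so each register holds one full column. */
    v[0] = _mm_unpacklo_epi64(b0, b4); /* 00 10 20 30 40 50 60 70 */
    v[1] = _mm_unpackhi_epi64(b0, b4); /* 01 11 21 31 41 51 61 71 */
    v[2] = _mm_unpacklo_epi64(b1, b5); /* 02 12 22 32 42 52 62 72 */
    v[3] = _mm_unpackhi_epi64(b1, b5); /* 03 13 23 33 43 53 63 73 */
    v[4] = _mm_unpacklo_epi64(b2, b6); /* 04 14 24 34 44 54 64 74 */
    v[5] = _mm_unpackhi_epi64(b2, b6); /* 05 15 25 35 45 55 65 75 */
    v[6] = _mm_unpacklo_epi64(b3, b7); /* 06 16 26 36 46 56 66 76 */
    v[7] = _mm_unpackhi_epi64(b3, b7); /* 07 17 27 37 47 57 67 77 */
}
```

This sketch still issues 24 unpack operations, but it needs no separate shuffles and nothing beyond SSE2. Whether it runs faster on Merom than the version in the question would have to be measured.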