问题 I have been using the following "trick" in C code with SSE2 for single precision floats for a while now: static inline __m128 SSEI_m128shift(__m128 data) { return (__m128)_mm_srli_si128(_mm_castps_si128(data), 4); } For data like [1.0, 2.0, 3.0, 4.0] , it results in [2.0, 3.0, 4.0, 0.0] , i.e. it does a left shift by one position and fills the data structure with a zero. If I remember correctly, the above inline function compiles down to a single instruction (with gcc at least). I am somehow