What makes you think that _mm_load_ps and _mm_store_ps "add a significant overhead"? They are the normal way to move float data between memory and SSE registers, and any other method eventually boils down to the same loads and stores anyway.
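To make that concrete, here is a minimal sketch of the usual load / compute / store pattern. The function name `add_floats_sse` and its parameters are made up for illustration, and the code assumes that `n` is a multiple of 4 and that all three pointers are 16-byte aligned (which `_mm_load_ps` and `_mm_store_ps` require).

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats.
   Assumes n is a multiple of 4 and dst, a, b are 16-byte aligned. */
void add_floats_sse(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             /* aligned load of 4 floats */
        __m128 vb = _mm_load_ps(b + i);             /* aligned load of 4 floats */
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));  /* add, then aligned store */
    }
}
```

With optimization enabled, a compiler will typically turn each of these intrinsics into a single `movaps`, or fold a load into the memory operand of the arithmetic instruction, so there is no extra layer to pay for. If in doubt, check the generated assembly for your own build.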