I want to use more than one ymm register to speed up a memory copy. Here is a snippet of my code.
__m256 ymm[2]; ymm[0] = _mm256_load_ps(_src1);
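
To make the idea concrete, here is a minimal sketch of what a copy loop using both elements of `ymm[2]` might look like. It is not taken from my real code: the function name `copy_two_ymm`, the `dst` and `n` parameters, and the scalar tail loop are all hypothetical. It also uses the unaligned `_mm256_loadu_ps` / `_mm256_storeu_ps` instead of `_mm256_load_ps`, so it does not assume 32-byte aligned buffers, and it assumes GCC or Clang for the `target("avx")` attribute (with other compilers, drop the attribute and enable AVX with a compiler flag).

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper (not from the original snippet): copies n floats
   from src to dst, moving two YMM registers' worth of data per iteration.
   The target attribute is a GCC/Clang extension that enables AVX for this
   function only. */
__attribute__((target("avx")))
void copy_two_ymm(float *dst, const float *src, size_t n)
{
    size_t i = 0;

    /* Main loop: 16 floats (2 x 256 bits) per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m256 ymm[2];
        /* Issue both loads first, then both stores. */
        ymm[0] = _mm256_loadu_ps(src + i);
        ymm[1] = _mm256_loadu_ps(src + i + 8);
        _mm256_storeu_ps(dst + i,     ymm[0]);
        _mm256_storeu_ps(dst + i + 8, ymm[1]);
    }

    /* Scalar tail for the remaining 0..15 floats. */
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
}
```

Calling `copy_two_ymm(dst, src, n)` should leave the first `n` floats of `dst` equal to those of `src`. Whether the second register actually makes the copy faster is something I would have to measure, since the compiler decides how the `ymm` array is mapped to real registers.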