I am new to the SSE instructions and I was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming
I am using t
This doesn't directly answer your question, but I want to point out that your SSE code is written incorrectly, and I would be surprised if it works. You need to use load/store operations on non-SSE types, and that includes aligned non-SSE types such as your aligned float array (you need to do this even if you have a dynamic array of an SSE type). Keep in mind that when you work with SSE, the SSE data types are supposed to represent data held in SSE registers, while everything else usually lives in system memory or in non-SSE registers, so you have to load from memory into a register and store from a register back to memory. This is how your function should look:
    #include <xmmintrin.h> // SSE intrinsics: __m128, _mm_load_ps, _mm_store_ps, ...

    // All three arrays must be 16-byte aligned and nSize must be a multiple of 4.
    void myssefunction
    (
        const float* pArray1, // [in] first source array
        const float* pArray2, // [in] second source array
        float* pResult,       // [out] result array
        int nSize             // [in] size of all arrays
    )
    {
        const __m128 m0_5 = _mm_set_ps1(0.5f); // m0_5[0, 1, 2, 3] = 0.5
        for (int index = 0; index < nSize; index += 4)
        {
            __m128 pSrc1 = _mm_load_ps(pArray1 + index); // load 4 elements from memory into an SSE register
            __m128 pSrc2 = _mm_load_ps(pArray2 + index); // load 4 elements from memory into an SSE register
            __m128 m1 = _mm_mul_ps(pSrc1, pSrc1);        // m1 = pSrc1 * pSrc1
            __m128 m2 = _mm_mul_ps(pSrc2, pSrc2);        // m2 = pSrc2 * pSrc2
            __m128 m3 = _mm_add_ps(m1, m2);              // m3 = m1 + m2
            __m128 m4 = _mm_sqrt_ps(m3);                 // m4 = sqrt(m3)
            __m128 pDest = _mm_add_ps(m4, m0_5);         // pDest = m4 + 0.5
            _mm_store_ps(pResult + index, pDest);        // store 4 elements from the SSE register to memory
        }
    }
It is also worth noting that there is a limit on how many SSE registers can be in use at any given time: 8 XMM registers in 32-bit mode and 16 in 64-bit mode. You can write code that tries to keep more values live than that, but the compiler will then spill registers to the stack, which costs extra memory traffic.
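The function above relies on `_mm_load_ps` and `_mm_store_ps`, which require 16-byte aligned addresses, and it processes elements in groups of four. If you cannot guarantee either condition, one possible variant is sketched below. It is only an illustration: the name `hypot_plus_half` is hypothetical and not from your code, and it assumes an x86 compiler that provides `<xmmintrin.h>`. It uses the unaligned intrinsics `_mm_loadu_ps` / `_mm_storeu_ps` and finishes any leftover elements with a plain scalar loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical helper (not part of the original post):
// result[i] = sqrt(a[i]^2 + b[i]^2) + 0.5 for any n and any alignment.
void hypot_plus_half(const float* a, const float* b, float* result, std::size_t n)
{
    assert(a != nullptr && b != nullptr && result != nullptr);

    const __m128 half = _mm_set1_ps(0.5f); // all four lanes = 0.5
    std::size_t i = 0;

    // Vector part: four floats per iteration, no alignment requirement.
    for (; i + 4 <= n; i += 4)
    {
        __m128 va  = _mm_loadu_ps(a + i);   // unaligned load of 4 floats
        __m128 vb  = _mm_loadu_ps(b + i);   // unaligned load of 4 floats
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va), _mm_mul_ps(vb, vb));
        _mm_storeu_ps(result + i, _mm_add_ps(_mm_sqrt_ps(sum), half));
    }

    // Scalar tail: the remaining 0 to 3 elements.
    for (; i < n; ++i)
        result[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]) + 0.5f;
}
```

If you do control the allocation, keeping the aligned version and declaring the arrays with `alignas(16)` (or allocating them with `_mm_malloc`) avoids the unaligned accesses altogether.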