Compilation of a simple C++ program using SSE intrinsics

北荒 2020-12-16 05:17

I am new to the SSE instructions and I was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming

I am using t

3 answers
  •  北海茫月
    2020-12-16 05:55

    This doesn't directly answer your question, but I want to point out that your SSE code is written incorrectly, and I would be surprised if it works. You need to use load/store operations on non-SSE types, and that includes aligned non-SSE types such as your aligned float array (you need to do this even if you have a dynamic array of an SSE type). Keep in mind that when you work with SSE, the SSE data types are supposed to represent data held in SSE registers, while everything else usually lives in system memory or in non-SSE registers, so you have to load from memory into a register and store from a register back to memory. This is how your function should look:

    #include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_load_ps, ...)

    // Assumes all three arrays are 16-byte aligned and nSize is a multiple of 4:
    // _mm_load_ps/_mm_store_ps require aligned addresses, and the loop always
    // processes 4 floats per iteration.
    void myssefunction
    (
        float* pArray1,                   // [in] first source array
        float* pArray2,                   // [in] second source array
        float* pResult,                   // [out] result array
        int nSize                         // [in] size of all arrays
    )
    {
        const __m128 m0_5 = _mm_set_ps1(0.5f);        // m0_5[0, 1, 2, 3] = 0.5
        for (int index = 0; index < nSize; index += 4) // int index matches the type of nSize
        {
            __m128 src1 = _mm_load_ps(pArray1 + index); // load 4 elements from memory into an SSE register
            __m128 src2 = _mm_load_ps(pArray2 + index); // load 4 elements from memory into an SSE register

            __m128 m1   = _mm_mul_ps(src1, src1);       // m1 = src1 * src1
            __m128 m2   = _mm_mul_ps(src2, src2);       // m2 = src2 * src2
            __m128 m3   = _mm_add_ps(m1, m2);           // m3 = m1 + m2
            __m128 m4   = _mm_sqrt_ps(m3);              // m4 = sqrt(m3)
            __m128 dest = _mm_add_ps(m4, m0_5);         // dest = m4 + 0.5

            _mm_store_ps(pResult + index, dest);        // store 4 elements from the SSE register to memory
        }
    }
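
    The function above only works when every array is 16-byte aligned and the size is a multiple of 4. If you cannot guarantee that, one possible variation is sketched below. The name myssefunction_unaligned is made up for illustration and is not from the article; it assumes you are willing to use the unaligned load/store intrinsics (_mm_loadu_ps, _mm_storeu_ps) and a plain scalar loop for any leftover elements.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical variant of the function above (name is illustrative only):
// accepts unaligned pointers and sizes that are not a multiple of 4.
void myssefunction_unaligned(const float* pArray1,
                             const float* pArray2,
                             float* pResult,
                             std::size_t nSize)
{
    assert(pArray1 && pArray2 && pResult);

    const __m128 m0_5 = _mm_set1_ps(0.5f);  // all four lanes = 0.5

    std::size_t index = 0;

    // Vector part: 4 floats per iteration, using unaligned loads and stores.
    for (; index + 4 <= nSize; index += 4)
    {
        __m128 src1 = _mm_loadu_ps(pArray1 + index);
        __m128 src2 = _mm_loadu_ps(pArray2 + index);

        __m128 sum  = _mm_add_ps(_mm_mul_ps(src1, src1),
                                 _mm_mul_ps(src2, src2));
        __m128 dest = _mm_add_ps(_mm_sqrt_ps(sum), m0_5);

        _mm_storeu_ps(pResult + index, dest);
    }

    // Scalar tail: the remaining 0 to 3 elements.
    for (; index < nSize; ++index)
    {
        const float a = pArray1[index];
        const float b = pArray2[index];
        pResult[index] = std::sqrt(a * a + b * b) + 0.5f;
    }
}
```

    If you do control the allocation, declaring the arrays with alignas(16) and keeping the aligned _mm_load_ps/_mm_store_ps version is the simpler route.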
    

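    It is also worth noting that there is a limit on how many SSE registers can be in use at a given time (8 XMM registers in 32-bit mode, 16 in 64-bit mode). You can write code that tries to use more than that, but it will cause register spilling, where the compiler has to move values out to memory and load them back later.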
