Compilation of a simple c++ program using SSE intrinsics

北荒 2020-12-16 05:17

I am new to the SSE instructions and I was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming

I am using t

3 Answers
  • 2020-12-16 05:49

    Short answer: use _mm_malloc and _mm_free from xmmintrin.h instead of _aligned_malloc and _aligned_free.

    Discussion

    You should not use _aligned_malloc, _aligned_free, posix_memalign, memalign, or anything similar when you are writing SSE/AVX code. These are all compiler- or platform-specific functions: _aligned_malloc and _aligned_free come from MSVC, memalign is a GNU extension, and posix_memalign is POSIX.

    Intel introduced the functions _mm_malloc and _mm_free in its own compiler specifically for SIMD computations. Other compilers targeting x86 added them too (just as they regularly add Intel intrinsics). In this sense they are the only cross-platform solution: they should be available in every compiler that supports SSE.

    These functions are declared in the xmmintrin.h header. The header for any later SSE/AVX version automatically includes the earlier ones, so including only emmintrin.h (SSE2) or smmintrin.h (SSE4.1), for example, is enough.
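
    As a minimal sketch of that advice, the helpers below wrap _mm_malloc and _mm_free for float buffers. The helper names (alloc_aligned_floats, free_aligned_floats, is_aligned_16) are hypothetical and not part of any header; only _mm_malloc(size, alignment) and _mm_free come from xmmintrin.h.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; also declares _mm_malloc / _mm_free
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` sits on a 16-byte boundary.
bool is_aligned_16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: allocate `count` floats on a 16-byte boundary,
// the alignment that _mm_load_ps and _mm_store_ps require.
// Returns a null pointer if the allocation fails.
float* alloc_aligned_floats(std::size_t count)
{
    float* p = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
    assert(p == nullptr || is_aligned_16(p));
    return p;
}

// Hypothetical helper: memory obtained from _mm_malloc must be released
// with _mm_free, not with free() or delete.
void free_aligned_floats(float* p)
{
    _mm_free(p);
}
```

    A caller would pair the two, for example `float* a = alloc_aligned_floats(1024);` followed later by `free_aligned_floats(a);`.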

  • 2020-12-16 05:55

    This doesn't directly answer your question, but I want to point out that your SSE code is incorrectly written; I would be surprised if it works. You need to use load/store operations on non-SSE types, and that includes aligned non-SSE types like your aligned float array (you need to do this even if you have a dynamic array of an SSE type). Keep in mind that when you work with SSE, the SSE data types are supposed to represent data in the SSE registers, while everything else usually lives in system memory or in non-SSE registers. You therefore need to load from memory into a register and store from a register back to memory. This is how your function should look:

    #include <xmmintrin.h>                // SSE intrinsics

    // All three arrays must be 16-byte aligned (required by _mm_load_ps and
    // _mm_store_ps), and nSize must be a multiple of 4.
    void myssefunction
    (
        float* pArray1,                   // [in] first source array
        float* pArray2,                   // [in] second source array
        float* pResult,                   // [out] result array
        int nSize                         // [in] size of all arrays
    )
    {
        const __m128 m0_5 = _mm_set_ps1(0.5f);             // m0_5[0, 1, 2, 3] = 0.5
        for (int index = 0; index < nSize; index += 4)     // int index matches the type of nSize
        {
            __m128 pSrc1 = _mm_load_ps(pArray1 + index);   // load 4 elements from memory into an SSE register
            __m128 pSrc2 = _mm_load_ps(pArray2 + index);   // load 4 elements from memory into an SSE register

            __m128 m1    = _mm_mul_ps(pSrc1, pSrc1);       // m1 = pSrc1 * pSrc1
            __m128 m2    = _mm_mul_ps(pSrc2, pSrc2);       // m2 = pSrc2 * pSrc2
            __m128 m3    = _mm_add_ps(m1, m2);             // m3 = m1 + m2
            __m128 m4    = _mm_sqrt_ps(m3);                // m4 = sqrt(m3)
            __m128 pDest = _mm_add_ps(m4, m0_5);           // pDest = m4 + 0.5

            _mm_store_ps(pResult + index, pDest);          // store 4 elements from the SSE register to memory
        }
    }
    

    It is also worth noting that there is a limit on how many SSE registers can be in use at a given time (8 XMM registers in 32-bit mode, 16 in 64-bit mode). You can write code that tries to use more than that, but the compiler will then have to spill registers to memory.
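
    The function above relies on 16-byte-aligned arrays and on a size that is a multiple of four. If either cannot be guaranteed, one possible variant (a sketch, not part of the original answer) uses the unaligned intrinsics _mm_loadu_ps and _mm_storeu_ps and finishes the leftover elements with scalar code. The name myssefunction_unaligned and its parameter names are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical variant: result[i] = sqrt(a[i]^2 + b[i]^2) + 0.5 for any n
// and any alignment of a, b and result.
void myssefunction_unaligned(const float* a, const float* b,
                             float* result, std::size_t n)
{
    assert(n == 0 || (a != nullptr && b != nullptr && result != nullptr));

    const __m128 half = _mm_set1_ps(0.5f);
    std::size_t i = 0;

    // Vector part: process 4 floats per iteration while a full group remains.
    for (; i + 4 <= n; i += 4)
    {
        __m128 va  = _mm_loadu_ps(a + i);   // unaligned load
        __m128 vb  = _mm_loadu_ps(b + i);   // unaligned load
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va), _mm_mul_ps(vb, vb));
        _mm_storeu_ps(result + i, _mm_add_ps(_mm_sqrt_ps(sum), half));
    }

    // Scalar tail: the last 0 to 3 elements.
    for (; i < n; ++i)
    {
        result[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]) + 0.5f;
    }
}
```

    Unaligned loads can be slower than aligned ones on older CPUs, so the aligned version remains preferable when the buffers come from an aligned allocator.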

  • 2020-12-16 05:56

    _aligned_malloc and _aligned_free are Microsoft-isms. Use posix_memalign or memalign on Linux et al. On Mac OS X you can just use malloc, as it always returns 16-byte-aligned memory. For portable SSE code you generally want to implement wrapper functions for aligned memory allocation, e.g.

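    #include <stdlib.h>             // malloc, valloc
    #if defined _WIN32 || defined __linux__
    #include <malloc.h>             // _aligned_malloc (Windows), memalign (Linux)
    #endif

    void * malloc_simd(const size_t size)
    {
    #if defined _WIN32          // Windows (_WIN32 is predefined by the compiler)
        return _aligned_malloc(size, 16);
    #elif defined __linux__     // Linux
        return memalign(16, size);
    #elif defined __MACH__      // Mac OS X (malloc is already 16-byte aligned)
        return malloc(size);
    #else                       // other (use valloc for page-aligned memory)
        return valloc(size);
    #endif
    }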
    

    Implementation of free_simd is left as an exercise for the reader.
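
    For reference, one way the pair could look is sketched below. This is an assumption about what the exercise intends, not the answerer's own code: to keep the example short it swaps memalign, malloc, and valloc for posix_memalign on every non-Windows platform, so that plain free is the matching release call there. On Windows, memory from _aligned_malloc must go to _aligned_free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free
#if defined _WIN32
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Allocate `size` bytes on a 16-byte boundary; returns nullptr on failure.
void* malloc_simd(std::size_t size)
{
#if defined _WIN32
    void* p = _aligned_malloc(size, 16);
#else
    void* p = nullptr;
    // posix_memalign returns 0 on success; treat anything else as failure.
    if (posix_memalign(&p, 16, size) != 0)
    {
        p = nullptr;
    }
#endif
    assert(p == nullptr || reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    return p;
}

// Release memory obtained from malloc_simd. Passing nullptr is a no-op.
void free_simd(void* p)
{
#if defined _WIN32
    _aligned_free(p);   // must pair with _aligned_malloc
#else
    free(p);            // posix_memalign memory is released with free
#endif
}
```

    The important design point is that the two functions are kept together, so the release call always matches the allocator chosen by the same preprocessor branch.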
