I have the following function:
template <typename T>
void SSE_vectormult(T * A, T * B, int size)
{
__m128d a;
__m128d b;
__m128d c;
do
Your data is not guaranteed to be 16-byte aligned, as required by aligned SSE loads. Either use _mm_loadu_pd:
a = _mm_loadu_pd(A);
...
a = _mm_loadu_pd(A2ptr);
b = _mm_loadu_pd(B2ptr);
or make sure that your data is correctly aligned where possible, e.g. for statics or locals:
alignas(16) double A2[2], B2[2], C[2]; // C++11, or C11 with <stdalign.h>
or without C++11, using compiler-specific language extensions:
__attribute__ ((aligned(16))) double A2[2], B2[2], C[2]; // gcc/clang/ICC/et al
__declspec (align(16)) double A2[2], B2[2], C[2]; // MSVC
You could use #ifdef to #define an ALIGN(x) macro that works on the target compiler.
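A minimal sketch of such a macro, assuming only the two compiler families mentioned above (MSVC and gcc/clang-compatible). The struct name Pair is hypothetical and is there only so the alignment can be checked at compile time:

```cpp
// ALIGN(x) expands to whichever alignment extension the compiler understands.
#if defined(_MSC_VER)
  #define ALIGN(x) __declspec(align(x))
#elif defined(__GNUC__) || defined(__clang__)
  #define ALIGN(x) __attribute__((aligned(x)))
#else
  #error "No known alignment extension for this compiler"
#endif

// Same declaration as in the answer, now written once for all compilers.
ALIGN(16) double A2[2], B2[2], C[2];

// Hypothetical type used only to verify that the macro takes effect.
struct ALIGN(16) Pair {
    double v[2];
};

static_assert(alignof(Pair) == 16, "ALIGN(16) should give 16-byte alignment");
```

With C++11 available, plain alignas(16) does the same job without any macro.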
If you look at http://msdn.microsoft.com/en-us/library/cww3b12t(v=vs.90).aspx you can see that the function _mm_load_pd is declared as:
__m128d _mm_load_pd (double *p);
So in your code A should point to double, but A is declared as T *, where T is a template parameter. Make sure you call SSE_vectormult with T = double, or just remove the template and use double directly.
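For illustration, here is a rough non-template version that takes double pointers directly. The body of the original SSE_vectormult is cut off in the question, so this is not a reconstruction of it: the name multiply_pd, the out parameter and the element-wise product are all assumptions. It uses the unaligned load and store intrinsics so it does not depend on how the caller allocated the arrays:

```cpp
#include <emmintrin.h>  // SSE2: __m128d, _mm_loadu_pd, _mm_mul_pd, _mm_storeu_pd

// Hypothetical element-wise multiply: out[i] = a[i] * b[i].
// The pointers are plain double*, matching what the intrinsics expect.
inline void multiply_pd(const double* a, const double* b, double* out, int size)
{
    int i = 0;
    // Process two doubles per iteration; loadu/storeu accept any address.
    for (; i + 2 <= size; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_mul_pd(va, vb));
    }
    // Scalar tail for an odd element count.
    for (; i < size; ++i) {
        out[i] = a[i] * b[i];
    }
}
```

If you keep the template instead, a static_assert on the type of T would catch a wrong instantiation at compile time.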
Let me try to answer why your code works on Linux and not on Windows. Code compiled in 64-bit mode has the stack aligned to 16 bytes. However, in 32-bit mode the stack is only 4-byte aligned on Windows and is not guaranteed to be 16-byte aligned on Linux.
GCC defaults to 64-bit mode on 64-bit systems, whereas MSVC defaults to 32-bit mode even on 64-bit systems. So my guess is that you did not compile your code in 64-bit mode on Windows. Since _mm_load_pd and _mm_store_pd both need 16-byte aligned addresses, the code crashes.
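One way to confirm this guess is to check the build mode and the actual addresses before they reach the aligned intrinsics. A small sketch, where the helper names is_64bit_build and misalignment16 are made up for this example:

```cpp
#include <cstddef>
#include <cstdint>

// True when the program was built for a 64-bit target (8-byte pointers).
inline bool is_64bit_build()
{
    return sizeof(void*) == 8;
}

// Distance in bytes of p past the previous 16-byte boundary.
// 0 means the address is acceptable for _mm_load_pd / _mm_store_pd.
inline std::size_t misalignment16(const void* p)
{
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % 16);
}
```

Printing misalignment16(A) and misalignment16(B) at the call site in the crashing build should show whether a non-zero value is reaching the aligned load.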
You have at least three different solutions to get your code working on Windows as well.
1. Compile in 64-bit mode.
2. Use unaligned loads and stores (e.g. _mm_storeu_pd).
3. Align the data yourself.
The best solution is the third, since then your code will also work on 32-bit systems and on older systems where unaligned loads/stores are much slower.
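For statics and locals, aligning the data yourself is just the alignas or compiler-specific declaration from the earlier answer. For buffers sized at run time, one possible approach (an assumption, not something the answers spell out) is to over-allocate and let std::align pick a 16-byte aligned address inside the buffer. The helper name aligned_doubles is hypothetical:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical helper: makes `storage` large enough for `count` doubles plus
// padding and returns a 16-byte aligned pointer into it.
// The pointer is valid only while `storage` is alive and not resized.
inline double* aligned_doubles(std::vector<unsigned char>& storage, std::size_t count)
{
    const std::size_t bytes = count * sizeof(double);
    storage.resize(bytes + 16);          // slack for up to 15 bytes of padding
    void* p = storage.data();
    std::size_t space = storage.size();
    // std::align advances p to the next 16-byte boundary that still leaves
    // `bytes` of room, or returns nullptr if that is impossible.
    return static_cast<double*>(std::align(16, bytes, p, space));
}
```

Pointers obtained this way can be passed to _mm_load_pd and _mm_store_pd in both 32-bit and 64-bit builds.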