sse

Qt, GCC, SSE and stack alignment

泄露秘密 submitted on 2019-11-28 00:16:59
Question: I'm trying to build a program with GCC that uses both Qt and SSE intrinsics. It seems that when one of my functions is called by Qt, the stack alignment is not preserved. Here's a short example to illustrate what I mean:

    #include <cstdio>
    #include <emmintrin.h>
    #include <QtGui/QApplication.h>
    #include <QtGui/QWidget.h>

    class Widget : public QWidget {
    public:
        void paintEvent(QPaintEvent *) {
            __m128 a;
            printf("a: 0x%08x\n", ((void *) &a));
        }
    };

    int main(int argc, char** argv) { QApplication

how can I use SVML instructions [duplicate]

两盒软妹~` submitted on 2019-11-28 00:16:20
This question already has answers here: "C++ error: '_mm_sin_ps' was not declared in this scope" (3 answers); "Where is Clang's '_mm256_pow_ps' intrinsic?" (1 answer). I am trying to calculate the exponential function using SIMD, and I found this function: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_exp_ps&expand=2136 I have already included "immintrin.h" in my code, and my CPU has the SSE flag. But GCC complains: error: '_mm_exp_pd' was not declared in this scope. How can I check whether SVML instructions are enabled? SVML is a proprietary Intel library that

Why does GCC or Clang not optimise reciprocal to 1 instruction when using fast-math

心已入冬 submitted on 2019-11-27 23:35:43
Does anyone know why GCC/Clang will not optimise function test1 in the code sample below to simply use the RCPPS instruction when the fast-math option is enabled? Is there another compiler flag that would generate this code?

    typedef float float4 __attribute__((vector_size(16)));

    float4 test1(float4 v) {
        return 1.0f / v;
    }

You can see the compiled output here: https://goo.gl/jXsqat Because the precision of RCPPS is a lot lower than float division, an option to enable that optimization would not be appropriate as part of -ffast-math. The x86 target options section of the GCC manual says there in fact

Is it possible to cast floats directly to __m128 if they are 16 byte aligned?

喜你入骨 submitted on 2019-11-27 22:55:27
Is it safe/possible/advisable to cast floats directly to __m128 if they are 16-byte aligned? I noticed that using _mm_load_ps and _mm_store_ps to "wrap" a raw array adds significant overhead. What potential pitfalls should I be aware of? EDIT: There is actually no overhead in using the load and store instructions; I got some numbers mixed up, which is why I got better performance. Even though I was able to do some HORRENDOUS mangling with raw memory addresses in a __m128 instance, when I ran the test it took TWICE AS LONG to complete without the _mm_load_ps instruction, probably falling back to

horizontal sum of 8 packed 32bit floats

你。 submitted on 2019-11-27 22:51:36
Question: If I have 8 packed 32-bit floating point numbers (__m256), what's the fastest way to extract the horizontal sum of all 8 elements? Similarly, how do I obtain the horizontal maximum and minimum? In other words, what's the best implementation for the following C++ functions?

    float sum(__m256 x); ///< returns sum of all 8 elements
    float max(__m256 x); ///< returns the maximum of all 8 elements
    float min(__m256 x); ///< returns the minimum of all 8 elements

Answer 1: Quickly jotted down here (and

SSE reduction of float vector

不想你离开。 submitted on 2019-11-27 22:27:44
How can I get the sum of the elements (a reduction) of a float vector using SSE intrinsics? Simple serial code:

    void sum(float *input, float &result, unsigned int NumElems) {
        result = 0;
        for (auto i = 0u; i < NumElems; ++i)
            result += input[i];
    }

Typically you generate 4 partial sums in your loop and then just sum horizontally across the 4 elements after the loop, e.g.

    #include <cassert>
    #include <cstdint>
    #include <emmintrin.h>

    float vsum(const float *a, int n) {
        float sum;
        __m128 vsum = _mm_set1_ps(0.0f);
        assert((n & 3) == 0);
        assert(((uintptr_t)a & 15) == 0);
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(

A better 8x8 bytes matrix transpose with SSE?

為{幸葍}努か submitted on 2019-11-27 22:14:30
Question: I found this post that explains how to transpose an 8x8 bytes matrix with 24 operations, and a few scrolls later there's the code that implements the transpose. However, this method does not exploit the fact that we can block the 8x8 transpose into four 4x4 transposes, each of which can be done with a single shuffle instruction (this post is the reference). So I came up with this solution:

    __m128i transpose4x4mask = _mm_set_epi8(15, 11, 7, 3, 14, 10, 6, 2, 13, 9, 5, 1, 12, 8, 4, 0);
    __m128i

What's the difference between logical SSE intrinsics?

。_饼干妹妹 submitted on 2019-11-27 21:20:46
Is there any difference between the logical SSE intrinsics for different types? For example, taking the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions: Is there any difference between using one intrinsic or another (with appropriate type casting)? Won't there be hidden costs, like longer execution in some specific situation? These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any idea why Intel is wasting precious opcode space for several

Difference between MOVDQA and MOVAPS x86 instructions?

大憨熊 submitted on 2019-11-27 21:07:31
I'm looking at the Intel datasheet (Intel® 64 and IA-32 Architectures Software Developer's Manual) and I can't find the difference between MOVDQA (Move Aligned Double Quadword) and MOVAPS (Move Aligned Packed Single-Precision). For both instructions the datasheet says: "This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers." The only differences are: "To move a double quadword to or from unaligned memory locations, use the MOVDQU instruction." and "To move

What does ordered / unordered comparison mean?

为君一笑 submitted on 2019-11-27 21:06:47
Question: Looking at the SSE operators

    CMPORDPS - ordered compare packed singles
    CMPUNORDPS - unordered compare packed singles

what do ordered and unordered mean? I looked for equivalent instructions in the x86 instruction set, and it only seems to have unordered (FUCOM).

Answer 1: An ordered comparison checks if neither operand is NaN. Conversely, an unordered comparison checks if either operand is NaN. This page gives some more information: http://csapp.cs.cmu.edu/public/waside/waside-sse.pdf