sse

Is an __m128i variable zero?

给你一囗甜甜゛ posted on 2019-12-28 21:50:53

Question: How do I test whether a __m128i variable has any nonzero value on processors that support only SSE2 and earlier?

Answer 1: In SSE2 you can do: __m128i zero = _mm_setzero_si128(); if(_mm_movemask_epi8(_mm_cmpeq_epi32(x,zero)) == 0xFFFF) { /* the code... */ } This compares the four ints against zero and returns a mask with one bit per byte, so the bits belonging to each int start at offsets 0, 4, 8 and 12. The mask equals 0xFFFF only when every bit of x is zero, so any other value means at least one bit is set. If you keep the mask, you can work with the finer-grained parts directly.
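The test from the answer can be wrapped in two small helpers. This is a minimal sketch that uses only SSE2 intrinsics; the names is_all_zero and has_nonzero are made up for illustration and do not come from the original answer.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: true when all 128 bits of x are zero.
inline bool is_all_zero(__m128i x) {
    const __m128i zero = _mm_setzero_si128();
    // A 32-bit lane equal to zero becomes all-ones; movemask then gathers
    // the top bit of each of the 16 bytes into a 16-bit integer mask.
    return _mm_movemask_epi8(_mm_cmpeq_epi32(x, zero)) == 0xFFFF;
}

// Hypothetical helper: true when at least one bit of x is set.
inline bool has_nonzero(__m128i x) {
    return !is_all_zero(x);
}
```

Calling has_nonzero(v) answers the question as asked, while is_all_zero(v) matches the condition written in the answer.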

Does Java strictfp modifier have any effect on modern CPUs?

独自空忆成欢 posted on 2019-12-28 14:57:31

Question: I know the meaning of the strictfp modifier on methods (and on classes), according to the JLS: JLS 8.4.3.5, strictfp methods: The effect of the strictfp modifier is to make all float or double expressions within the method body be explicitly FP-strict (§15.4). JLS 15.4 FP-strict expressions: Within an FP-strict expression, all intermediate values must be elements of the float value set or the double value set, implying that the results of all FP-strict expressions must be those predicted by

SSE-copy, AVX-copy and std::copy performance

回眸只為那壹抹淺笑 posted on 2019-12-28 10:07:05

Question: I tried to improve the performance of a copy operation using SSE and AVX:

    #include <immintrin.h>
    const int sz = 1024;
    float *mas = (float *)_mm_malloc(sz*sizeof(float), 16);
    float *tar = (float *)_mm_malloc(sz*sizeof(float), 16);
    float a=0;
    std::generate(mas, mas+sz, [&](){return ++a;});
    const int nn = 1000; // Number of iterations in the tester loops
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3;
    // std::copy testing
    start1 = std::chrono::system_clock::now();
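For orientation, a hand-written SSE copy loop of the kind the question describes might look like the sketch below. It is not the asker's code, which is cut off in this excerpt. The function name sse_copy is hypothetical, and the sketch assumes both buffers are 16-byte aligned and that the element count is a multiple of four.

```cpp
#include <immintrin.h>  // SSE intrinsics (_mm_load_ps, _mm_store_ps)
#include <cstddef>

// Hypothetical helper: copies n floats from src to dst, four per iteration.
// Assumption: src and dst are 16-byte aligned and n is a multiple of 4.
inline void sse_copy(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        // One aligned 128-bit load followed by one aligned 128-bit store.
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
    }
}
```

std::copy over a range of floats is typically lowered to memmove, which is usually vectorized already, so a manual loop like this is not guaranteed to be any faster.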

SSE intrinsic functions reference [closed]

五迷三道 posted on 2019-12-28 02:24:26

Question: Does anyone know of a reference listing the operation of the SSE intrinsic functions for gcc, i.e. the functions in the <*mmintrin.h> header files? Thanks.

Answer 1: As well as all the online PDF documentation already mentioned, there is also a very useful utility which summarizes all the instructions and intrinsics

Fastest way to compute absolute value using SSE

孤街醉人 posted on 2019-12-27 14:00:41

Question: I am aware of 3 methods, but as far as I know, only the first 2 are generally used:

1. Mask off the sign bit using andps or andnotps. Pros: One fast instruction if the mask is already in a register, which makes it perfect for doing this many times in a loop. Cons: The mask may not be in a register or, worse, not even in a cache, causing a very long memory fetch.

2. Subtract the value from zero to negate, and then get the max of the original and negated. Pros: Fixed cost because nothing is needed to
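The first two methods can be sketched with SSE intrinsics as follows. The function names abs_mask and abs_max are hypothetical, and building the sign mask with _mm_set1_ps(-0.0f) is one common choice, not something the question specifies.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Method 1 (hypothetical name): clear the sign bit with andnotps.
inline __m128 abs_mask(__m128 x) {
    // -0.0f has only the sign bit set, so this is the sign mask in each lane.
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    // _mm_andnot_ps(a, b) computes (~a) & b, which keeps everything but the sign.
    return _mm_andnot_ps(sign_mask, x);
}

// Method 2 (hypothetical name): max(x, 0 - x), no constant mask required.
inline __m128 abs_max(__m128 x) {
    const __m128 negated = _mm_sub_ps(_mm_setzero_ps(), x);
    return _mm_max_ps(x, negated);
}
```

Method 1 needs the mask constant to be available, which is the memory-fetch concern the question raises. Method 2 trades that for an extra subtract and a max.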

Using AVX CPU instructions: Poor performance without “/arch:AVX”

梦想的初衷 posted on 2019-12-27 13:05:06

Question: My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX instructions. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX. To use AVX, it is necessary to include this: #include "immintrin.h" and then you can use AVX intrinsic functions like _mm256_mul_ps, _mm256_add_ps, etc. The problem is that by default, VS2010 produces code that runs very slowly and shows the warning: warning C4752: found Intel(R)

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

和自甴很熟 posted on 2019-12-27 10:22:04

Question: I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented out. I've tested Sandy Bridge and Ivy Bridge CPUs and both versions run at the same speed, with or without VZEROUPPER. Now I have a fairly good idea of what VZEROUPPER does and I think it should not matter at all to this code when there are no

SSE: why, technically, is 16-aligned data faster to move?

会有一股神秘感。 posted on 2019-12-25 18:43:16

Question: Is it a bus architecture issue? How is it circumvented in i7? I'm aware of this, I just don't think it answers the real why.

Answer 1: The processor is built to work with data of certain sizes and alignments. When you use data outside of those sizes and alignments, you effectively need to shift it into alignment, crop it, compute on it using the normal instructions, then shift it back into place.

Source: https://stackoverflow.com/questions/24963646/sse-why-technically-is-16-aligned-data-faster-to
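To make the aligned/unaligned distinction concrete, the sketch below checks whether an address is 16-byte aligned and chooses between the aligned load (_mm_load_ps) and the unaligned load (_mm_loadu_ps). The helper names is_aligned16 and load4 are hypothetical. The sketch only illustrates the two load forms and is not a performance recommendation, since the branch may cost more than it saves.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdint>

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: loads four floats starting at p.
// _mm_load_ps requires a 16-byte-aligned address; _mm_loadu_ps accepts any.
inline __m128 load4(const float* p) {
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}
```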