sse | 易学教程

Is an __m128i variable zero?

Question: How do I test whether an __m128i variable has any nonzero value on SSE2-and-earlier processors?

Answer 1: In SSE2 you can do: __m128i zero = _mm_setzero_si128(); if (_mm_movemask_epi8(_mm_cmpeq_epi32(x, zero)) == 0xFFFF) { /* the code... */ } This compares the four ints against zero and returns one mask bit per byte, so the bits for each int start at offsets 0, 4, 8 and 12. The condition as written holds when all of x is zero; any mask other than 0xFFFF means at least one bit of x is set. If you keep the mask, you can work with the finer-grained parts directly.
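The SSE2 test from the answer can be sketched as a pair of small helpers. The names is_all_zero and has_any_nonzero are hypothetical, chosen here for illustration; only the intrinsics come from the answer.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: true when every bit of x is zero.
inline bool is_all_zero(__m128i x) {
    const __m128i zero = _mm_setzero_si128();
    // Each 32-bit lane becomes all-ones where x equals zero in that lane.
    const __m128i eq = _mm_cmpeq_epi32(x, zero);
    // movemask packs the top bit of each of the 16 bytes into a 16-bit mask.
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

// Hypothetical helper: true when at least one bit of x is set.
inline bool has_any_nonzero(__m128i x) {
    return !is_all_zero(x);
}
```

Splitting the check this way makes the direction of the comparison explicit: the mask equals 0xFFFF only when all four lanes compared equal to zero.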

Does Java strictfp modifier have any effect on modern CPUs?

Question: I know the meaning of the strictfp modifier on methods (and on classes), according to the JLS. JLS 8.4.3.5, strictfp methods: The effect of the strictfp modifier is to make all float or double expressions within the method body be explicitly FP-strict (§15.4). JLS 15.4, FP-strict expressions: Within an FP-strict expression, all intermediate values must be elements of the float value set or the double value set, implying that the results of all FP-strict expressions must be those predicted by

SSE-copy, AVX-copy and std::copy performance

Question: I tried to improve the performance of a copy operation using SSE and AVX:

    #include <immintrin.h>
    const int sz = 1024;
    float *mas = (float *)_mm_malloc(sz*sizeof(float), 16);
    float *tar = (float *)_mm_malloc(sz*sizeof(float), 16);
    float a = 0;
    std::generate(mas, mas+sz, [&](){return ++a;});
    const int nn = 1000; // number of iterations in the tester loops
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3;
    // std::copy testing
    start1 = std::chrono::system_clock::now();
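The excerpt stops before the copy loops themselves. As a rough sketch of the kind of comparison the question sets up, and not the asker's actual code, an SSE copy over 16-byte-aligned buffers next to a std::copy baseline might look like this. The function names sse_copy and std_copy_baseline are hypothetical, and the sketch assumes the element count is a multiple of 4.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <algorithm>    // std::copy
#include <cstddef>      // std::size_t

// Hypothetical helper: copies n floats four at a time.
// Assumes src and dst are 16-byte aligned and n is a multiple of 4.
inline void sse_copy(const float* src, float* dst, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        // Aligned 128-bit load followed by an aligned 128-bit store.
        _mm_store_ps(dst + i, _mm_load_ps(src + i));
    }
}

// Hypothetical baseline: the standard-library copy over the same range.
inline void std_copy_baseline(const float* src, float* dst, std::size_t n) {
    std::copy(src, src + n, dst);
}
```

Both functions produce the same result; which one is faster depends on the compiler, its flags, and the hardware, so any timing has to be measured rather than assumed.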

SSE intrinsic functions reference [closed]

Question: Closed as off-topic; not currently accepting answers. Closed 7 years ago. Does anyone know of a reference listing the operation of the SSE intrinsic functions for gcc, i.e. the functions in the <*mmintrin.h> header files? Thanks.

Answer 1: As well as all the online PDF documentation already mentioned, there is also a very useful utility which summarizes all the instructions and intrinsics

Fastest way to compute absolute value using SSE

Question: I am aware of 3 methods, but as far as I know, only the first 2 are generally used:

1. Mask off the sign bit using andps or andnotps. Pros: One fast instruction if the mask is already in a register, which makes it perfect for doing this many times in a loop. Cons: The mask may not be in a register or, worse, not even in a cache, causing a very long memory fetch.
2. Subtract the value from zero to negate, and then get the max of the original and negated. Pros: Fixed cost because nothing is needed to
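The first two methods in the list can be sketched with SSE intrinsics as follows. The function names abs_mask and abs_max are hypothetical, and building the sign mask from -0.0f is one common choice assumed here, not something the question specifies.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Method 1 (hypothetical name): clear the sign bit with andnotps.
inline __m128 abs_mask(__m128 v) {
    // -0.0f has only the sign bit set in each lane.
    const __m128 sign = _mm_set1_ps(-0.0f);
    // _mm_andnot_ps(a, b) computes (~a) & b, so the sign bit is cleared.
    return _mm_andnot_ps(sign, v);
}

// Method 2 (hypothetical name): negate by subtracting from zero,
// then take the lane-wise maximum of the original and the negation.
inline __m128 abs_max(__m128 v) {
    const __m128 neg = _mm_sub_ps(_mm_setzero_ps(), v);
    return _mm_max_ps(v, neg);
}
```

Method 1 needs a constant in a register or in memory, while method 2 builds everything from a zeroed register, which is the trade-off the question describes.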

Using AVX CPU instructions: Poor performance without “/arch:AVX”

Question: My C++ code uses SSE and now I want to improve it to support AVX when it is available. So I detect when AVX is available and call a function that uses AVX instructions. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX. To use AVX, it is necessary to include this: #include "immintrin.h" and then you can use AVX intrinsic functions like _mm256_mul_ps, _mm256_add_ps, etc. The problem is that by default, VS2010 produces code that works very slowly and shows the warning: warning C4752: found Intel(R)

Why is this SSE code 6 times slower without VZEROUPPER on Skylake?

Question: I've been trying to figure out a performance problem in an application and have finally narrowed it down to a really weird problem. The following piece of code runs 6 times slower on a Skylake CPU (i5-6500) if the VZEROUPPER instruction is commented out. I've tested Sandy Bridge and Ivy Bridge CPUs and both versions run at the same speed, with or without VZEROUPPER. Now I have a fairly good idea of what VZEROUPPER does and I think it should not matter at all to this code when there are no

SSE: why, technically, is 16-aligned data faster to move?

Question: Is it a bus architecture issue? How is it circumvented in i7? I'm aware of this, I just don't think it answers the real why.

Answer 1: The processor is built to work with data of certain sizes and alignments. When you use data outside of those sizes and alignments, you effectively need to shift it into alignment, crop it, compute on it using the normal instructions, then shift it back into place.

Source: https://stackoverflow.com/questions/24963646/sse-why-technically-is-16-aligned-data-faster-to
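At the intrinsic level, the alignment distinction shows up as two different loads: _mm_load_ps requires a 16-byte-aligned address, while _mm_loadu_ps accepts any address. The sketch below only illustrates that distinction; the helper names is_aligned16 and load4 are hypothetical, and branching on every load is not meant as a performance recommendation.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstdint>      // std::uintptr_t

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: loads four floats, using the aligned load
// only when the address allows it and the unaligned load otherwise.
inline __m128 load4(const float* p) {
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}
```

For example, with a buffer declared as alignas(16) float buf[8], load4(buf) takes the aligned path and load4(buf + 1) takes the unaligned one.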