sse2 | 易学教程

Convert _mm_shuffle_epi32 to C expression for the permutation?

I'm working on a port of SSE2 code to NEON. The port is at an early stage and is producing incorrect results. Part of the reason for the incorrect results is _mm_shuffle_epi32 and the NEON instructions I selected to replace it. Microsoft's documentation for _mm_shuffle_epi32 is on the lean side. The Intel documentation is better, but it's not clear to me what some of the pseudo-code is doing.

    SELECT4(src, control) {
        CASE(control[1:0])
        0: tmp[31:0] := src[31:0]
        1: tmp[31:0] := src[63:32]
        2: tmp[31:0] := src[95:64]
        3: tmp[31:0] := src[127:96]
        ESAC
        RETURN tmp[31:0]
    }
    dst[31:0] := SELECT4(a[127:0], imm8[1:0])
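The SELECT4 pseudo-code can be read as plain array indexing. The sketch below is a scalar C model of that reading, not code from the question: the names `select4` and `shuffle_epi32_model` are hypothetical, and it assumes lane 0 holds bits 31:0 of the register and that each successive 2-bit field of imm8 selects the source lane for the next destination lane, as Intel's documentation describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scalar model of the SELECT4 pseudo-code.
   Picks one of the four 32-bit lanes of src using the low 2 bits of control. */
static uint32_t select4(const uint32_t src[4], unsigned control)
{
    return src[control & 3u];
}

/* Hypothetical helper: scalar model of _mm_shuffle_epi32(a, imm8).
   Lane 0 is bits 31:0 of the vector, lane 3 is bits 127:96. */
static void shuffle_epi32_model(uint32_t dst[4], const uint32_t a[4], unsigned imm8)
{
    assert(imm8 <= 0xFFu);   /* the immediate is 8 bits wide */
    uint32_t tmp[4];         /* temporary, so dst may alias a */
    for (int i = 0; i < 4; i++)
        tmp[i] = select4(a, imm8 >> (2 * i));   /* field i selects the source lane */
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

Under this model, destination lane i is simply `a[(imm8 >> (2 * i)) & 3]`, which can serve as a reference when checking a NEON replacement lane by lane.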

SSE multiplication of 4 32-bit integers

How can I multiply four 32-bit integers by another four integers? I didn't find any instruction that can do it.

If you need signed 32x32-bit integer multiplication, then the following example at software.intel.com looks like it should do what you want:

    static inline __m128i muly(const __m128i &a, const __m128i &b)
    {
        __m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/
        __m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE (0,0,2,0))); /* shuffle results to [63..0]
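To make the data flow of the excerpt above easier to follow, here is a scalar C model of what each step appears to compute. It is a sketch under assumptions: the function name `mullo_epi32_model` is hypothetical, lanes are numbered from the least significant end, and the result keeps only the low 32 bits of each product (which are the same for signed and unsigned operands).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scalar model of the muly() data flow. */
static void mullo_epi32_model(uint32_t dst[4], const uint32_t a[4], const uint32_t b[4])
{
    assert(dst != NULL && a != NULL && b != NULL);

    /* _mm_mul_epu32(a, b): lanes 0 and 2 are multiplied into two 64-bit products. */
    uint64_t p0 = (uint64_t)a[0] * b[0];
    uint64_t p2 = (uint64_t)a[2] * b[2];

    /* After _mm_srli_si128(x, 4), lanes 1 and 3 sit in positions 0 and 2,
       so the second _mm_mul_epu32 multiplies the odd lanes. */
    uint64_t p1 = (uint64_t)a[1] * b[1];
    uint64_t p3 = (uint64_t)a[3] * b[3];

    /* _MM_SHUFFLE(0,0,2,0) gathers the low 32 bits of each product into lanes 0 and 1;
       _mm_unpacklo_epi32 then interleaves the even and odd results. */
    dst[0] = (uint32_t)p0;
    dst[1] = (uint32_t)p1;
    dst[2] = (uint32_t)p2;
    dst[3] = (uint32_t)p3;
}
```

Comparing an intrinsic implementation against a model like this on a handful of inputs, including negative values reinterpreted as unsigned, is one way to check the lane ordering.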

Valgrind and Java

Question: I want to use Valgrind 3.7.0 to find memory leaks in my Java native code. I'm using jdk1.6.0_29. To do that, I have to set the --trace-children=yes flag. With that flag set, I can no longer run Valgrind on any Java application. Even a command like

    valgrind --trace-children=yes --smc-check=all java -version

fails with the error message:

    Error occurred during initialization of VM
    Unknown x64 processor: SSE2 not supported

I've seen this link: https://bugs.kde.org/show_bug.cgi?id=249943, but it

SSE instructions to add all elements of an array [duplicate]

Question: (Marked as a duplicate of "Sum reduction of unsigned bytes without overflow, using SSE2 on Intel".) I am new to SSE2 instructions. I have found the instruction _mm_add_epi8, which can add the elements of two arrays. But I want an SSE instruction that can add all the elements of one array. I was trying to develop this concept using this code:

    #include <iostream>
    #include <conio.h>
    #include <emmintrin.h>

    void sse(unsigned char* a,unsigned char* b);

    void main() { /

Sum reduction of unsigned bytes without overflow, using SSE2 on Intel

I am trying to compute the sum reduction of 32 elements (each 1 byte of data) on an Intel i3 processor. I did this:

    s = 0;
    for (i = 0; i < 32; i++) {
        s = s + a[i];
    }

However, it takes too long, since mine is a real-time application with a much tighter time budget. Please note that the final sum could be more than 255. Is there a way I can implement this using low-level SIMD SSE2 instructions? Unfortunately, I have never used SSE. I tried searching for an SSE2 function for this purpose, but could not find one. Is SSE guaranteed to reduce the computation time for such a small problem? Any
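One commonly used way to sum unsigned bytes without overflow on SSE2 is `_mm_sad_epu8` against a zero vector, which yields the sum of each group of 8 bytes as a 64-bit value. The sketch below illustrates that approach; it is not taken from the truncated excerpt. The function name `sum_bytes` is hypothetical, the length is assumed to be a multiple of 16, and a scalar fallback is used when the compiler does not define `__SSE2__`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: sums n unsigned bytes, where n is a multiple of 16.
   Returns the low 32 bits of the total, which is exact for inputs this small. */
static uint32_t sum_bytes(const uint8_t *a, size_t n)
{
    assert(a != NULL && n % 16 == 0);
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        /* Sum of absolute differences against zero: each 64-bit half
           receives the sum of its 8 bytes, so nothing wraps at 255. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    /* Fold the upper 64-bit half onto the lower one. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (uint32_t)_mm_cvtsi128_si32(acc);
#else
    /* Scalar fallback for targets without SSE2. */
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
#endif
}
```

For the 32-element case in the question this would be called as `sum_bytes(a, 32)`. Whether it is actually faster than the scalar loop for so few elements is something to measure rather than assume.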

Emulating shifts on 32 bytes with AVX

I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics. Much to my disappointment, I discovered that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate on the two halves of the AVX register separately, so zeroes are introduced in between. (This is in contrast with _mm_slli_si128 and _mm_srli_si128, which handle whole SSE registers.) Can you recommend a short substitute?

UPDATE: _mm256_slli_si256 is efficiently achieved with

    _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N)

or

    _mm256_slli_si256(_mm256
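The difference described above can be made concrete with a scalar C model. This is only an illustration under assumptions: both function names are hypothetical, bytes are indexed from the least significant end of the register, and "shift left by n bytes" means that result byte i takes source byte i - n.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: models the per-lane behavior the question describes for
   _mm256_slli_si256. Each 16-byte half is shifted on its own, so zeroes
   appear at the bottom of both halves. */
static void slli_per_lane_model(uint8_t dst[32], const uint8_t src[32], unsigned n)
{
    assert(n <= 16);
    uint8_t tmp[32] = {0};   /* temporary, so dst may alias src */
    for (unsigned lane = 0; lane < 2; lane++)
        for (unsigned i = n; i < 16; i++)
            tmp[16 * lane + i] = src[16 * lane + i - n];
    memcpy(dst, tmp, 32);
}

/* Hypothetical helper: models the shift the question actually wants, where
   bytes leaving the low half carry into the high half. */
static void slli_full_model(uint8_t dst[32], const uint8_t src[32], unsigned n)
{
    assert(n <= 32);
    uint8_t tmp[32] = {0};
    for (unsigned i = n; i < 32; i++)
        tmp[i] = src[i - n];
    memcpy(dst, tmp, 32);
}
```

The two models differ only in bytes 16 through 16 + n - 1, which is the gap a substitute has to fill from the top of the low half.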

Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?

Question: I want to calculate y = ax + b, where x and y are pixel values (i.e., bytes in the range 0 to 255), while a and b are floats. I need to apply this formula to every pixel in an image, and a and b are different for each pixel. Direct calculation in C++ is slow, so I am interested in the SSE2 instructions available from C++. After searching, I found that float multiplication and addition with SSE2 are done with _mm_mul_ps and _mm_add_ps. But in the first place I need to
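A possible shape for this computation is sketched below; it is an illustration, not the answer from the truncated excerpt. The names `scale_pixel` and `scale_pixels` are hypothetical. The sketch assumes finite a and b, clamps results to 0..255, and rounds half up; it widens four bytes to floats with unpack instructions, applies `_mm_mul_ps` and `_mm_add_ps`, and packs back to bytes, falling back to scalar code when `__SSE2__` is not defined and for leftover pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference for one pixel: clamp a*x + b to [0, 255], then round half up. */
static uint8_t scale_pixel(uint8_t x, float a, float b)
{
    float y = a * (float)x + b;
    if (y < 0.0f) y = 0.0f;
    if (y > 255.0f) y = 255.0f;
    return (uint8_t)(y + 0.5f);
}

/* Hypothetical helper: y[i] = a[i] * x[i] + b[i] for n pixels. */
static void scale_pixels(uint8_t *y, const uint8_t *x,
                         const float *a, const float *b, size_t n)
{
    assert(y != NULL && x != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        int32_t packed;
        memcpy(&packed, x + i, 4);                    /* load 4 pixels */
        __m128i v = _mm_cvtsi32_si128(packed);
        v = _mm_unpacklo_epi8(v, zero);               /* 8-bit  -> 16-bit */
        v = _mm_unpacklo_epi16(v, zero);              /* 16-bit -> 32-bit */
        __m128 f = _mm_cvtepi32_ps(v);                /* int32  -> float  */
        f = _mm_add_ps(_mm_mul_ps(_mm_loadu_ps(a + i), f), _mm_loadu_ps(b + i));
        f = _mm_min_ps(_mm_max_ps(f, _mm_setzero_ps()), _mm_set1_ps(255.0f));
        v = _mm_cvttps_epi32(_mm_add_ps(f, _mm_set1_ps(0.5f)));  /* round half up */
        v = _mm_packs_epi32(v, v);                    /* 32-bit -> 16-bit */
        v = _mm_packus_epi16(v, v);                   /* 16-bit -> 8-bit  */
        packed = _mm_cvtsi128_si32(v);
        memcpy(y + i, &packed, 4);                    /* store 4 pixels */
    }
#endif
    for (; i < n; i++)                                /* tail, or everything without SSE2 */
        y[i] = scale_pixel(x[i], a[i], b[i]);
}
```

Processing four pixels per iteration keeps the sketch short; a tuned version would likely handle 16 pixels per load, at the cost of more unpack and pack steps.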

Is it possible to use SSE and SSE2 to make a 128-bit wide integer?

I'm looking to understand SSE2's capabilities a little more, and would like to know whether one could make a 128-bit wide integer that supports addition, subtraction, XOR and multiplication.

Answer (phuclv): SSE2 has no carry flag, but you can easily calculate the carry as carry = sum < a or carry = sum < b. Worse yet, SSE2 doesn't have 64-bit comparisons either, so you must use a workaround. Here is some untested, unoptimized C code based on the idea above.

    inline bool lessthan(__m128i a, __m128i b){
        a = _mm_xor_si128(a, _mm_set1_epi32(0x80000000));
        b = _mm_xor_si128(b, _mm
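The carry rule quoted above (carry = sum < a) can be shown in plain C with two 64-bit halves. This is a minimal scalar sketch, not the answer's SSE2 code: the type `u128_model` and the functions `add128` and `sub128` are hypothetical names, and the borrow rule for subtraction is the analogous one (borrow when the low half of a is smaller than the low half of b).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: a 128-bit unsigned integer stored as two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_model;

/* r = a + b (mod 2^128). The carry out of the low half is detected with
   the comparison sum < a, since unsigned addition wraps on overflow. */
static void add128(u128_model *r, const u128_model *a, const u128_model *b)
{
    assert(r != NULL && a != NULL && b != NULL);
    uint64_t lo = a->lo + b->lo;
    uint64_t carry = lo < a->lo;
    r->hi = a->hi + b->hi + carry;
    r->lo = lo;
}

/* r = a - b (mod 2^128). A borrow is needed when the low half of a is
   smaller than the low half of b. */
static void sub128(u128_model *r, const u128_model *a, const u128_model *b)
{
    assert(r != NULL && a != NULL && b != NULL);
    uint64_t lo = a->lo - b->lo;
    uint64_t borrow = a->lo < b->lo;
    r->hi = a->hi - b->hi - borrow;
    r->lo = lo;
}
```

XOR needs no carry handling and is just the XOR of each half. In an SSE2 version, the comparison is the hard part, which is what the sign-bit flip in the excerpt's lessthan function is working around.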

Extended (80-bit) double floating point in x87, not SSE2 - we don't miss it?

I was reading today about researchers discovering that NVidia's Phys-X libraries use x87 FP vs. SSE2. Obviously this will be suboptimal for parallel datasets where speed trumps precision. However, the article author goes on to quote:

    Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA's C7 has supported SSE2 since 2005. In 64-bit versions of Windows, x87 is deprecated for user-mode, and prohibited entirely in kernel-mode. Pretty much everyone in the industry has recommended
