SSE

Is __int128_t arithmetic emulated by GCC, even with SSE?

廉价感情 · Submitted on 2019-11-28 12:13:29
I've heard that the 128-bit integer data types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not by modern versions of GCC. The reason I'm asking is that I'm wondering whether it makes sense to expect big differences in __int128_t performance between different versions …
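For what it's worth, GCC lowers __int128 to pairs of 64-bit general-purpose-register operations (ADD/ADC, and MULQ-based sequences for multiplication) rather than to SSE: the XMM registers are 128 bits wide, but SSE has no 128-bit-wide carry propagation or full-width multiply. A minimal probe you can compile with gcc -O2 -S to check (the commented assembly is what x86-64 GCC typically emits, not guaranteed output):

    /* add128.c: check how GCC lowers __int128 addition. */
    __int128 add128(__int128 a, __int128 b) {
        /* Typically two GPR instructions, no SSE:
           add  rdi, rdx      ; low 64-bit halves
           adc  rsi, rcx      ; high halves plus carry */
        return a + b;
    }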

efficient way to convert scatter indices into gather indices?

非 Y 不嫁゛ · Submitted on 2019-11-28 12:13:04
I'm trying to write a stream compaction (take an array and get rid of empty elements) with SIMD intrinsics. Each iteration of the loop processes 8 elements at a time (the SIMD width). With SSE intrinsics, I can do this fairly efficiently with _mm_shuffle_epi8(), which does a 16-entry table lookup (a gather, in parallel-computing terminology). The shuffle indices are precomputed and looked up with a bit mask:

    for (i = 0; i < n; i += 8) {
        v8n_Data = _mm_load_si128(&data[i]);
        // is_valid is a byte array
        mask = _mm_movemask_epi8(_mm_load_si128((__m128i*)&is_valid[i])) & 0xff;
        v8n_Compacted = _mm_shuffle_epi8(v16n_ShuffleIndices …
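For reference, here is a self-contained sketch of that lookup-table approach for the simpler 4-lane case of 32-bit elements (the table layout, the helper names, and the use of __builtin_popcount are illustrative assumptions, not the asker's actual code; the same idea scales to the 8-lane case with a 256-entry table):

    #include <stdint.h>
    #include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

    // For each 4-bit validity mask, byte-shuffle indices that pack the
    // valid 32-bit lanes to the front; 0x80 zeroes the leftover lanes.
    static uint8_t lut[16][16];

    static void build_lut(void) {
        for (int mask = 0; mask < 16; ++mask) {
            int out = 0;
            for (int lane = 0; lane < 4; ++lane)
                if (mask & (1 << lane)) {
                    for (int b = 0; b < 4; ++b)
                        lut[mask][out * 4 + b] = (uint8_t)(lane * 4 + b);
                    ++out;
                }
            for (int b = out * 4; b < 16; ++b)
                lut[mask][b] = 0x80;
        }
    }

    // Compact one vector of four ints; bit i of mask means lane i is valid.
    // *count receives the number of surviving lanes.
    static __m128i compact4(__m128i v, int mask, int *count) {
        __m128i shuf = _mm_loadu_si128((const __m128i *)lut[mask]);
        *count = __builtin_popcount(mask);  // GCC/Clang builtin
        return _mm_shuffle_epi8(v, shuf);
    }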

Checking if TWO SSE registers are not both zero without destroying them

…衆ロ難τιáo~ · Submitted on 2019-11-28 11:45:36
I want to test whether two SSE registers are not both zero, without destroying them. This is the code I currently have:

    uint8_t *src; // assume it is initialized and 16-byte aligned
    __m128i xmm0, xmm1, xmm2;

    xmm0 = _mm_load_si128((__m128i const*)&src[i]);     // need to preserve xmm0 & xmm1
    xmm1 = _mm_load_si128((__m128i const*)&src[i+16]);
    xmm2 = _mm_or_si128(xmm0, xmm1);
    if (!_mm_testz_si128(xmm2, xmm2)) {
        // both are not zero
    }

Is this the best way (using up to SSE 4.2)?

I learned something useful from this question. Let's first look at some scalar code:

    extern int foo2(int x, int y);
    void foo(int x, …
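A note on the premise: at the intrinsics level nothing is destroyed anyway, since _mm_or_si128 writes its result to a third register and leaves both inputs intact. A minimal sketch of the OR+PTEST idiom as a reusable helper (SSE4.1; the function name is illustrative):

    #include <smmintrin.h>  // SSE4.1: _mm_testz_si128

    // Returns nonzero iff a and b are not both zero; a and b are unmodified
    // because the OR result lives in a scratch register.
    static int not_both_zero(__m128i a, __m128i b) {
        __m128i t = _mm_or_si128(a, b);
        return !_mm_testz_si128(t, t);  // PTEST sets ZF when t AND t == 0
    }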

Shifting SSE/AVX registers 32 bits left and right while shifting in zeros

一世执手 · Submitted on 2019-11-28 11:19:47
I want to shift SSE/AVX registers by multiples of 32 bits left or right while shifting in zeros. Let me be more precise about the shifts I'm interested in. For SSE I want to do the following shifts of four 32-bit floats:

    shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3]
    shift2_SSE: [1, 2, 3, 4] -> [0, 0, 1, 2]

For AVX I want to do the following shifts:

    shift1_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 1, 2, 3, 4, 5, 6, 7]
    shift2_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 1, 2, 3, 4, 5, 6]
    shift3_AVX: [1, 2, 3, 4, 5, 6, 7, 8] -> [0, 0, 0, 0, 1, 2, 3, 4]

For SSE I have come up with the following code: shift1_SSE = …
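For the SSE shifts a single PSLLDQ byte-shift suffices, since it moves whole-register bytes toward higher lanes and shifts in zeros; the 256-bit case is harder because most AVX shuffles operate within each 128-bit lane separately. A sketch of shift1 for both widths (the AVX version assumes AVX2 for the cross-lane permute; function names are illustrative):

    #include <immintrin.h>

    // shift1_SSE: [1, 2, 3, 4] -> [0, 1, 2, 3]
    static __m128 shift1_sse(__m128 x) {
        return _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4));
    }

    // shift1_AVX: [1..8] -> [0, 1..7], assuming AVX2.
    static __m256 shift1_avx2(__m256 x) {
        __m256i xi = _mm256_castps_si256(x);
        // Copy the low 128-bit half into the high half, zero the low half.
        __m256i carry = _mm256_permute2x128_si256(xi, xi, 0x08);
        // Per-lane concatenate-and-shift stitches the two halves together.
        return _mm256_castsi256_ps(_mm256_alignr_epi8(xi, carry, 12));
    }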

Most efficient way to get a __m256 of horizontal sums of 8 source __m256 vectors

前提是你 · Submitted on 2019-11-28 11:14:57
I know how to sum one __m256 to get a single summed value. However, I have 8 vectors:

    Input
    1: a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]
    .....
    8: h[0], h[1], h[2], h[3], h[4], h[5], h[6], h[7]

    Output
    a[0]+a[1]+a[2]+a[3]+a[4]+a[5]+a[6]+a[7], ...., h[0]+h[1]+h[2]+h[3]+h[4]+h[5]+h[6]+h[7]

My method is below; I'm curious whether there is a better way:

    __m256 sumab = _mm256_hadd_ps(accumulator1, accumulator2);
    __m256 sumcd = _mm256_hadd_ps(accumulator3, accumulator4);
    __m256 sumef = _mm256_hadd_ps…
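For comparison, here is that hadd tree carried to completion (a sketch; parameter names are illustrative). Three levels of _mm256_hadd_ps leave the eight totals split across the two 128-bit lanes, and one pair of cross-lane permutes plus a final add puts them in order:

    #include <immintrin.h>

    // Sums eight __m256 vectors into one __m256 holding
    // [sum(a), sum(b), ..., sum(h)]. AVX1 only.
    static __m256 hsum8(__m256 a, __m256 b, __m256 c, __m256 d,
                        __m256 e, __m256 f, __m256 g, __m256 h) {
        __m256 sumab = _mm256_hadd_ps(a, b);
        __m256 sumcd = _mm256_hadd_ps(c, d);
        __m256 sumef = _mm256_hadd_ps(e, f);
        __m256 sumgh = _mm256_hadd_ps(g, h);
        __m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);
        __m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);
        // Low lanes hold sums of the low halves, high lanes the high halves;
        // gather the matching lanes of both vectors and add.
        __m256 lo = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x20);
        __m256 hi = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x31);
        return _mm256_add_ps(lo, hi);
    }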

Efficient SSE NxN matrix multiplication

送分小仙女□ · Submitted on 2019-11-28 10:35:15
I'm trying to implement an SSE version of large matrix-by-matrix multiplication. I'm looking for an efficient algorithm based on SIMD implementations. My desired operation looks like:

    A(n x m) * B(m x k) = C(n x k)

All matrices are considered to be 16-byte-aligned float arrays. I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it as efficient as possible and I don't want to use the Eigen library or similar libraries (only SSE3, to be more specific). So I'd appreciate it if anyone can help me find some articles or resources on how to start …
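As a starting point, the standard SSE inner kernel broadcasts one element of A and does a SAXPY-style update over a row of B; competitive performance then comes from cache blocking and register tiling wrapped around this. A minimal sketch under stated assumptions (row-major storage, n a multiple of 4, 16-byte-aligned pointers; only SSE1 intrinsics, so it fits within the SSE3 constraint):

    #include <xmmintrin.h>  // SSE1

    // C = A * B for n x n row-major matrices.
    void matmul(const float *A, const float *B, float *C, int n) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; j += 4)
                _mm_store_ps(&C[i*n + j], _mm_setzero_ps());
            for (int k = 0; k < n; ++k) {
                __m128 a = _mm_set1_ps(A[i*n + k]);   // broadcast A[i][k]
                for (int j = 0; j < n; j += 4) {
                    __m128 b = _mm_load_ps(&B[k*n + j]);
                    __m128 c = _mm_load_ps(&C[i*n + j]);
                    _mm_store_ps(&C[i*n + j],
                                 _mm_add_ps(c, _mm_mul_ps(a, b)));
                }
            }
        }
    }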

developing for new instruction sets

谁说我不能喝 · Submitted on 2019-11-28 10:27:19
Intel is set to release a new instruction set called AVX, which includes an extension of SSE to 256-bit operation. That is, either 4 double-precision elements or 8 single-precision elements. How would one go about developing code for AVX, considering there's no hardware out there that supports it yet? More generally, how can developers write code for hardware that doesn't exist, for instance if they want to have software ready when the supporting CPU is released?

Maybe I'm missing something about your question, but it seems the answer is on the website that you linked: use the Intel Compiler …
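One concrete pattern for shipping code before the hardware arrives is runtime dispatch: build the AVX path now and select it only when CPUID reports support, exercising the AVX branch under an emulator such as Intel SDE in the meantime. A sketch using GCC facilities (the target attribute and __builtin_cpu_supports exist in modern GCC; function names are illustrative):

    #include <immintrin.h>

    __attribute__((target("avx")))
    static void add8_avx(const float *a, const float *b, float *out) {
        _mm256_storeu_ps(out,
            _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
    }

    static void add8_scalar(const float *a, const float *b, float *out) {
        for (int i = 0; i < 8; ++i) out[i] = a[i] + b[i];
    }

    // Dispatches on every call for clarity; real code would cache the choice.
    void add8(const float *a, const float *b, float *out) {
        if (__builtin_cpu_supports("avx"))
            add8_avx(a, b, out);
        else
            add8_scalar(a, b, out);
    }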

SSE _mm_movemask_epi8 equivalent method for ARM NEON

孤人 · Submitted on 2019-11-28 10:24:17
I decided to continue Fast corners optimisation and got stuck at the _mm_movemask_epi8 SSE instruction. How can I rewrite it for ARM NEON with a uint8x16_t input?

I know this post is quite outdated, but I found it useful to give my (validated) solution. It assumes all-ones/all-zeroes in every lane of the input argument.

    const uint8_t __attribute__ ((aligned (16))) _Powers[16] =
        { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };

    // Set the powers of 2 (do it once for all, if applicable)
    uint8x16_t Powers = vld1q_u8(_Powers);

    // Compute the mask from the input
    uint64x2_t Mask = vpaddlq_u32…
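Completing that answer into a self-contained helper (a sketch: it assumes little-endian ARM and, as the answer says, that every lane of the input is all-ones or all-zeros, e.g. the result of a vceqq comparison):

    #include <stdint.h>
    #include <arm_neon.h>

    static uint16_t neon_movemask_u8(uint8x16_t input) {
        static const uint8_t powers_bytes[16] =
            { 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 };
        uint8x16_t powers = vld1q_u8(powers_bytes);
        uint8x16_t bits = vandq_u8(input, powers);  // one power-of-2 bit per lane
        // Widening pairwise adds collapse each 8-byte half into one sum <= 255.
        uint16x8_t s16 = vpaddlq_u8(bits);
        uint32x4_t s32 = vpaddlq_u16(s16);
        uint64x2_t s64 = vpaddlq_u32(s32);
        uint8_t lo = vgetq_lane_u8(vreinterpretq_u8_u64(s64), 0);
        uint8_t hi = vgetq_lane_u8(vreinterpretq_u8_u64(s64), 8);
        return (uint16_t)(lo | (hi << 8));
    }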

SSE and C++ containers

百般思念 · Submitted on 2019-11-28 10:04:43
Is there an obvious reason why the following code segfaults?

    #include <vector>
    #include <emmintrin.h>

    struct point {
        __m128i v;
        point() { v = _mm_setr_epi32(0, 0, 0, 0); }
    };

    int main(int argc, char *argv[]) {
        std::vector<point> a(3);
    }

Thanks. Edit: I'm using g++ 4.5.0 on linux/i686. I might not know what I'm doing here, but since even the following segfaults:

    int main(int argc, char *argv[]) {
        point *p = new point();
    }

I really think it must be an alignment issue.

Ben Voigt: The obvious thing that could have gone wrong would be if v wasn't aligned properly. But it's allocated dynamically by …
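The diagnosis is right: __m128i requires 16-byte alignment, but before C++17 neither operator new nor std::allocator respects alignment stricter than malloc's (8 bytes on 32-bit Linux), so the aligned store in the constructor can fault. A sketch of one workaround with _mm_malloc plus placement new (compiling as C++17, where over-aligned new is supported, is the modern fix; the vector case additionally needs an aligned allocator):

    #include <new>          // placement new
    #include <emmintrin.h>  // _mm_setr_epi32; pulls in _mm_malloc/_mm_free

    struct point {
        __m128i v;
        point() { v = _mm_setr_epi32(0, 0, 0, 0); }
    };

    int main() {
        point *p = static_cast<point *>(_mm_malloc(sizeof(point), 16));
        new (p) point();   // construct at a guaranteed 16-byte-aligned address
        p->~point();
        _mm_free(p);
    }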

Does Java strictfp modifier have any effect on modern CPUs?

我怕爱的太早我们不能终老 · Submitted on 2019-11-28 09:59:22
I know the meaning of the strictfp modifier on methods (and on classes), according to the JLS:

JLS 8.4.3.5, strictfp methods: The effect of the strictfp modifier is to make all float or double expressions within the method body be explicitly FP-strict (§15.4).

JLS 15.4, FP-strict expressions: Within an FP-strict expression, all intermediate values must be elements of the float value set or the double value set, implying that the results of all FP-strict expressions must be those predicted by IEEE 754 arithmetic on operands represented using single and double formats. Within an expression that …