sse | 易学教程

Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap

Question: (Editor's note: this question was originally: How should one access the m128i_i8 member, or members in general, of the __m128i object?, trying to use an MSVC-specific method on GCC's definition of __m128i. But this was an XY problem and the accepted answer is about the XY problem here. Another answer does answer this question.) I realize that Microsoft suggests against directly accessing the members of these objects, but I need to set them and the documentation is sorely lacking. I continue
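One common way to do what the title describes, shown here as a minimal sketch rather than the accepted answer's code, is to shift the low bit of every byte into the sign-bit position and collect the sign bits with _mm_movemask_epi8. The helper name pack_bools16 is hypothetical, and the sketch assumes SSE2 is available.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: pack the low bit of each of 16 bool bytes into a
   16-bit bitmap. Bit i of the result is the low bit of bytes[i]. */
static inline uint16_t pack_bools16(const uint8_t *bytes)
{
    /* Unaligned load, so the caller's array needs no special alignment. */
    __m128i v = _mm_loadu_si128((const __m128i *)bytes);

    /* Shift each 32-bit lane left by 7: bit 0 of every byte lands in bit 7
       of the same byte. Bits that cross into the neighbouring byte only
       reach its bits 0..6, which movemask ignores. */
    v = _mm_slli_epi32(v, 7);

    /* Gather the top bit of each of the 16 bytes into an integer. */
    return (uint16_t)_mm_movemask_epi8(v);
}
```

No member access on the __m128i object is needed with this approach; the vector is only loaded, shifted, and reduced.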

SSE3 intrinsics: How to find the maximum of a large array of floats

Question: I have the following code to find the maximum value: int length = 2000; float *data; // data is allocated and initialized float max = 0.0; for(int i = 0; i < length; i++) { if(data[i] > max) { max = data[i]; } } I tried vectorizing it by using SSE3 intrinsics, but I am kind of stuck on how I should do the comparison. int length = 2000; float *data; // data is allocated and initialized float max = 0.0; // for time being just assume that length is always mod 4 for(int i = 0; i < length; i+=4) { _
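A minimal sketch of one way to vectorize this loop, assuming (as the question does) that length is a non-zero multiple of 4: no explicit comparison is needed because _mm_max_ps takes the element-wise maximum. The function name max_float_array is hypothetical, and unlike the question's code the running maximum starts from the first four elements instead of 0.0, so all-negative input also works.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Hypothetical helper: maximum of `length` floats.
   Assumption: length is a non-zero multiple of 4. */
static float max_float_array(const float *data, size_t length)
{
    /* Start from the first four elements rather than 0.0. */
    __m128 vmax = _mm_loadu_ps(data);
    for (size_t i = 4; i < length; i += 4)
        vmax = _mm_max_ps(vmax, _mm_loadu_ps(data + i));

    /* Horizontal reduction: [a,b,c,d] -> max(a,b,c,d) in element 0. */
    __m128 shuf = _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(2, 3, 0, 1)); /* [b,a,d,c] */
    vmax = _mm_max_ps(vmax, shuf);     /* [max(a,b), max(a,b), max(c,d), max(c,d)] */
    shuf = _mm_movehl_ps(shuf, vmax);  /* low half now holds max(c,d) */
    vmax = _mm_max_ss(vmax, shuf);     /* element 0 = overall maximum */
    return _mm_cvtss_f32(vmax);
}
```

The unaligned loads keep the sketch safe for any pointer; with 16-byte aligned data, _mm_load_ps could be used instead.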

Is there a good double-precision small matrix SIMD library for x86?

Question: I'm looking for a SIMD library focused on small (4x4) matrix operations for graphics. There are lots of single-precision ones out there, but I need to support both single and double precision. I've looked at Intel's IPP MX library, but I'd prefer something with source. I'm very interested in SSE3+ implementations of these particular operations: Mat4 * Mat4, Mat4 * Vec4, Mat4 * Array of Mat4, Mat4 * Array of Vec4, Mat4 inversion (nice to have). EDIT: No "premature optimization" answers please. Anyone

Converting from __m128 to __m128i results in wrong value

Question: I need to convert a float vector (__m128) to an integer vector (__m128i), and I am using _mm_cvtps_epi32, but I am not getting the expected value. Here is a very simple example: __m128 test = _mm_set1_ps(4.5f); __m128i test_i = _mm_cvtps_epi32(test); The debugger output I get: (lldb) po test ([0] = 4.5, [1] = 4.5, [2] = 4.5, [3] = 4.5) (lldb) po test_i ([0] = 17179869188, [1] = 17179869188) (lldb) As you can see, the resulting integer is.. 17179869188? From 4.5? And why are there only two
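The printed value is consistent with a correct conversion viewed as two 64-bit lanes: 17179869188 is 0x0000000400000004, i.e. two adjacent 32-bit integers that each hold 4. The sketch below (the helper name float4_to_int4 is hypothetical) stores the result into an int32_t array so the four lanes can be inspected directly.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: convert four floats to four 32-bit integers and
   store them where they can be read lane by lane. */
static void float4_to_int4(const float in[4], int32_t out[4])
{
    __m128 f = _mm_loadu_ps(in);

    /* Rounds according to the current MXCSR rounding mode; with the default
       round-to-nearest-even, 4.5f becomes 4. */
    __m128i i = _mm_cvtps_epi32(f);

    _mm_storeu_si128((__m128i *)out, i);
}
```

Reading out[0..3] after converting four copies of 4.5f gives 4 in every lane under the default rounding mode; combining two neighbouring lanes into one 64-bit value reproduces the number shown by the debugger.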

SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation

Question: I have started optimising my code using SSE. Essentially it is a ray tracer that processes 4 rays at a time by storing the coordinates in __m128 data types x, y, z (the coordinates for the four rays are grouped by axis). However, I have a branched statement that protects against division by zero, which I can't seem to convert to SSE. In serial this is: const float d = wZ == -1.0f ? 1.0f/( 1.0f-wZ) : 1.0f/(1.0f+wZ); Where wZ is the z-coordinate and this calculation needs to be done for all four rays.
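A minimal sketch of the usual branchless pattern for this kind of statement: build a per-lane mask with _mm_cmpeq_ps, compute both candidate denominators, and blend them with and/andnot/or before dividing once. The function name safe_recip4 is hypothetical; only SSE is assumed (SSE4.1 would allow _mm_blendv_ps instead of the three bitwise operations).

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: per-lane equivalent of
   d = (wZ == -1.0f) ? 1.0f / (1.0f - wZ) : 1.0f / (1.0f + wZ); */
static __m128 safe_recip4(__m128 wZ)
{
    const __m128 one = _mm_set1_ps(1.0f);

    /* All-ones in lanes where wZ == -1.0f, all-zeros elsewhere. */
    __m128 mask = _mm_cmpeq_ps(wZ, _mm_set1_ps(-1.0f));

    __m128 den_eq = _mm_sub_ps(one, wZ);  /* used where wZ == -1 */
    __m128 den_ne = _mm_add_ps(one, wZ);  /* used everywhere else */

    /* Select the denominator per lane: (mask & den_eq) | (~mask & den_ne). */
    __m128 den = _mm_or_ps(_mm_and_ps(mask, den_eq),
                           _mm_andnot_ps(mask, den_ne));

    /* A single division; no lane divides by zero. */
    return _mm_div_ps(one, den);
}
```

Selecting the denominator before dividing, rather than dividing twice and blending the quotients, avoids both the extra division and the division by zero in the discarded lanes.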

Faster approximate reciprocal square root of an array

Question: How to calculate the approximate reciprocal square root of an array faster on a CPU with popcnt and SSE4.2? The input is positive integers (ranging from 0 to about 200,000) stored in an array of floats. The output is an array of floats. Both arrays have correct memory alignment for SSE. The code below only uses 1 xmm register, runs on Linux, and can be compiled by gcc -O3 code.cpp -lrt -msse4.2 Thank you. #include <iostream> #include <emmintrin.h> #include <time.h> using namespace std; void print
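For reference, a minimal sketch of the basic building block, not the question's own code: _mm_rsqrt_ps computes four approximate reciprocal square roots per instruction (relative error at most 1.5 * 2^-12). The function name rsqrt_array is hypothetical. No Newton-Raphson refinement is applied here, partly because the usual refinement step turns a zero input (infinity from rsqrt) into NaN.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Hypothetical helper: out[i] ~= 1 / sqrt(in[i]) for i in [0, n).
   A zero input produces +infinity. */
static void rsqrt_array(const float *in, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: four approximations per iteration. Unaligned loads and
       stores are used so the sketch works for any pointers. */
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_rsqrt_ps(_mm_loadu_ps(in + i)));

    /* Scalar tail for the last n % 4 elements, same approximation. */
    for (; i < n; ++i)
        out[i] = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(in[i])));
}
```

Processing several independent vectors per iteration, instead of reusing one xmm register, is a common next step because it gives the CPU more independent work to overlap.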

Use load/store correctly

Question: How to use load/store to do aligned int16_t byte swapping correctly? void byte_swapping(uint16_t* dest, const uint16_t* src, size_t count) { __m128i _s, _d; for (uint16_t const * end(dest + count); dest != end; dest += 8, src += 8) { _s = _mm_load_si128((__m128i*)src); _d = _mm_or_si128(_mm_slli_epi16(_s, 8), _mm_srli_epi16(_s, 8)); _mm_store_si128((__m128i*) dest, _d); } } Answer 1: Your code will fail when count is not a multiple of 8, or when either src or dest is not 16-byte aligned. Here is a
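The two failure modes named in the answer can be addressed as in the following sketch, which is an illustration under stated assumptions and not the answer's own (truncated) code: unaligned loads and stores remove the 16-byte alignment requirement, and a scalar tail handles the last count % 8 elements. The function name byte_swapping_any is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of byte_swapping that accepts any count and any
   alignment of src and dest. */
static void byte_swapping_any(uint16_t *dest, const uint16_t *src, size_t count)
{
    size_t i = 0;

    /* Vector part: eight 16-bit values per iteration. */
    for (; i + 8 <= count; i += 8) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        /* Swap the two bytes of each 16-bit lane. */
        __m128i d = _mm_or_si128(_mm_slli_epi16(s, 8), _mm_srli_epi16(s, 8));
        _mm_storeu_si128((__m128i *)(dest + i), d);
    }

    /* Scalar tail: the remaining count % 8 values. */
    for (; i < count; ++i)
        dest[i] = (uint16_t)((src[i] << 8) | (src[i] >> 8));
}
```

If both pointers are known to be 16-byte aligned, the aligned _mm_load_si128 and _mm_store_si128 from the question can be kept in the vector loop, with only the tail handling added.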

some mandelbrot drawing routine from c to sse2

Question: I want to rewrite this simple routine in SSE2 code (preferably in NASM), and I am not entirely sure how to do it. Two things are not clear: how to express the calculations (the inner loop and those from the outer loop too), and how to call the C function "SetPixelInDibInt(i ,j, palette[n]);" from statically linked asm code. void DrawMandelbrotD(double ox, double oy, double lx, int N_ITER) { double ly = lx * double(CLIENT_Y)/double(CLIENT_X); double dx = lx / CLIENT_X; double dy = ly / CLIENT_Y; double ax

Passing types containing SSE/AVX values

Question: Let's say I have the following: struct A { __m256 a; } struct B { __m256 a; float b; } Which of the following is generally better (if any, and why) in a hard-core loop? void f0(A a) { ... } void f1(A& a) { ... } //and the pointer variation void f2(B b) { ...} void f3(B& b) { ... } //and the pointer variation Answer 1: The answer is that it doesn't matter. According to this: http://msdn.microsoft.com/en-us/library/ms235286.aspx The calling convention states that 16-byte (and probably 32-byte) operands

Why is there no floating point intrinsic for `PSHUFD` instruction?

Question: The task I'm facing is to shuffle one __m128 vector and store the result in the other one. The way I see it, there are two basic ways to shuffle a packed floating point __m128 vector: _mm_shuffle_ps, which uses the SHUFPS instruction that is not necessarily the best option if you want the values from one vector only: it takes two values from the destination operand, which implies an extra move. _mm_shuffle_epi32, which uses the PSHUFD instruction that seems to do exactly what is expected here and can
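A minimal sketch of how the second option is typically written with intrinsics: the _mm_castps_si128 and _mm_castsi128_ps intrinsics reinterpret the bits without converting values or generating instructions, so _mm_shuffle_epi32 can be applied to float data. The function name reverse_ps and the choice of a reversing shuffle are illustrative assumptions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: reverse the four floats of v using PSHUFD. */
static __m128 reverse_ps(__m128 v)
{
    /* Bit-cast only: no value conversion takes place. */
    __m128i vi = _mm_castps_si128(v);

    /* _MM_SHUFFLE(0, 1, 2, 3): result element 0 takes source element 3,
       element 1 takes 2, element 2 takes 1, element 3 takes 0. */
    vi = _mm_shuffle_epi32(vi, _MM_SHUFFLE(0, 1, 2, 3));

    return _mm_castsi128_ps(vi);
}
```

Whether mixing an integer-domain shuffle into floating-point code costs extra latency depends on the microarchitecture, so this is worth measuring rather than assuming.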