sse

Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap

与世无争的帅哥 submitted on 2019-12-08 20:22:57
Question: (Editor's note: this question was originally: How should one access the m128i_i8 member, or members in general, of the __m128i object?, trying to use an MSVC-specific method on GCC's definition of __m128i. But this was an XY problem, and the accepted answer addresses the underlying problem stated in the title. Another answer does answer the original question.) I realize that Microsoft suggests against directly accessing the members of these objects, but I need to set them and the documentation is sorely lacking. I continue
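
A minimal sketch of the task named in the title, assuming each bool byte holds 0 or 1 and that 16 bytes are readable at the source pointer. The function name pack_bools16 is hypothetical and this is not necessarily the accepted answer's method. It shifts bit 0 of every byte up to bit 7, then collects the top bits with _mm_movemask_epi8.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Pack the low bit of each of 16 bool bytes into a 16-bit bitmap.
   Bit i of the result is bit 0 of src[i]. */
static inline uint16_t pack_bools16(const uint8_t *src)
{
    /* Unaligned load, so src needs no particular alignment. */
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    /* Shifting each 32-bit lane left by 7 moves bit 0 of every byte to
       bit 7 of that same byte. Bits that cross into a neighbouring byte
       only reach its bits 0..6, so they never disturb bit 7. */
    __m128i hi = _mm_slli_epi32(v, 7);

    /* movemask gathers bit 7 of each of the 16 bytes. */
    return (uint16_t)_mm_movemask_epi8(hi);
}
```

For a longer bool array, the same call can be repeated per 16-byte block and the 16-bit results stored consecutively.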

SSE3 intrinsics: How to find the maximum of a large array of floats

不羁岁月 submitted on 2019-12-08 19:08:56
Question: I have the following code to find the maximum value: int length = 2000; float *data; // data is allocated and initialized float max = 0.0; for(int i = 0; i < length; i++) { if(data[i] > max) { max = data[i]; } } I tried vectorizing it by using SSE3 intrinsics, but I am kind of stuck on how I should do the comparison. int length = 2000; float *data; // data is allocated and initialized float max = 0.0; // for time being just assume that length is always mod 4 for(int i = 0; i < length; i+=4) { _
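
One way to avoid an explicit comparison is _mm_max_ps, which keeps a running per-lane maximum. The sketch below is illustrative only: the name max_float is hypothetical, it assumes n >= 1, it uses unaligned loads, and it needs only SSE rather than SSE3.

```c
#include <xmmintrin.h>  /* SSE */
#include <stddef.h>

/* Return the largest element of data[0..n-1]; assumes n >= 1. */
float max_float(const float *data, size_t n)
{
    size_t i = 0;
    float m = data[0];

    if (n >= 4) {
        /* Running maximum of four lanes at once. */
        __m128 vmax = _mm_loadu_ps(data);
        for (i = 4; i + 4 <= n; i += 4)
            vmax = _mm_max_ps(vmax, _mm_loadu_ps(data + i));

        /* Horizontal reduction: lanes 0,1 become max(0,2), max(1,3),
           then lane 0 becomes the maximum of those two. */
        __m128 t = _mm_max_ps(vmax, _mm_movehl_ps(vmax, vmax));
        t = _mm_max_ss(t, _mm_shuffle_ps(t, t, 1));
        m = _mm_cvtss_f32(t);
    }

    /* Scalar tail for the remaining 0..3 elements (or all of a short array). */
    for (; i < n; ++i)
        if (data[i] > m)
            m = data[i];
    return m;
}
```

Seeding the maximum from the data, rather than from 0.0 as in the question, also gives the right answer for arrays that contain only negative values.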

Is there a good double-precision small matrix SIMD library for x86?

我的未来我决定 submitted on 2019-12-08 15:48:43
Question: I'm looking for a SIMD library focused on small (4x4) matrix operations for graphics. There are lots of single-precision ones out there, but I need to support both single and double precision. I've looked at Intel's IPP MX library, but I'd prefer something with source. I'm very interested in SSE3+ implementations of these particular operations: Mat4 * Mat4; Mat4 * Vec4; Mat4 * Array of Mat4; Mat4 * Array of Vec4; Mat4 inversion (nice to have). EDIT: No "premature optimization" answers please. Anyone

Converting from __m128 to __m128i results in wrong value

ぃ、小莉子 submitted on 2019-12-08 11:37:16
Question: I need to convert a float vector (__m128) to an integer vector (__m128i), and I am using _mm_cvtps_epi32, but I am not getting the expected value. Here is a very simple example: __m128 test = _mm_set1_ps(4.5f); __m128i test_i = _mm_cvtps_epi32(test); The debugger output I get: (lldb) po test ([0] = 4.5, [1] = 4.5, [2] = 4.5, [3] = 4.5) (lldb) po test_i ([0] = 17179869188, [1] = 17179869188) (lldb) As you can see, the resulting integer is... 17179869188? From 4.5? And why are there only two
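
A likely reading of that output is that the debugger displays the __m128i as two 64-bit lanes: 17179869188 equals 4 * 2^32 + 4, which is two adjacent 32-bit values of 4. The sketch below (hypothetical helper name float4_to_int4) stores the result into an int32_t array so the four lanes can be inspected directly. It assumes the default MXCSR rounding mode, round-to-nearest-even, under which 4.5 converts to 4.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Convert four floats to four 32-bit integers using the current
   rounding mode (round-to-nearest-even unless MXCSR was changed). */
void float4_to_int4(const float *in, int32_t *out)
{
    __m128  v  = _mm_loadu_ps(in);
    __m128i vi = _mm_cvtps_epi32(v);
    /* Store as raw 128 bits; reading them back as int32_t shows the
       four converted lanes individually. */
    _mm_storeu_si128((__m128i *)out, vi);
}
```

If truncation toward zero is wanted instead, _mm_cvttps_epi32 performs that conversion.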

SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation

六眼飞鱼酱① submitted on 2019-12-08 07:47:19
Question: I have started optimising my code using SSE. Essentially it is a ray tracer that processes 4 rays at a time by storing the coordinates in __m128 data types x, y, z (the coordinates for the four rays are grouped by axis). However, I have a branched statement that protects against divide by zero, which I can't seem to convert to SSE. In serial this is: const float d = wZ == -1.0f ? 1.0f/(1.0f-wZ) : 1.0f/(1.0f+wZ); Where wZ is the z-coordinate and this calculation needs to be done for all four rays.
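
The usual SSE pattern for such a branch is to compute a per-lane mask with _mm_cmpeq_ps and blend the two candidates with and/andnot/or. The sketch below (hypothetical name safe_recip) selects the denominator before dividing, so no lane ever divides by zero. It is one possible formulation, not necessarily the one the answers propose.

```c
#include <xmmintrin.h>  /* SSE */

/* Per lane: d = (wZ == -1) ? 1/(1 - wZ) : 1/(1 + wZ). */
__m128 safe_recip(__m128 wZ)
{
    const __m128 one = _mm_set1_ps(1.0f);

    /* All-ones in lanes where wZ == -1, all-zeros elsewhere. */
    __m128 mask = _mm_cmpeq_ps(wZ, _mm_set1_ps(-1.0f));

    __m128 minus = _mm_sub_ps(one, wZ);  /* 1 - wZ */
    __m128 plus  = _mm_add_ps(one, wZ);  /* 1 + wZ */

    /* Blend: take 'minus' where the mask is set, 'plus' otherwise. */
    __m128 denom = _mm_or_ps(_mm_and_ps(mask, minus),
                             _mm_andnot_ps(mask, plus));

    return _mm_div_ps(one, denom);
}
```

On SSE4.1 targets the three-instruction blend could be replaced by _mm_blendv_ps.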

Faster approximate reciprocal square root of an array

☆樱花仙子☆ submitted on 2019-12-08 06:56:40
Question: How can I calculate the approximate reciprocal square root of an array faster on a CPU with popcnt and SSE4.2? The input is positive integers (ranging from 0 to about 200,000) stored in an array of floats. The output is an array of floats. Both arrays have correct memory alignment for SSE. The code below only uses one xmm register, runs on Linux, and can be compiled by gcc -O3 code.cpp -lrt -msse4.2 Thank you. #include <iostream> #include <emmintrin.h> #include <time.h> using namespace std; void print
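
For reference, a minimal baseline using the hardware approximation _mm_rsqrt_ps, four floats per iteration. The name rsqrt_array is hypothetical and this is not the question's code, which is cut off above. It assumes nothing about alignment and handles a leftover tail with _mm_rsqrt_ss. The approximation has a relative error of roughly 1.5 * 2^-12, and an input of 0 yields infinity.

```c
#include <xmmintrin.h>  /* SSE */
#include <stddef.h>

/* out[i] ~= 1 / sqrt(in[i]) using the SSE approximate instruction. */
void rsqrt_array(const float *in, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: four approximate reciprocal square roots per step. */
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_rsqrt_ps(_mm_loadu_ps(in + i)));

    /* Tail: same approximation, one element at a time. */
    for (; i < n; ++i)
        out[i] = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(in[i])));
}
```

Unrolling the main loop over several registers, or adding a Newton-Raphson step when more precision is needed, are common follow-ups; whether either helps depends on the target CPU.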

Use load/store correctly

烂漫一生 submitted on 2019-12-08 04:21:52
Question: How do I use load/store to do aligned int16_t byte swapping correctly? void byte_swapping(uint16_t* dest, const uint16_t* src, size_t count) { __m128i _s, _d; for (uint16_t const * end(dest + count); dest != end; dest += 8, src += 8) { _s = _mm_load_si128((__m128i*)src); _d = _mm_or_si128(_mm_slli_epi16(_s, 8), _mm_srli_epi16(_s, 8)); _mm_store_si128((__m128i*) dest, _d); } } Answer 1: Your code will fail when count is not a multiple of 8, or when either src or dest is not 16-byte aligned. Here is a
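
A sketch of one way to address both failure modes the answer names, written independently of the answer's own code (which is cut off above): unaligned loads and stores remove the alignment requirement, and a scalar loop handles the last count % 8 elements.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>
#include <stddef.h>

/* Swap the two bytes of every 16-bit element; works for any count
   and any alignment of src and dest. */
void byte_swapping(uint16_t *dest, const uint16_t *src, size_t count)
{
    size_t i = 0;

    /* Vector part: eight elements per iteration. */
    for (; i + 8 <= count; i += 8) {
        __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d = _mm_or_si128(_mm_slli_epi16(s, 8),
                                 _mm_srli_epi16(s, 8));
        _mm_storeu_si128((__m128i *)(dest + i), d);
    }

    /* Scalar tail: the remaining count % 8 elements. */
    for (; i < count; ++i)
        dest[i] = (uint16_t)((src[i] << 8) | (src[i] >> 8));
}
```

If both pointers are known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 can be kept for the vector part.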

Some Mandelbrot drawing routine from C to SSE2

我们两清 submitted on 2019-12-08 02:47:00
Question: I want to rewrite this simple routine as SSE2 code (preferably in nasm), and I am not totally sure how to do it. Two things are not clear: how to express the calculations (the inner loop and those from the outer loop too), and how to call the C function "SetPixelInDibInt(i ,j, palette[n]);" from statically linked asm code. void DrawMandelbrotD(double ox, double oy, double lx, int N_ITER) { double ly = lx * double(CLIENT_Y)/double(CLIENT_X); double dx = lx / CLIENT_X; double dy = ly / CLIENT_Y; double ax

Passing types containing SSE/AVX values

你。 submitted on 2019-12-08 01:11:51
Question: Let's say I have the following: struct A { __m256 a; }; struct B { __m256 a; float b; }; Which of the following is generally better (if any, and why) in a hard-core loop? void f0(A a) { ... } void f1(A& a) { ... } //and the pointer variation void f2(B b) { ... } void f3(B& b) { ... } //and the pointer variation Answer 1: The answer is that it doesn't matter. According to this: http://msdn.microsoft.com/en-us/library/ms235286.aspx The calling convention states that 16-byte (and probably 32-byte) operands

Why is there no floating point intrinsic for `PSHUFD` instruction?

☆樱花仙子☆ submitted on 2019-12-08 01:03:43
Question: The task I'm facing is to shuffle one __m128 vector and store the result in another one. The way I see it, there are two basic ways to shuffle a packed floating-point __m128 vector: _mm_shuffle_ps, which uses the SHUFPS instruction and is not necessarily the best option if you want the values from one vector only: it takes two values from the destination operand, which implies an extra move. _mm_shuffle_epi32, which uses the PSHUFD instruction that seems to do exactly what is expected here and can
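
For illustration, PSHUFD can be applied to float data by wrapping _mm_shuffle_epi32 in the cast intrinsics, which only reinterpret the bits and generate no instructions. The helper name reverse_ps is hypothetical, and whether this beats SHUFPS on a given CPU is not something the sketch claims.

```c
#include <emmintrin.h>  /* SSE2; also provides _MM_SHUFFLE */

/* Reverse the four float lanes of v using PSHUFD. */
static inline __m128 reverse_ps(__m128 v)
{
    /* Reinterpret as integers, shuffle 32-bit lanes, reinterpret back.
       _MM_SHUFFLE(0, 1, 2, 3) picks source lanes 3, 2, 1, 0 for
       destination lanes 0, 1, 2, 3. */
    __m128i shuffled = _mm_shuffle_epi32(_mm_castps_si128(v),
                                         _MM_SHUFFLE(0, 1, 2, 3));
    return _mm_castsi128_ps(shuffled);
}
```

Because the source and destination are separate operands in PSHUFD, the input vector stays intact without an extra copy.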