SSE

Define a `static const` SIMD Variable within a `C` Function

人盡茶涼 submitted on 2019-12-01 21:57:20
Question: I have a function in this form (from Fastest Implementation of Exponential Function Using SSE): __m128 FastExpSse(__m128 x) { static __m128 const a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2) static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411); static __m128 const m87 = _mm_set1_ps(-87); // fast exponential function, x should be in [-87, 87] __m128 mask = _mm_cmpge_ps(x, m87); __m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b); return _mm_and_ps(_mm_castsi128_ps(tmp), mask); }
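
The snag here (assuming the usual diagnostic for this code) is that C, unlike C++, requires a `static` object's initializer to be a constant expression, and `_mm_set1_ps(...)` is not one. A minimal sketch of one common workaround: drop `static`, use plain `const` locals, and let the optimizer hoist the constants.

```c
#include <immintrin.h>

// Sketch only: same math as the question's FastExpSse, but with plain
// const locals so it compiles as C as well as C++.
__m128 FastExpSse(__m128 x) {
    const __m128  a   = _mm_set1_ps(12102203.2f);    // (1 << 23) / ln(2)
    const __m128i b   = _mm_set1_epi32(127 * (1 << 23) - 486411);
    const __m128  m87 = _mm_set1_ps(-87.0f);         // x should be in [-87, 87]
    __m128  mask = _mm_cmpge_ps(x, m87);
    __m128i tmp  = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
```

At -O2 the `_mm_set1_*` calls with constant arguments compile to vector constants loaded from .rodata, so nothing is recomputed per call.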

_mm_extract_epi8(…) intrinsic that takes a non-literal integer as argument

别说谁变了你拦得住时间么 submitted on 2019-12-01 21:39:10
I've lately been using the SSE intrinsic int _mm_extract_epi8(__m128i src, const int ndx) which, according to the reference, "extracts an integer byte from a packed integer array element selected by index". This is exactly what I want. However, I determine the index via _mm_cmpestri on an __m128i, which performs a packed comparison of string data with explicit lengths and generates the index. The range of this index is 0..16, where 0..15 represents a valid index and 16 means that no index was found. Now, to extract the integer at the index position, I thought of doing the following: const int index
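
A note on why the straightforward call fails: `ndx` in `_mm_extract_epi8` must be a compile-time constant, since it is encoded into the instruction's immediate byte, so an index produced at runtime by `_mm_cmpestri` cannot be passed directly. A minimal sketch of one portable fallback (helper name is mine, not from the question): spill the vector to memory and index it.

```c
#include <smmintrin.h>
#include <stdint.h>

// Sketch only: extract a byte at a runtime index by storing the vector
// to a buffer; the mask keeps ndx inside 0..15.
static inline int extract_epi8_var(__m128i src, int ndx) {
    uint8_t buf[16];
    _mm_storeu_si128((__m128i *)buf, src);
    return buf[ndx & 15];
}
```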

New AVX instruction syntax

百般思念 submitted on 2019-12-01 18:43:20
I had some C code written with Intel intrinsics. After compiling it first with AVX and then with SSSE3 flags, I got two quite different assembly listings, e.g.: AVX: vpunpckhbw %xmm0, %xmm1, %xmm2 SSSE3: movdqa %xmm0, %xmm2 punpckhbw %xmm1, %xmm2 It's clear that vpunpckhbw is just punpckhbw using the AVX three-operand syntax. But are the latency and throughput of the first instruction equivalent to the latency and throughput of the last two combined? Or does the answer depend on the architecture I'm using? It's an Intel Core i5-6500, by the way. I tried to search for an answer in Agner Fog's instruction tables
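
A minimal sketch (function name mine) of the kind of source that produces both listings:

```c
#include <emmintrin.h>

// Sketch only: with -mavx this typically compiles to a single vpunpckhbw
// (three-operand, non-destructive); with the legacy SSE encoding the
// destination doubles as a source, so the compiler may add a movdqa to
// preserve a value it still needs.
__m128i interleave_high(__m128i a, __m128i b) {
    return _mm_unpackhi_epi8(a, b);
}
```

As for cost: on Skylake-family cores such as the i5-6500, register-to-register vector moves are usually eliminated at register rename, so the movdqa + punpckhbw pair matches vpunpckhbw in latency and mainly costs extra code size and front-end bandwidth.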

Are there unsigned equivalents of the x87 FILD and SSE CVTSI2SD instructions?

我只是一个虾纸丫 submitted on 2019-12-01 18:19:45
I want to implement the equivalent of C's uint-to-double cast in the GHC Haskell compiler. We already implement int-to-double using FILD or CVTSI2SD. Are there unsigned versions of these operations, or am I supposed to zero out the highest bit of the uint before the conversion (thus losing range)? You can exploit some properties of the IEEE double format and interpret the unsigned value as part of the mantissa while adding a carefully crafted exponent:

Bits:   63 (S) | 62-52 (Exp) | 51-0 (Mantissa)
Value:   0     |    1075     | 20 zero bits, followed by your unsigned int

The 1075 comes from the IEEE exponent bias (1023) plus the 52-bit mantissa width, so the bit pattern encodes 2^52 + u; subtracting 2^52 afterwards yields the unsigned value exactly.
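
A minimal sketch of that construction in C (function name mine):

```c
#include <stdint.h>
#include <string.h>

// Sketch only: build the IEEE-754 bit pattern for 2^52 + u,
// reinterpret it as a double, then subtract 2^52 exactly.
double uint32_to_double(uint32_t u) {
    uint64_t bits = ((uint64_t)1075 << 52) | u;  // sign 0, exponent field 1075
    double d;
    memcpy(&d, &bits, sizeof d);                 // bit-for-bit reinterpretation
    return d - 4503599627370496.0;               // 2^52
}
```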

Translating SSE to Neon: How to pack and then extract a 32-bit result

帅比萌擦擦* submitted on 2019-12-01 18:19:12
I have to translate the following instructions from SSE to Neon: uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a, SHUFFLE_MASK)); where: static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1); So basically I have to take the 4th, 8th, 12th, and 16th bytes from the register and put them into a uint32_t. It looks like a packing instruction (in SSE I seem to remember I used a shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions). How does this operation translate to Neon? Should I use
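
A minimal sketch of one possible NEON translation (helper name mine, not the asker's final code): the wanted bytes are the top byte of each little-endian 32-bit lane, so a shift plus two narrowing moves packs them together.

```c
#include <arm_neon.h>
#include <stdint.h>

// Sketch only: take byte 3 of every 32-bit lane, narrow 32->16->8 bits,
// and read the four packed bytes out as one uint32_t.
static inline uint32_t pack_top_bytes(uint8x16_t a) {
    uint32x4_t hi = vshrq_n_u32(vreinterpretq_u32_u8(a), 24);
    uint16x4_t n1 = vmovn_u32(hi);
    uint8x8_t  n2 = vmovn_u16(vcombine_u16(n1, n1));
    return vget_lane_u32(vreinterpret_u32_u8(n2), 0);
}
```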

How does this function compute the absolute value of a float through a NOT and AND operation?

佐手、 submitted on 2019-12-01 18:00:26
I am trying to understand how the following code snippet works. This program uses SIMD vector instructions (Intel SSE) to calculate the absolute value of 4 floats (so, basically, a vectorized "fabs()" function). Here is the snippet: #include <iostream> #include "xmmintrin.h" template <typename T> struct alignas(16) sse_t { T data[16/sizeof(T)]; }; int main() { sse_t<float> x; x.data[0] = -4.; x.data[1] = -20.; x.data[2] = 15.; x.data[3] = -143.; __m128 a = _mm_set_ps1(-0.0); // ??? __m128 xv = _mm_load_ps(x.data); xv = _mm_andnot_ps(a,xv); // <-- Computes absolute value sse_t<float> result;
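
The `-0.0` marked with `???` is the whole trick: its bit pattern is 0x80000000, i.e. only the sign bit set. Since `_mm_andnot_ps(a, xv)` computes `(~a) & xv`, the operation clears exactly the sign bit of every lane, which is `fabsf`. A minimal sketch in isolation (function name mine):

```c
#include <xmmintrin.h>

// Sketch only: clear the sign bit of all four lanes -> absolute value.
static inline __m128 abs_ps(__m128 x) {
    const __m128 sign_mask = _mm_set_ps1(-0.0f);  // 0x80000000 in each lane
    return _mm_andnot_ps(sign_mask, x);           // (~sign_mask) & x
}
```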

SSE vectorization of the math 'pow' function with gcc

戏子无情 submitted on 2019-12-01 16:43:47
I was trying to vectorize a loop that uses the 'pow' function from the math library. I am aware the Intel compiler supports 'pow' with SSE instructions, but I can't seem to get it to work with gcc (I think). This is the case I am working with: int main() { int i = 0; float a[256], b[256]; float x = 2.3; for (i = 0; i < 256; i++) { a[i] = 1.5; } for (i = 0; i < 256; i++) { b[i] = pow(a[i], x); } for (i = 0; i < 256; i++) { b[i] = a[i]*a[i]; } return 0; } I'm compiling with the following: gcc -O3 -Wall -ftree-vectorize -msse2 -ftree-vectorizer-verbose=5 code.c -o runthis This is on OS X 10.5.8 using
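
One detail worth flagging, plus a hedged modern sketch (this is a swapped-in Linux/glibc setup, not the asker's OS X 10.5 toolchain): the loop calls the double-precision `pow()` on `float` data, which by itself gets in the way of clean float vectorization; `powf` keeps the element types uniform, and with a vector math library the call itself can be vectorized.

```c
#include <math.h>

// Sketch only: on a glibc toolchain with libmvec, something like
//   gcc -O3 -ffast-math -mavx2 pow_loop.c -lm
// may vectorize this loop using libmvec's SIMD powf.
void pow_loop(const float *a, float *b, float x, int n) {
    for (int i = 0; i < n; i++)
        b[i] = powf(a[i], x);  // float variant matches the float arrays
}
```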

SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?

送分小仙女□ submitted on 2019-12-01 16:36:42
I'm experimenting with SSE4.2 and STTNI instructions and have gotten a strange result: PcmpEstrM (which works with explicit-length strings) runs twice as slow as PcmpIstrM (implicit-length strings). On my i7-3610QM the difference is 2366.2 ms vs. 1202.3 ms, i.e. 97%. On an i5-3470 the difference is not as huge, but still significant: 3206.2 ms vs. 2623.2 ms, i.e. 22%. Both are "Ivy Bridge"; it is strange that they show such different gaps (at least I can't see any technical differences in their specs: http://www.cpu-world.com/Compare_CPUs/Intel_AW8063801013511,Intel_CM8063701093302/ ). The Intel 64 and IA-32 Architectures Software Developer's Manual
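
For reference, a minimal sketch of the two forms at the intrinsics level (flags chosen arbitrarily for illustration): the explicit-length variant carries two extra integer operands for the string lengths, and that extra length handling is the usual explanation for pcmpestr* decoding to more uops than pcmpistr*.

```c
#include <nmmintrin.h>

// Sketch only: the same byte comparison via the implicit-length (I)
// and explicit-length (E) mask-producing forms.
__m128i cmp_implicit(__m128i needle, __m128i hay) {
    return _mm_cmpistrm(needle, hay,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
}

__m128i cmp_explicit(__m128i needle, int nlen, __m128i hay, int hlen) {
    return _mm_cmpestrm(needle, nlen, hay, hlen,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
}
```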

Using SIMD on amd64, when is it better to use more instructions vs. loading from memory?

≡放荡痞女 submitted on 2019-12-01 15:21:24
I have some highly performance-sensitive code. A SIMD implementation using SSEn and AVX uses about 30 instructions, while a version that uses a 4096-byte lookup table uses about 8 instructions. In a microbenchmark, the lookup table is faster by 40%. If I microbenchmark while trying to invalidate the cache every 100 iterations, they appear about the same. In my real program, it appears that the non-loading version is faster, but it's really hard to get a provably good measurement, and I've had measurements go both ways. I'm just wondering if there are some good ways to think about which one would be better
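
For what it's worth, a minimal sketch of the cache-invalidation step such a microbenchmark needs (table size from the question, helper name mine): evicting the table between iterations makes the lookup version pay realistic miss costs instead of always hitting L1.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t table[4096];  // the 4096-byte lookup table under test

// Sketch only: flush every cache line of the table, then fence so the
// flushes are ordered before the next timed section.
static void flush_table(void) {
    for (size_t i = 0; i < sizeof table; i += 64)  // 64-byte cache lines
        _mm_clflush(&table[i]);
    _mm_mfence();
}
```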