simd

Extracting ints and shorts from a struct using AVX?

╄→гoц情女王★ submitted on 2019-12-12 14:22:45
Question: I have a struct which contains a union between various data members and an AVX type, so that all the bytes can be loaded in one load. My code looks like: #include <immintrin.h> union S{ struct{ int32_t a; int32_t b; int16_t c; int16_t d; }; __m128i x; }; I'd like to use the AVX register to load the data all together and then separately extract the four members into int32_t and int16_t local variables. How would I go about doing this? I am unsure how I can separate the data members from each other when
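One possible way to do the extraction described in the question, sketched with SSE2 intrinsics only (`__m128i` is a 128-bit SSE type, so no AVX-specific instruction is required). The helper name `unpack_s` and the named inner struct `f` are assumptions made for this illustration; they are not from the original post.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Same field layout as the union in the question: 12 bytes of fields,
   padded to 16 by the __m128i member. The inner struct is named `f`
   here (an assumption) so the sketch does not rely on anonymous structs. */
typedef union {
    struct {
        int32_t a;
        int32_t b;
        int16_t c;
        int16_t d;
    } f;
    __m128i x;
} S;

/* Hypothetical helper: one 128-bit load, then one extraction per member. */
void unpack_s(const S *s, int32_t *a, int32_t *b, int16_t *c, int16_t *d) {
    __m128i v = _mm_load_si128(&s->x);             /* the union is 16-byte aligned */
    *a = _mm_cvtsi128_si32(v);                     /* bytes 0..3 */
    *b = _mm_cvtsi128_si32(_mm_srli_si128(v, 4));  /* bytes 4..7, shifted down */
    *c = (int16_t)_mm_extract_epi16(v, 4);         /* 16-bit lane 4 = bytes 8..9 */
    *d = (int16_t)_mm_extract_epi16(v, 5);         /* 16-bit lane 5 = bytes 10..11 */
}
```

`_mm_extract_epi16` zero-extends its result to `int`, so the cast back to `int16_t` is what restores the sign of negative values.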

Improving a recursive Hadamard transformation

梦想与她 submitted on 2019-12-12 14:13:32
Question: I have the following code to calculate a Hadamard transform. Right now, the hadamard function is the bottleneck of my program. Do you see any potential to speed it up? Maybe using AVX2 instructions? Typical input sizes are around 512 or 1024. Best, Tom #include <stdio.h> void hadamard(double *p, size_t len) { double tmp = 0.0; if(len == 2) { tmp = p[0]; p[0] = tmp + p[1]; p[1] = tmp - p[1]; } else { hadamard(p, len/2); hadamard(p+len/2, len/2); for(int i = 0; i < len/2; i++) { tmp = p[i]; p[i
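As a starting point before reaching for intrinsics, the recursion can be unrolled into plain loops. The sketch below is a non-recursive, scalar rewrite of the same butterfly scheme, not the original author's code. It assumes `len` is a power of two and at least 2, and the name `hadamard_iter` is made up for this example.

```c
#include <stddef.h>

/* Iterative in-place Hadamard transform (natural ordering).
   Assumption: len is a power of two, len >= 2. */
void hadamard_iter(double *p, size_t len) {
    /* h is the butterfly distance: 1, 2, 4, ..., len/2 */
    for (size_t h = 1; h < len; h *= 2) {
        /* each block of 2*h elements is processed independently */
        for (size_t i = 0; i < len; i += 2 * h) {
            /* contiguous inner loop: p[j] and p[j+h] do not overlap */
            for (size_t j = i; j < i + h; j++) {
                double x = p[j];
                double y = p[j + h];
                p[j] = x + y;
                p[j + h] = x - y;
            }
        }
    }
}
```

For `h >= 2` the inner loop reads and writes two disjoint contiguous runs, which is the shape compilers can usually auto-vectorize; whether that actually beats the recursive version on 512 or 1024 elements would need to be measured.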

SSE rms calculation

风格不统一 submitted on 2019-12-12 12:13:59
Question: I want to calculate the RMS with Intel SSE intrinsics. Like this: float rms( float *a, float *b , int l) { int n=0; float r=0.0; for(int i=0;i<l;i++) { if(finitef(a[i]) && finitef(b[i])) { n++; float tmp = a[i] - b[i]; r += tmp*tmp; } } r /= n; return r; } But how do I check which elements are NaN? And how do I count n? Answer 1: You can test a value for NaN by comparing the value with itself. x == x will return false if x is a NaN. So for an SSE vector of 4 x float values, vx: vmask = _mm_cmpeq_ps(vx,

What compilers besides gcc can vectorize code?

这一生的挚爱 submitted on 2019-12-12 12:13:47
Question: GCC can vectorize loops automatically when certain options are specified and given the right conditions. Are there other compilers widely available that can do the same? Answer 1: ICC Answer 2: LLVM can also do it, and Vector Pascal too, and one that is not free: VectorC. These are just some I remember. Answer 3: Also PGI's compilers. Answer 4: The Mono project, the open-source alternative to Microsoft's .NET Framework, has added objects that use SIMD instructions. While not a compiler, the Mono CLR is the

Permuting bytes inside SSE __m128i register

喜欢而已 submitted on 2019-12-12 10:37:21
Question: I have the following problem: in a __m128i register there are 16 8-bit values in the following order: [ 1, 5, 9, 13 ] [ 2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16] What I would like to achieve is to efficiently shuffle the bytes into this order: [ 1, 2, 3, 4 ] [ 5, 6, 7, 8] [9, 10, 11, 12] [13, 14, 15, 16] It is analogous to a 4x4 matrix transposition, but operating on 8-bit elements inside one register. Could you please point me to what kind of SSE (preferably <= SSE2) instructions are suitable
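One SSE2-only way to get this reordering is two rounds of byte interleaving between the low and high halves of the register. The sketch below is one possible approach, with the hypothetical name `transpose4x4_epi8`. With SSSE3 available, a single `_mm_shuffle_epi8` with a constant index vector would do the same job.

```c
#include <emmintrin.h>  /* SSE2 */

/* Transposes a 4x4 matrix of bytes held in one register.
   Input  (memory order): 1,5,9,13, 2,6,10,14, 3,7,11,15, 4,8,12,16
   Output (memory order): 1,2,3,4,  5,6,7,8,   9,10,11,12, 13,14,15,16 */
__m128i transpose4x4_epi8(__m128i x) {
    /* interleave low 8 bytes with high 8 bytes:
       1,3,5,7,9,11,13,15, 2,4,6,8,10,12,14,16 */
    __m128i t = _mm_unpacklo_epi8(x, _mm_srli_si128(x, 8));
    /* the same step again yields 1,2,3,...,16 */
    return _mm_unpacklo_epi8(t, _mm_srli_si128(t, 8));
}
```

Each round costs one byte shift and one unpack, so the whole transposition is four SSE2 instructions.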

Automatic vectorization of matrix multiplication

↘锁芯ラ submitted on 2019-12-12 10:19:12
Question: I'm fairly new to SIMD and wanted to see if I could get GCC to vectorise a simple action for me. So I looked at this post and wanted to do more or less the same thing (but with GCC 5.4.0 on 64-bit Linux, for a Kaby Lake processor). I essentially have this function: /* m1 = N x M matrix, m2 = M x P matrix, m3 = N x P matrix & output */ void mmul(double **m1, double **m2, double **m3, int N, int M, int P) { for (i = 0; i < N; i++) for (j = 0; j < P; j++) { double tmp = 0.0; for (k = 0; k
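A variant that compilers often find easier to vectorize is sketched below. It is not the asker's function: it assumes the matrices are stored as contiguous row-major arrays instead of `double **`, uses `restrict` to promise the compiler that the arrays do not overlap, and reorders the loops to i-k-j so the innermost loop is a contiguous multiply-add rather than a floating-point reduction (which GCC normally vectorizes only when reassociation is allowed, for example with `-ffast-math`). The name `mmul_flat` is made up for this example.

```c
#include <stddef.h>

/* m1 = N x M, m2 = M x P, m3 = N x P (output), all row-major and contiguous.
   Assumption: the three arrays do not alias each other. */
void mmul_flat(const double *restrict m1, const double *restrict m2,
               double *restrict m3, int N, int M, int P) {
    for (int i = 0; i < N; i++) {
        double *out = m3 + (size_t)i * P;
        for (int j = 0; j < P; j++) {
            out[j] = 0.0;
        }
        for (int k = 0; k < M; k++) {
            const double a = m1[(size_t)i * M + k];
            const double *row = m2 + (size_t)k * P;
            /* unit-stride inner loop with no loop-carried dependency */
            for (int j = 0; j < P; j++) {
                out[j] += a * row[j];
            }
        }
    }
}
```

With GCC, compiling with `-O3 -march=native` and `-fopt-info-vec` reports which loops were vectorized, which is a quick way to check whether the rewrite helped.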

Fastest way to multiply two vectors of 32bit integers in C++, with SSE

牧云@^-^@ submitted on 2019-12-12 09:53:19
Question: I have two unsigned vectors, both with size 4: vector<unsigned> v1 = {2, 4, 6, 8}; vector<unsigned> v2 = {1, 10, 11, 13}; Now I want to multiply these two vectors and get a new one: vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13}; What is the SSE operation to use? Is it cross-platform or only available on some specific platforms? Adding: If my goal were addition rather than multiplication, I could do this super fast: __m128i a = _mm_set_epi32(1,2,3,4); __m128i b = _mm_set_epi32(1,2,3,4); __m128i c; c = _mm_add_epi32
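For the multiplication itself, SSE4.1 provides `_mm_mullo_epi32`, which multiplies four 32-bit lanes in one instruction. On CPUs limited to SSE2, a commonly used workaround builds the same result from `_mm_mul_epu32`; the sketch below shows that workaround, with the hypothetical name `mul32_sse2`. Either way, each lane keeps only the low 32 bits of its product.

```c
#include <emmintrin.h>  /* SSE2 */

/* Lane-wise 32-bit multiply (low 32 bits of each product), SSE2 only. */
__m128i mul32_sse2(__m128i a, __m128i b) {
    /* _mm_mul_epu32 multiplies lanes 0 and 2, giving two 64-bit products */
    __m128i even = _mm_mul_epu32(a, b);
    /* shift lanes 1 and 3 down into positions 0 and 2, then multiply */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* gather the low 32 bits of each 64-bit product into lanes 0 and 1 */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd = _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0));
    /* interleave back into the original lane order: p0, p1, p2, p3 */
    return _mm_unpacklo_epi32(even, odd);
}
```

SSE intrinsics are specific to x86 and x86-64; they are supported by GCC, Clang, ICC, and MSVC on those targets, but other architectures need their own intrinsics or a portable fallback.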

How to deinterleave image channels in SSE

六眼飞鱼酱① submitted on 2019-12-12 05:11:53
Question: Is there any way to deinterleave 32bpp image channels in SSE, similar to the NEON code below? //Read all r,g,b,a pixels into 4 registers uint8x8x4_t SrcPixels8x8x4= vld4_u8(inPixel32); ChannelR1_32x4 = vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[0]))), channelR2_32x4 = vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[0]))), vGaussElement_32x4_high); Basically, I want all color channels in separate vectors, with every vector holding 4 elements of 32 bits, to do some calculation, but I am not very
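SSE has no direct equivalent of `vld4_u8`, but when the goal is four 32-bit lanes per channel, shifting and masking each 32-bit pixel is enough. The sketch below handles 4 pixels per call and assumes the bytes are stored in R, G, B, A order in memory; the function name `deinterleave4_rgba` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Splits 4 packed 32bpp pixels into one vector per channel,
   each channel value zero-extended to 32 bits.
   Assumption: memory byte order is R, G, B, A. */
void deinterleave4_rgba(const uint8_t *px,
                        __m128i *r, __m128i *g, __m128i *b, __m128i *a) {
    const __m128i v = _mm_loadu_si128((const __m128i *)px);
    const __m128i low8 = _mm_set1_epi32(0xFF);
    *r = _mm_and_si128(v, low8);                      /* byte 0 of each pixel */
    *g = _mm_and_si128(_mm_srli_epi32(v, 8), low8);   /* byte 1 */
    *b = _mm_and_si128(_mm_srli_epi32(v, 16), low8);  /* byte 2 */
    *a = _mm_srli_epi32(v, 24);                       /* byte 3, already isolated */
}
```

If floating-point arithmetic follows, each result can be converted with `_mm_cvtepi32_ps`.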

Pairwise addition in NEON

瘦欲@ submitted on 2019-12-12 03:35:39
Question: I want to add the values at indices 0 and 1 of an int64x2_t vector in NEON. I am not able to find any pairwise-add instruction that provides this functionality. int64x2_t sum_64_2; //I am expecting result should be.. //int64_t result = sum_64_2[0] + sum_64_2[1]; Is there any NEON instruction that does this? Answer 1: You can write it in two ways. This one explicitly uses the NEON VADD.I64 instruction: int64x1_t f(int64x2_t v) { return vadd_s64 (vget_high_s64 (v), vget_low_s64 (v)); } and the
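Since the question asks for a scalar `int64_t`, the first approach from the answer can be finished with `vget_lane_s64`. The sketch below wraps it in a hypothetical helper, `sum_two_lanes`, that takes the two values from memory; the non-NEON branch is only an assumption added so the example also builds on other targets.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Returns v[0] + v[1]; uses NEON when the target supports it. */
int64_t sum_two_lanes(const int64_t v[2]) {
#if defined(__ARM_NEON)
    int64x2_t q = vld1q_s64(v);
    /* add the high half to the low half, as in the answer */
    int64x1_t s = vadd_s64(vget_high_s64(q), vget_low_s64(q));
    /* move the single 64-bit lane into a scalar */
    return vget_lane_s64(s, 0);
#else
    /* portable fallback for non-ARM builds */
    return v[0] + v[1];
#endif
}
```

On AArch64 the across-vector intrinsic `vaddvq_s64` returns the same sum directly from an `int64x2_t`.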

SIMD alignment issue with PPL Combinable

☆樱花仙子☆ submitted on 2019-12-12 02:16:59
Question: I'm trying to sum the elements of an array in parallel with SIMD. To avoid locking, I'm using a combinable thread-local value, which is not always aligned on 16 bytes; because of that, _mm_add_epi32 throws an exception. concurrency::combinable<__m128i> sum_combine; int length = 40; // multiple of 8 concurrency::parallel_for(0, length , 8, [&](int it) { __m128i v1 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it)); __m128i v2 = _mm_load_si128(reinterpret_cast<__m128i*>(input_arr + it + sizeof
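If the fault really does come from a misaligned accumulator, one workaround is to keep the per-thread partial sum as four plain `int32_t` values and touch it only through unaligned loads and stores. The sketch below shows that pattern in plain C, outside of PPL; the name `accumulate4_unaligned` and the calling convention are assumptions made for illustration.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Adds n int32 values from src (n a multiple of 4) into the 4-lane
   accumulator acc. Neither pointer needs to be 16-byte aligned. */
void accumulate4_unaligned(int32_t acc[4], const int32_t *src, size_t n) {
    /* unaligned load: safe for any address */
    __m128i sum = _mm_loadu_si128((const __m128i *)acc);
    for (size_t i = 0; i + 4 <= n; i += 4) {
        sum = _mm_add_epi32(sum, _mm_loadu_si128((const __m128i *)(src + i)));
    }
    /* unaligned store back into the accumulator */
    _mm_storeu_si128((__m128i *)acc, sum);
}
```

The running sum lives in a register for the whole loop, so the unaligned accesses to the accumulator happen only once per call; the four lanes still have to be added together after all threads have finished.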