simd

NEON vectorize sum of products of unsigned bytes: (a[i]-int1) * (b[i]-int2)

一笑奈何 posted on 2019-12-05 02:05:16
Question: I need to improve a loop because it is called by my application thousands of times. I suppose I need to do it with NEON, but I don't know where to begin.

Assumptions / pre-conditions: w is always 320 (a multiple of 16/32); pa and pb are 16-byte aligned; ma and mb are positive.

    int whileInstruction(const unsigned char *pa, const unsigned char *pb, int ma, int mb, int w)
    {
        int sum = 0;
        do {
            sum += ((*pa++) - ma) * ((*pb++) - mb);
        } while (--w);
        return sum;
    }

This attempt at vectorizing it is not working well,
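One possible starting point, offered only as a sketch and not as the asker's attempt or any posted answer: expanding the product gives sum((a-ma)*(b-mb)) = sum(a*b) - mb*sum(a) - ma*sum(b) + w*ma*mb, so the hot loop reduces to three plain unsigned accumulations, a shape that suits widening multiply-accumulate instructions and compiler auto-vectorization. The function names below are hypothetical, and the 32-bit accumulators assume w stays small (such as the 320 stated above).

```cpp
#include <cassert>
#include <cstdint>

// Reference: same arithmetic as the loop in the question.
int sum_products_reference(const unsigned char *pa, const unsigned char *pb,
                           int ma, int mb, int w)
{
    int sum = 0;
    for (int i = 0; i < w; ++i)
        sum += (pa[i] - ma) * (pb[i] - mb);
    return sum;
}

// Hypothetical rewrite: accumulate sum(a*b), sum(a) and sum(b) separately,
// then apply the ma/mb offsets once after the loop.
int sum_products_expanded(const unsigned char *pa, const unsigned char *pb,
                          int ma, int mb, int w)
{
    assert(w > 0);
    std::uint32_t sab = 0, sa = 0, sb = 0;
    for (int i = 0; i < w; ++i) {
        sab += static_cast<std::uint32_t>(pa[i]) * pb[i];
        sa += pa[i];
        sb += pb[i];
    }
    // Combine in 64 bits so the intermediate terms cannot overflow.
    const std::int64_t r = static_cast<std::int64_t>(sab)
                         - static_cast<std::int64_t>(mb) * sa
                         - static_cast<std::int64_t>(ma) * sb
                         + static_cast<std::int64_t>(w) * ma * mb;
    return static_cast<int>(r);
}
```

The loop body no longer depends on ma or mb, which is the point of the rewrite; whether it is actually faster on a given NEON target would have to be measured.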

How to compile SIMD code with gcc

删除回忆录丶 posted on 2019-12-05 01:39:19
Question: I wrote this code for matrix multiplication using SIMD, which I was able to compile in Visual Studio, but now I need to compile it on Ubuntu using gcc/g++. Which commands should I use to compile this? Do I need to make any changes to the code itself?

    #include <stdio.h>
    #include <stdlib.h>
    #include <xmmintrin.h>
    #include <iostream>
    #include <conio.h>
    #include <math.h>
    #include <ctime>
    using namespace std;
    #define MAX_NUM 1000
    #define MAX_DIM 252
    int main()
    {
        int l = MAX_DIM, m = MAX_DIM, n = MAX

SIMD/SSE: How to check that all vector elements are non-zero

China☆狼群 posted on 2019-12-05 01:37:01
I need to check that all vector elements are non-zero. So far I have found the following solution. Is there a better way to do this? I am using gcc 4.8.2 on Linux/x86_64, with instructions up to SSE4.2.

    typedef char ChrVect __attribute__((vector_size(16), aligned(16)));

    inline bool testNonzero(ChrVect vect)
    {
        const ChrVect vzero = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
        return (0 == (__int128_t)(vzero == vect));
    }

Update: the code above compiles to the following assembly (when compiled as a non-inline function):

    movdqa %xmm0, -24(%rsp)
    pxor %xmm0, %xmm0
    pcmpeqb -24(%rsp), %xmm0
    movdqa %xmm0, -24(%rsp)
    movq -24(
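For comparison, here is a sketch of a commonly used SSE2 pattern for the same check: compare every byte against zero, then collapse the comparison result into a 16-bit mask. It is not taken from an answer to this question. The function name is hypothetical, the input is assumed to be 16 readable bytes, and a scalar fallback is included so the sketch still builds where SSE2 is unavailable.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Returns true when none of the 16 bytes starting at p is zero.
bool all_bytes_nonzero(const std::uint8_t *p)
{
    assert(p != nullptr);
#if defined(__SSE2__)
    // Unaligned load, so the caller does not need 16-byte alignment.
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
    // Each byte of eq is 0xFF where the input byte is zero, 0x00 otherwise.
    const __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    // One mask bit per byte; a zero mask means no input byte was zero.
    return _mm_movemask_epi8(eq) == 0;
#else
    for (int i = 0; i < 16; ++i)
        if (p[i] == 0)
            return false;
    return true;
#endif
}
```

Keeping the result in a general-purpose register through the movemask step avoids the store to the stack that appears in the assembly quoted above.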

Vectorize a function in clang

六月ゝ 毕业季﹏ posted on 2019-12-05 00:54:04
I am trying to vectorize the following function with clang according to this clang reference. It takes a vector of bytes and applies a mask according to this RFC.

    static void apply_mask(vector<uint8_t> &payload, uint8_t (&masking_key)[4])
    {
    #pragma clang loop vectorize(enable) interleave(enable)
        for (size_t i = 0; i < payload.size(); i++) {
            payload[i] = payload[i] ^ masking_key[i % 4];
        }
    }

The following flags are passed to clang: -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize

However, the vectorization fails with the following error: WebSocket.cpp:5: WebSocket.h:14: In file
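The diagnostic is cut off above, so the following is not a claim about its cause. It is a sketch of one way to restructure the loop so that the modulo index disappears from the main body: the four key bytes are copied into a 32-bit word and XORed against the payload four bytes at a time, with a byte-wise tail. The name apply_mask_words is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// XOR the payload with a repeating 4-byte key, one 32-bit word per step.
void apply_mask_words(std::vector<std::uint8_t> &payload,
                      const std::uint8_t (&masking_key)[4])
{
    std::uint8_t *data = payload.data();
    const std::size_t n = payload.size();
    assert(n == 0 || data != nullptr);

    // memcpy keeps the key bytes in memory order, so the result does not
    // depend on the machine's endianness.
    std::uint32_t key;
    std::memcpy(&key, masking_key, 4);

    const std::size_t whole = n - n % 4;  // bytes covered by full words
    for (std::size_t i = 0; i < whole; i += 4) {
        std::uint32_t word;
        std::memcpy(&word, data + i, 4);
        word ^= key;
        std::memcpy(data + i, &word, 4);
    }
    // Remaining 0 to 3 bytes; i % 4 still selects the right key byte
    // because whole is a multiple of 4.
    for (std::size_t i = whole; i < n; ++i)
        data[i] ^= masking_key[i % 4];
}
```

Using memcpy for the word loads and stores avoids alignment and aliasing problems; compilers typically lower these fixed-size copies to plain loads and stores.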

Fastest way to compute distance squared

狂风中的少年 posted on 2019-12-04 23:27:43
My code relies heavily on computing distances between two points in 3D space. To avoid the expensive square root I use the squared distance throughout. But still it takes up a major fraction of the computing time and I would like to replace my simple function with something even faster. I now have:

    double distance_squared(double *a, double *b)
    {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return dx*dx + dy*dy + dz*dz;
    }

I also tried using a macro to avoid the function call but it doesn't help much.

    #define DISTANCE_SQUARED(a, b) ((a)[0]-(b)[0])*((a)[0]-(b)[0]) +
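A single three-component distance leaves little room for speed-up on its own. If many distances to the same point are needed, one option is to compute them in a batch over a structure-of-arrays layout, which gives the compiler a simple loop it can vectorize. The sketch below assumes such a layout (separate x, y and z arrays), which the excerpt does not show; the function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Squared distance from the point (px, py, pz) to each of n points stored
// as separate coordinate arrays x, y, z. Results are written to out.
void distance_squared_batch(const double *x, const double *y, const double *z,
                            double px, double py, double pz,
                            double *out, std::size_t n)
{
    assert(n == 0 || (x && y && z && out));
    for (std::size_t i = 0; i < n; ++i) {
        const double dx = x[i] - px;
        const double dy = y[i] - py;
        const double dz = z[i] - pz;
        out[i] = dx * dx + dy * dy + dz * dz;
    }
}
```

Whether this helps depends on the access pattern of the real program, so it is worth profiling before restructuring any data.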

Shift elements to the left of a SIMD register based on boolean mask

蹲街弑〆低调 posted on 2019-12-04 20:57:40
This question is related to this one: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector. I would like to create an optimal function with this signature:

    __m256i PackLeft(__m256i inputVector, __m256i boolVector);

The desired behaviour is that, given an input of 64-bit ints like this:

    inputVector = {42, 17, 13, 3}
    boolVector = {true, false, true, false}

it masks out all values that have false in boolVector and then repacks the remaining values to the left. For the input above, the return value should be:

    {42, 13, X, X}

... where X is "I don't care". An obvious way to do this is to use _mm
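To pin down the behaviour being asked for, here is a scalar reference model of the left-pack operation on four 64-bit lanes. It is a specification sketch that a SIMD version could be tested against, not an optimized implementation. The name pack_left_reference, the use of std::array in place of __m256i, and the choice to fill the "don't care" lanes with 0 are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Lanes whose mask entry is non-zero are kept and packed to the left, in
// order. Unused trailing lanes ("X" in the text) are set to 0, an arbitrary
// choice. If kept is non-null it receives the number of surviving lanes.
std::array<std::int64_t, 4> pack_left_reference(
    const std::array<std::int64_t, 4> &input,
    const std::array<std::int64_t, 4> &mask,
    int *kept = nullptr)
{
    std::array<std::int64_t, 4> out{};
    int n = 0;
    for (int i = 0; i < 4; ++i)
        if (mask[i] != 0)
            out[n++] = input[i];
    assert(n <= 4);
    if (kept != nullptr)
        *kept = n;
    return out;
}
```

With the example from the text (mask lanes written as -1 for true and 0 for false), the first two output lanes are 42 and 13.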

Is there any guarantee that all threads in a WaveFront (OpenCL) are always synchronized?

雨燕双飞 posted on 2019-12-04 19:52:40
As is well known, there are warps (in CUDA) and wavefronts (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf

Warp in CUDA: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture

4.1. SIMT Architecture ... A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete

The correct way to sum two arrays with SSE2 SIMD in C++

末鹿安然 posted on 2019-12-04 19:27:20
Let's start by including the following:

    #include <ctime>   // for time()
    #include <vector>
    #include <random>
    using namespace std;

Now, suppose that one has the following three std::vector<float>:

    const int N = 1048576;
    vector<float> a(N);
    vector<float> b(N);
    vector<float> c(N);
    default_random_engine randomGenerator(time(0));
    uniform_real_distribution<float> diceroll(0.0f, 1.0f);
    for (int i = 0; i < N; i++) {
        a[i] = diceroll(randomGenerator);
        b[i] = diceroll(randomGenerator);
    }

Now, assume that one needs to sum a and b element-wise and store the result in c, which in scalar form looks like the following:

    for (int i = 0; i < N; i++) {
        c[i] = a[i]
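A minimal sketch of how the element-wise sum could be written with SSE intrinsics, not necessarily the approach the original post goes on to describe: four floats are added per step, and a scalar loop handles any leftover elements. The function name add_arrays is hypothetical. Unaligned loads and stores are used because std::vector's default allocator is not assumed here to return 16-byte-aligned storage, and a scalar fallback covers builds without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// c[i] = a[i] + b[i] for all i; the three vectors must have equal size.
void add_arrays(const std::vector<float> &a, const std::vector<float> &b,
                std::vector<float> &c)
{
    assert(a.size() == b.size() && a.size() == c.size());
    const std::size_t n = a.size();
    std::size_t i = 0;
#if defined(__SSE2__)
    // Main loop: one 128-bit register holds four floats.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(&a[i]);
        const __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
#endif
    // Tail (or the whole range when SSE2 is unavailable).
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

For a loop this simple, an optimizing compiler will often generate equivalent vector code from the scalar form on its own, so it is worth comparing both before committing to intrinsics.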

SSE2: Double precision log function

若如初见. posted on 2019-12-04 18:43:41
Question: I need an open-source (no restriction on license) implementation of the log function, something with the signature __m128d _mm_log_pd(__m128d); It is available in the Intel Short Vector Math Library (part of ICC), but ICC is neither free nor open source. I am looking for an implementation using intrinsics only. It should use special rational function approximations. I need something almost as accurate as the cmath log, say 9-10 decimal digits, but faster.

Answer 1: Take a look at AMD LibM. It isn't open source, but

SSE - AVX conversion from double to char

别说谁变了你拦得住时间么 posted on 2019-12-04 18:34:19
I want to convert a vector of double-precision values to char. I have to implement two distinct approaches, one for SSE2 and the other for AVX2. I started with AVX2.

    __m128i sub_proc(__m256d& in)
    {
        __m256d _zero_pd = _mm256_setzero_pd();
        __m256d ih_pd = _mm256_unpackhi_pd(in,_zero_pd);
        __m256d il_pd = _mm256_unpacklo_pd(in,_zero_pd);
        __m128i ih_si = _mm256_cvtpd_epi32(ih_pd);
        __m128i il_si = _mm256_cvtpd_epi32(il_pd);
        ih_si = _mm_shuffle_epi32(ih_si,_MM_SHUFFLE(3,1,2,0));
        il_si = _mm_shuffle_epi32(il_si,_MM_SHUFFLE(3,1,2,0));
        ih_si = _mm_packs_epi32(_mm_unpacklo_epi32(il_si,ih_si),_mm_unpackhi