simd | 易学教程

How to convert unsigned char to signed integer using Neon SIMD

Question: How do I convert a variable of type uint8_t to int32_t using Neon? I could not find any intrinsic for doing this.

Answer: Assuming you want to convert a vector of 16 x 8-bit ints to four vectors of 4 x 32-bit ints, you can do this by first unpacking to 16 bits and then again to 32 bits:

    // load 8-bit vector
    uint8x16_t v = vld1q_u8(p); // load vector of 16 x 8-bit ints from p

    // unpack to 16 bits
    int16x8_t vl = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(v)));  // 0..7
    int16x8_t vh = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(v))); // 8..15

    // unpack to 32 bits
    int32x4_t vll = vmovl_s16(vget_low_s16
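The excerpt above breaks off partway through the 32-bit step. As a rough sketch of the two-stage widening it describes, the function below converts 16 bytes into 16 int32_t values. The name widen_u8_to_s32 and the scalar fallback are assumptions added for illustration, not part of the original answer; the fallback only exists so the code also builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widens 16 unsigned bytes at src into 16 int32_t at dst. */
void widen_u8_to_s32(const uint8_t *src, int32_t *dst)
{
#if defined(__ARM_NEON)
    uint8x16_t v = vld1q_u8(src);                                    /* 16 x u8 */

    /* Stage 1: zero-extend each half to 16 bits, then view as signed. */
    int16x8_t lo = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(v)));  /* 0..7  */
    int16x8_t hi = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(v))); /* 8..15 */

    /* Stage 2: sign-extend each quarter to 32 bits (values are 0..255,
       so the sign bit is never set) and store. */
    vst1q_s32(dst + 0,  vmovl_s16(vget_low_s16(lo)));
    vst1q_s32(dst + 4,  vmovl_s16(vget_high_s16(lo)));
    vst1q_s32(dst + 8,  vmovl_s16(vget_low_s16(hi)));
    vst1q_s32(dst + 12, vmovl_s16(vget_high_s16(hi)));
#else
    /* Scalar fallback (assumption, for non-NEON builds). */
    for (int i = 0; i < 16; ++i)
        dst[i] = (int32_t)src[i];
#endif
}
```

Because the first stage zero-extends, bytes above 127 stay positive after the reinterpret and the second, sign-extending stage.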

Existence of “simd reduction(:)” In GCC and MSVC?

Question: The simd pragma can be used with the icc compiler to perform a reduction:

    #pragma simd
    #pragma simd reduction(+:acc)
    #pragma ivdep
    for(int i( 0 ); i < N; ++i )
    {
        acc += x[i];
    }

Is there an equivalent solution in MSVC and/or GCC? Ref (p. 28): http://d3f8ykwhia686p.cloudfront.net/1live/intel/CompilerAutovectorizationGuide.pdf

Answer: GCC definitely can vectorize. Suppose you have a file reduc.c with these contents:

    int foo(int *x, int N)
    {
        int acc = 0, i; /* initialized so the sum starts from zero */
        for( i = 0; i < N; ++i )
        {
            acc += x[i];
        }
        return acc;
    }

Compile it (I used gcc 4.7.2) with the command line:

    $ gcc -O3 -S reduc.c -ftree-vectorize -msse2

Now you can
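A portable way to state the same intent is the OpenMP 4.0 simd construct, which has a reduction clause much like the icc pragma. The sketch below is an illustration rather than part of the original answer: sum_ints is a hypothetical name, and it assumes a compiler with OpenMP SIMD support switched on (for GCC, -fopenmp-simd or -fopenmp). Without that option the pragma is simply ignored and the loop still computes the same result.

```c
/* Hypothetical helper: sums n ints, hinting that the loop may be vectorized
   with acc treated as a reduction variable. */
int sum_ints(const int *x, int n)
{
    int acc = 0;

    /* Only takes effect when OpenMP SIMD support is enabled at compile time. */
    #pragma omp simd reduction(+:acc)
    for (int i = 0; i < n; ++i)
        acc += x[i];

    return acc;
}
```

A possible build line, assuming GCC: gcc -O2 -fopenmp-simd -c reduc.c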

Using __m256d registers

Question: How do you use __m256d? Say I want to use the Intel AVX instruction _mm256_add_pd on a simple Vector3 class with three 64-bit double-precision components (x, y, and z). What is the correct way to use this? Since x, y and z are members of the Vector3 class, can I declare them in a union with an __m256d variable?

    union Vector3
    {
        struct { double x,y,z ; } ;
        __m256d _register ; // the Intel register?
    } ;

Then can I go:

    Vector3 add( const Vector3& o )
    {
        Vector3 result;
        result._register = _mm256
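One alternative to the union is to keep the components in a plain array of four doubles and move them through the register with unaligned loads and stores, which avoids relying on the 32-byte alignment that __m256d requires. This is a minimal sketch under stated assumptions, not the answer from the original thread: the Vec3 layout with a fourth padding lane and the name vec3_add are hypothetical, and the scalar branch is there only for builds without AVX enabled (for GCC or Clang, AVX is enabled with -mavx).

```c
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical layout: v[0..2] hold x, y, z; v[3] is padding so the struct
   fills all four double lanes of a 256-bit register. */
typedef struct { double v[4]; } Vec3;

/* Hypothetical helper: component-wise a + b. */
Vec3 vec3_add(Vec3 a, Vec3 b)
{
    Vec3 r;
#if defined(__AVX__)
    __m256d va = _mm256_loadu_pd(a.v);            /* unaligned load of 4 doubles */
    __m256d vb = _mm256_loadu_pd(b.v);
    _mm256_storeu_pd(r.v, _mm256_add_pd(va, vb)); /* one add covers all lanes */
#else
    /* Scalar fallback (assumption, for builds without AVX). */
    for (int i = 0; i < 4; ++i)
        r.v[i] = a.v[i] + b.v[i];
#endif
    return r;
}
```

The padding lane is added along with the others; it costs nothing extra but should be kept at a harmless value such as 0.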

How to convert 'long long' (or __int64) to __m64

Question: What is the proper way to convert an __int64 value to an __m64 value for use with SSE?

Answer 1: With gcc you can just use _mm_set_pi64x:

    #include <mmintrin.h>

    __int64 i = 0x123456LL;
    __m64 v = _mm_set_pi64x(i);

Note that not all compilers define _mm_set_pi64x in mmintrin.h. For gcc it is defined like this:

    extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
    _mm_set_pi64x (long long __i)
    {
      return (__m64) __i;
    }

which suggests that you could probably just

Optimal SSE unsigned 8 bit compare

Question: I'm trying to find the most efficient way of performing 8-bit unsigned compares using SSE (up to SSE 4.2). The most common case I'm working on is comparing for > 0U, e.g.

    _mm_cmpgt_epu8(v, _mm_setzero_si128()) // #1

(which of course can also be considered a simple test for non-zero). But I'm also somewhat interested in the more general case, e.g.

    _mm_cmpgt_epu8(v1, v2) // #2

The first case can be implemented with 2 instructions, using various different methods, e.g. compare with 0 and then invert
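For the general case (#2), SSE2 only provides a signed byte compare, _mm_cmpgt_epi8. One common workaround, sketched here as an illustration and not necessarily the approach the thread settled on, is to flip the top bit of both operands so that unsigned order maps onto signed order. The name cmpgt_u8x16 and the array-based interface are assumptions made so the example is self-contained; the scalar branch covers builds where __SSE2__ is not defined.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: for 16 bytes, out[i] = 0xFF if a[i] > b[i]
   (unsigned comparison), else 0x00. */
void cmpgt_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if defined(__SSE2__)
    /* XOR with 0x80 maps 0..255 onto -128..127 while preserving order. */
    const __m128i bias = _mm_set1_epi8((char)-128);
    __m128i va = _mm_xor_si128(_mm_loadu_si128((const __m128i *)a), bias);
    __m128i vb = _mm_xor_si128(_mm_loadu_si128((const __m128i *)b), bias);
    _mm_storeu_si128((__m128i *)out, _mm_cmpgt_epi8(va, vb));
#else
    /* Scalar fallback (assumption, for non-SSE2 builds). */
    for (int i = 0; i < 16; ++i)
        out[i] = (a[i] > b[i]) ? 0xFF : 0x00;
#endif
}
```

This costs two XORs plus the compare; when one operand is a constant, its biased form can be computed once outside the loop.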

How do I Perform Integer SIMD operations on the iPad A4 Processor?

Question: I feel the need for speed. Double for loops are killing my iPad app's performance. I need SIMD. How do I perform integer SIMD operations on the iPad A4 processor? Thanks, Doug

Answer (Shervin Emami): To get the fastest speed, you will have to write ARM assembly code that uses NEON SIMD operations, because C compilers generally don't generate very good SIMD code, so hand-written assembly will make a big difference. I have a brief intro here: http://www.shervinemami.co.cc/iphoneAssembly.html Note that the iPad A4 uses the ARMv7-A CPU, so the reference manual for the NEON SIMD instructions is at:

Are GPU/CUDA cores SIMD ones?

Question: Let's take the nVidia Fermi Compute Architecture. It says:

    The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. [...] Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). [...] In Fermi, the newly designed integer ALU supports full 32-bit precision

An accumulated computing error in SSE version of algorithm of the sum of squared differences

I was trying to optimize the following code (sum of squared differences for two arrays):

    inline float Square(float value)
    {
        return value*value;
    }

    float SquaredDifferenceSum(const float * a, const float * b, size_t size)
    {
        float sum = 0;
        for(size_t i = 0; i < size; ++i)
            sum += Square(a[i] - b[i]);
        return sum;
    }

So I optimized it using the CPU's SSE instructions:

    inline void SquaredDifferenceSum(const float * a, const float * b, size_t i, __m128 & sum)
    {
        __m128 _a = _mm_loadu_ps(a + i);
        __m128 _b = _mm_loadu_ps(b + i);
        __m128 _d = _mm_sub_ps(_a, _b);
        sum = _mm_add_ps(sum, _mm_mul_ps(
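The excerpt is cut off, so the following is only a sketch of how a complete SSE version of this kind might look, written in C rather than the C++ of the question. The name squared_difference_sum, the handling of the tail, and the scalar-only branch are assumptions. It also makes the source of the discrepancy in the title visible: the vector loop keeps four independent partial sums that are combined at the end, so the additions happen in a different order than in the scalar loop, and float addition is not associative.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: sum over i of (a[i] - b[i])^2, in single precision. */
float squared_difference_sum(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;

#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();            /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }
    /* Horizontal add: combine the four lanes into one float. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif

    /* Remaining elements (or all of them when SSE is unavailable). */
    for (; i < n; ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```

For inputs whose squares are small integers both orderings give identical results; the difference only shows up once rounding occurs in the intermediate sums, typically with long arrays.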

How to speed up calculation of integral image?

I often need to calculate an integral image. This is a simple algorithm:

    void integral_sum(const uint8_t * src, size_t src_stride, size_t width, size_t height, uint32_t * sum, size_t sum_stride)
    {
        memset(sum, 0, (width + 1) * sizeof(uint32_t));
        sum += sum_stride + 1;
        for (size_t row = 0; row < height; row++)
        {
            uint32_t row_sum = 0;
            sum[-1] = 0;
            for (size_t col = 0; col < width; col++)
            {
                row_sum += src[col];
                sum[col] = row_sum + sum[col - sum_stride];
            }
            src += src_stride;
            sum += sum_stride;
        }
    }

And I have a question: can I speed up this algorithm (for example, using SSE or AVX)?
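To make the buffer layout concrete, here is the same scalar algorithm restated in C together with a small query helper. The output table has one extra row and column of zeros, so it needs (height + 1) rows of at least width + 1 entries. The helper box_sum is hypothetical and not from the question; the row above is indexed as (sum - sum_stride)[col], a small rewrite that avoids unsigned wraparound in the index. This is a reference sketch, not a vectorized version.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar integral image; sum holds (height + 1) rows of sum_stride entries,
   with sum_stride >= width + 1. */
void integral_sum(const uint8_t *src, size_t src_stride, size_t width, size_t height,
                  uint32_t *sum, size_t sum_stride)
{
    memset(sum, 0, (width + 1) * sizeof(uint32_t)); /* zero the top border row */
    sum += sum_stride + 1;                          /* skip border row and column */
    for (size_t row = 0; row < height; row++) {
        uint32_t row_sum = 0;
        sum[-1] = 0;                                /* zero the left border */
        for (size_t col = 0; col < width; col++) {
            row_sum += src[col];
            sum[col] = row_sum + (sum - sum_stride)[col]; /* add the row above */
        }
        src += src_stride;
        sum += sum_stride;
    }
}

/* Hypothetical helper: sum of pixels in rows [y0, y1) and columns [x0, x1),
   read from the table produced by integral_sum. */
uint32_t box_sum(const uint32_t *sum, size_t sum_stride,
                 size_t x0, size_t y0, size_t x1, size_t y1)
{
    return sum[y1 * sum_stride + x1] - sum[y0 * sum_stride + x1]
         - sum[y1 * sum_stride + x0] + sum[y0 * sum_stride + x0];
}
```

For a 2 x 2 image with rows (1, 2) and (3, 4), the table needs 3 x 3 entries, and the right-hand column is queried with box_sum(sum, 3, 1, 0, 2, 2).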

C++ Adding 2 arrays together quickly

Question: Given the arrays:

    int canvas[10][10];
    int addon[10][10];

where all the values range from 0 to 100, what is the fastest way in C++ to add those two arrays so that each cell in canvas equals itself plus the corresponding cell value in addon? I.e., I want to achieve something like:

    canvas += addon;

So if canvas[0][0] = 3 and addon[0][0] = 2, then canvas[0][0] = 5. Speed is essential here as I am writing a very simple program to brute force a knapsack type problem and there will be tens of millions of
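As a baseline to measure against, the operation can be written as a plain nested loop. The sketch below is an assumption-laden starting point rather than a claim about the fastest approach: add_grid, ROWS and COLS are hypothetical names, and whether the compiler turns the loop into SIMD instructions depends on the compiler and its optimization flags.

```c
enum { ROWS = 10, COLS = 10 };

/* Hypothetical helper: canvas[r][c] += addon[r][c] for every cell. */
void add_grid(int canvas[ROWS][COLS], int addon[ROWS][COLS])
{
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            canvas[r][c] += addon[r][c];
}
```

Since the values stay between 0 and 100, a narrower element type would pack more cells into each vector register if the loop is vectorized, at the cost of watching for overflow after repeated additions.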