simd

Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]

风流意气都作罢 submitted on 2019-12-17 16:44:40
Question: I am using SSE intrinsics to determine whether a rectangle (defined by four int32 values) has changed:

    __m128i oldRect; // contains old left, top, right, bottom packed to 128 bits
    __m128i newRect; // contains new left, top, right, bottom packed to 128 bits

    __m128i xor = _mm_xor_si128(oldRect, newRect);

At this point, the resulting xor value will be all zeros if the rectangle hasn't changed. What is then the most efficient way of determining that? Currently I am doing this:

    if (xor.m128i_u64[0] | xor …
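The "<= SSE4.1" constraint in the title points at PTEST. A minimal sketch of both the SSE4.1 one-instruction test and an SSE2-only fallback (the function names are mine, not from the question):

    #include <emmintrin.h>
    #include <smmintrin.h>  // SSE4.1

    // SSE4.1: PTEST sets ZF when (a AND b) == 0, so testing x against
    // itself asks "are all 128 bits zero?" in a single instruction.
    int rect_unchanged(__m128i oldRect, __m128i newRect) {
        __m128i x = _mm_xor_si128(oldRect, newRect);
        return _mm_testz_si128(x, x);
    }

    // SSE2 fallback: bytewise compare, then inspect the 16-bit byte mask.
    int rect_unchanged_sse2(__m128i oldRect, __m128i newRect) {
        __m128i eq = _mm_cmpeq_epi8(oldRect, newRect);
        return _mm_movemask_epi8(eq) == 0xFFFF;
    }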

What's the difference between logical SSE intrinsics?

南笙酒味 submitted on 2019-12-17 16:06:31
Question: Is there any difference between the logical SSE intrinsics for different types? For example, taking the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions: Is there any difference between using one intrinsic or another (with appropriate type casting)? Are there any hidden costs, such as longer execution in some specific situation? These intrinsics map to three different x86 …
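A minimal sketch of the casting the question mentions; the cast intrinsics only reinterpret bits and emit no instructions. Whether ORPS versus POR ever costs extra is microarchitecture-dependent: some CPUs charge a bypass delay when a value crosses between the integer and floating-point execution domains.

    #include <emmintrin.h>

    // Bitwise OR performed "in the float domain" on integer data.
    // The result is bit-identical to _mm_or_si128(a, b).
    __m128i or_via_float_domain(__m128i a, __m128i b) {
        __m128 af = _mm_castsi128_ps(a);   // free reinterpret, no instruction
        __m128 bf = _mm_castsi128_ps(b);
        return _mm_castps_si128(_mm_or_ps(af, bf));  // compiles to ORPS
    }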

Fastest way to do horizontal vector sum with AVX instructions [duplicate]

半世苍凉 submitted on 2019-12-17 10:58:56
Question: This question already has answers here: Get sum of values stored in __m256d with SSE/AVX (2 answers). Closed 11 months ago.

I have a packed vector of four 64-bit floating-point values and would like the sum of the vector's elements. With SSE (and 32-bit floats) I could simply do:

    v_sum = _mm_hadd_ps(v_sum, v_sum);
    v_sum = _mm_hadd_ps(v_sum, v_sum);

Unfortunately, even though AVX features a _mm256_hadd_pd instruction, its result differs from the SSE version's. I …
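For reference, a sketch of the usual AVX reduction (the approach the linked duplicate recommends: fold the 256-bit vector down to 128 bits first, then finish with one shuffle and one add):

    #include <immintrin.h>

    double hsum256d(__m256d v) {
        __m128d lo = _mm256_castpd256_pd128(v);      // elements 0,1
        __m128d hi = _mm256_extractf128_pd(v, 1);    // elements 2,3
        lo = _mm_add_pd(lo, hi);                     // two partial sums
        __m128d swapped = _mm_unpackhi_pd(lo, lo);   // move element 1 down
        return _mm_cvtsd_f64(_mm_add_sd(lo, swapped));
    }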

SSE multiplication of 4 32-bit integers

拜拜、爱过 submitted on 2019-12-17 10:54:09
Question: How do I multiply four 32-bit integers by another four 32-bit integers? I couldn't find any instruction that can do it.

Answer 1: If you need signed 32x32-bit integer multiplication, then the following example at software.intel.com looks like it should do what you want:

    static inline __m128i muly(const __m128i &a, const __m128i &b)
    {
        __m128i tmp1 = _mm_mul_epu32(a, b);                 /* mul 2,0 */
        __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                     _mm_srli_si128(b, 4)); /* mul 3,1 */
        return _mm_unpacklo_epi32(
            _mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
            _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0))); /* shuffle results to [63..0] and pack */
    }
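Worth adding (my note, not part of the excerpt): if SSE4.1 is available, a single intrinsic does this directly, since only the low 32 bits of each product are kept anyway:

    #include <smmintrin.h>  // SSE4.1

    // PMULLD: low 32 bits of each 32x32-bit product, per lane.
    __m128i mul32(__m128i a, __m128i b) {
        return _mm_mullo_epi32(a, b);
    }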

Multithreaded & SIMD vectorized Mandelbrot in R using Rcpp & OpenMP

▼魔方 西西 submitted on 2019-12-17 06:16:24
Question: As an OpenMP & Rcpp performance test, I wanted to check how fast I could calculate the Mandelbrot set in R using the most straightforward and simple Rcpp + OpenMP implementation. Currently, what I did was:

    #include <Rcpp.h>
    #include <omp.h>
    // [[Rcpp::plugins(openmp)]]

    using namespace Rcpp;

    // [[Rcpp::export]]
    Rcpp::NumericMatrix mandelRcpp(const double x_min, const double x_max,
                                   const double y_min, const double y_max,
                                   const int res_x, const int res_y,
                                   const int nb_iter) {
        Rcpp::NumericMatrix …
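The excerpt stops at the matrix allocation. As a rough sketch of where such an implementation usually goes (not the author's code: the matrix name ret, the loop structure, and the schedule clause are all assumptions), the outer row loop is parallelized with OpenMP while each pixel runs the serial escape-time iteration:

    #include <Rcpp.h>
    #include <omp.h>
    // [[Rcpp::plugins(openmp)]]

    Rcpp::NumericMatrix mandelSketch(double x_min, double x_max,
                                     double y_min, double y_max,
                                     int res_x, int res_y, int nb_iter) {
        Rcpp::NumericMatrix ret(res_y, res_x);  // assumed output matrix
        #pragma omp parallel for schedule(dynamic)
        for (int r = 0; r < res_y; r++) {
            double ci = y_min + r * (y_max - y_min) / res_y;
            for (int c = 0; c < res_x; c++) {
                double cr = x_min + c * (x_max - x_min) / res_x;
                double zr = 0.0, zi = 0.0;
                int n = 0;
                while (zr * zr + zi * zi < 4.0 && n < nb_iter) {
                    double t = zr * zr - zi * zi + cr;
                    zi = 2.0 * zr * zi + ci;
                    zr = t;
                    ++n;
                }
                ret(r, c) = n;  // each thread writes distinct elements
            }
        }
        return ret;
    }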

Why vectorizing the loop does not improve performance

╄→尐↘猪︶ㄣ submitted on 2019-12-17 05:40:52
Question: I am investigating the effect of vectorization on program performance. To that end, I have written the following code:

    #include <stdio.h>
    #include <sys/time.h>
    #include <stdlib.h>

    #define LEN 10000000

    int main(){
        struct timeval stTime, endTime;

        double* a = (double*)malloc(LEN*sizeof(*a));
        double* b = (double*)malloc(LEN*sizeof(*b));
        double* c = (double*)malloc(LEN*sizeof(*c));

        int k;
        for(k = 0; k < LEN; k++){
            a[k] = rand();
            b[k] = rand();
        }

        gettimeofday(&stTime, NULL);

        for(k = 0; k …
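The excerpt cuts off inside the timed loop. A hypothetical completion (the element-wise multiply and the printout are assumptions, not the author's exact code):

    for (k = 0; k < LEN; k++)
        c[k] = a[k] * b[k];   /* assumed operation; the excerpt is truncated */

    gettimeofday(&endTime, NULL);
    printf("%.3f ms\n",
           (endTime.tv_sec  - stTime.tv_sec)  * 1e3 +
           (endTime.tv_usec - stTime.tv_usec) / 1e3);

Whatever the exact body, a loop of this shape streams roughly 24 bytes of memory per iteration for a single arithmetic operation, so it is typically bound by memory bandwidth rather than ALU throughput, which is the most common reason vectorizing it shows no measurable improvement.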

How to efficiently perform double/int64 conversions with SSE/AVX?

与世无争的帅哥 submitted on 2019-12-17 05:05:30
Question: SSE2 has instructions for converting vectors between single-precision floats and 32-bit integers:

    _mm_cvtps_epi32()
    _mm_cvtepi32_ps()

But there are no equivalents for double precision and 64-bit integers. In other words, the following are missing:

    _mm_cvtpd_epi64()
    _mm_cvtepi64_pd()

It seems that AVX doesn't have them either. What is the most efficient way to simulate these intrinsics?

Answer 1: There is no single instruction until AVX-512, which added conversion to/from 64-bit integers, signed or unsigned. …
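For inputs of limited magnitude there is a well-known magic-number workaround, widely circulated in answers to this question. A sketch, valid only for values in [-2^51, 2^51] and assuming the default round-to-nearest mode:

    #include <emmintrin.h>

    // The integer literal 0x0018000000000000 converts to the double
    // 2^52 + 2^51. Adding it forces the value's integer bits into the
    // mantissa, so raw bit patterns can be added/subtracted as int64.
    // Only valid for |x| <= 2^51.
    __m128i double_to_int64(__m128d x) {
        const __m128d magic = _mm_set1_pd(0x0018000000000000);
        x = _mm_add_pd(x, magic);
        return _mm_sub_epi64(_mm_castpd_si128(x), _mm_castpd_si128(magic));
    }

    __m128d int64_to_double(__m128i x) {
        const __m128d magic = _mm_set1_pd(0x0018000000000000);
        x = _mm_add_epi64(x, _mm_castpd_si128(magic));
        return _mm_sub_pd(_mm_castsi128_pd(x), magic);
    }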

What's missing/sub-optimal in this memcpy implementation?

房东的猫 submitted on 2019-12-17 04:21:32
Question: I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise on what I did and didn't think about, but here's some guy's implementation:

    __forceinline   // Since Size is usually known, most useless code
                    // will be optimized out if the function is inlined.
    void* myMemcpy(char* Dst, const char* Src, size_t Size)
    {
        void* start = Dst;
        for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
        {
            __m256i ymm = _mm256_loadu_si256(((const __m256i*&)Src)++); …
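The excerpt ends mid-loop. A self-contained sketch of the same pattern with the missing pieces filled in as assumptions (the store, the pointer advances, and the byte-wise tail are mine, not the author's exact code):

    #include <immintrin.h>
    #include <stddef.h>

    static void* sketchMemcpy(char* Dst, const char* Src, size_t Size) {
        void* start = Dst;
        // Main loop: 32 bytes per iteration via unaligned AVX load/store.
        for (; Size >= sizeof(__m256i); Size -= sizeof(__m256i)) {
            __m256i ymm = _mm256_loadu_si256((const __m256i*)Src);
            _mm256_storeu_si256((__m256i*)Dst, ymm);
            Src += sizeof(__m256i);
            Dst += sizeof(__m256i);
        }
        // Tail: copy the remaining 0..31 bytes one at a time.
        while (Size--) *Dst++ = *Src++;
        return start;
    }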

How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

ⅰ亾dé卋堺 submitted on 2019-12-17 04:00:12
Question: The intrinsic

    int mask = _mm256_movemask_epi8(__m256i s1)

creates a mask whose 32 bits correspond to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2, for example), I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector with the most significant bit of each byte containing the corresponding bit of the uint32_t mask. What is the best way to do this? Edit: I need to perform the inverse because the …
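One widely cited approach (a sketch, not the only answer): broadcast the mask, use an in-lane byte shuffle to route each mask byte to the eight output bytes it covers, then turn each selected bit into a full 0x00/0xFF byte, which sets the MSB as a side effect. Requires AVX2:

    #include <immintrin.h>
    #include <stdint.h>

    __m256i inverse_movemask_epi8(uint32_t mask) {
        __m256i vmask = _mm256_set1_epi32((int)mask);
        // Route mask byte i/8 to output byte i (shuffle is per 128-bit lane).
        const __m256i shuffle = _mm256_setr_epi64x(
            0x0000000000000000LL, 0x0101010101010101LL,
            0x0202020202020202LL, 0x0303030303030303LL);
        vmask = _mm256_shuffle_epi8(vmask, shuffle);
        // In each byte, set every bit except the one that byte is testing.
        const __m256i bit_mask = _mm256_set1_epi64x(0x7fbfdfeff7fbfdfeLL);
        vmask = _mm256_or_si256(vmask, bit_mask);
        // A byte becomes 0xFF (MSB set) iff its mask bit was set.
        return _mm256_cmpeq_epi8(vmask, _mm256_set1_epi64x(-1));
    }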

How to implement atoi using SIMD?

人盡茶涼 submitted on 2019-12-17 03:37:54
Question: I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library), which already has some SSE2 and SSE4.2 optimizations in other places. If it's a speed gain, multiple atoi results could be computed in parallel. The strings originally come from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling. The algorithm I came up with is the following: I can initialize a vector of length N in the …
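For the digit-combining core, a known SIMD technique uses the multiply-add instructions to fold adjacent digits together. A sketch assuming exactly eight ASCII digits at s, with validation and variable-length handling left out:

    #include <emmintrin.h>   // SSE2
    #include <tmmintrin.h>   // SSSE3: _mm_maddubs_epi16
    #include <stdint.h>

    static uint32_t parse8digits_sse(const char *s) {
        __m128i v = _mm_loadl_epi64((const __m128i *)s);   // 8 ASCII bytes
        v = _mm_sub_epi8(v, _mm_set1_epi8('0'));           // '0'..'9' -> 0..9
        // Pair bytes: each 16-bit lane becomes d0*10 + d1.
        v = _mm_maddubs_epi16(v, _mm_set1_epi16(0x010A));  // bytes {10, 1}
        // Pair words: each 32-bit lane becomes p0*100 + p1.
        v = _mm_madd_epi16(v, _mm_set1_epi32(0x00010064)); // words {100, 1}
        uint32_t hi4 = (uint32_t)_mm_cvtsi128_si32(v);                     // digits 0-3
        uint32_t lo4 = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(v, 4));  // digits 4-7
        return hi4 * 10000 + lo4;
    }

For example, "12345678" yields hi4 = 1234 and lo4 = 5678, so the function returns 12345678.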