intrinsics

Store __m256i to integer

北战南征 submitted on 2019-12-02 01:03:05
How can I store a __m256i value into an integer array? I know that for floats there is _mm256_store_ps(float *a, __m256 b), where the first argument is the output array. For integers I found only _mm256_store_si256(__m256i *a, __m256i b), where both arguments are of the __m256i data type. Is it enough to do something like this?

    int *X = (int*) _mm_malloc(N * sizeof(*X), 32);

(I am using this as an argument to a function and I want to obtain its values.) Inside the function:

    __m256i *Xmmtype = (__m256i*) X;
    // fill output
    _mm256_store_si256(&Xmmtype[i], T);  // T is __m256i

Is this OK? -----UPDATED -----
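
The pattern asked about above can be sketched as follows. This is a minimal sketch, not the accepted answer: the helper name fill_ints_avx is hypothetical, the target("avx") attribute is an assumption that the compiler is GCC or Clang, and the output buffer is assumed to be 32-byte aligned with a length that is a multiple of 8.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: writes `value` into every element of `out`.
// Assumes `out` is 32-byte aligned and `n` is a multiple of 8.
// The target attribute (GCC/Clang specific) enables AVX for this function only.
__attribute__((target("avx")))
void fill_ints_avx(int* out, std::size_t n, int value) {
    assert(n % 8 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 32 == 0);

    const __m256i v = _mm256_set1_epi32(value);  // eight copies of `value`
    for (std::size_t i = 0; i < n; i += 8) {
        // The int pointer is cast to __m256i* only to match the intrinsic's
        // signature; the store writes 32 bytes, i.e. eight ints.
        _mm256_store_si256(reinterpret_cast<__m256i*>(out + i), v);
    }
}
```

If the buffer's alignment cannot be guaranteed, _mm256_storeu_si256 takes the same arguments without the alignment requirement.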

Test case for adcx and adox

隐身守侯 submitted on 2019-12-02 00:42:23
I'm testing Intel ADX add with carry and add with overflow to pipeline adds on large integers. I'd like to see what the expected code generation should look like. From "_addcarry_u64 and _addcarryx_u64 with MSVC and ICC", I thought this would be a suitable test case:

    #include <stdint.h>
    #include <x86intrin.h>
    #include "immintrin.h"

    int main(int argc, char* argv[])
    {
    #define MAX_ARRAY 100
        uint8_t c1 = 0, c2 = 0;
        uint64_t a[MAX_ARRAY]={0}, b[MAX_ARRAY]={0}, res[MAX_ARRAY];
        for(unsigned int i=0; i< MAX_ARRAY; i++){
            c1 = _addcarryx_u64(c1, res[i], a[i], (unsigned long long int*)&res[i]);
            c2 =
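
For comparison, a single carry chain over multi-limb integers can be sketched as below. This is a simplified sketch rather than the two interleaved ADCX/ADOX chains the question is after: it uses _addcarry_u64 instead of _addcarryx_u64, the function name add_limbs is hypothetical, and it assumes an x86-64 target, since the 64-bit intrinsic is not available in 32-bit builds.

```cpp
#include <immintrin.h>
#include <cstddef>

// Hypothetical helper: res = a + b over n 64-bit limbs, least significant
// limb first. Returns the carry out of the most significant limb.
unsigned char add_limbs(const unsigned long long* a,
                        const unsigned long long* b,
                        unsigned long long* res,
                        std::size_t n) {
    unsigned char carry = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Adds a[i] + b[i] + carry, stores the low 64 bits in res[i],
        // and returns the new carry flag.
        carry = _addcarry_u64(carry, a[i], b[i], &res[i]);
    }
    return carry;
}
```

Using unsigned long long for the limbs avoids the pointer cast in the question's code, because that is the type the intrinsic's output parameter expects.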

_mm_extract_epi8(…) intrinsic that takes a non-literal integer as argument

◇◆丶佛笑我妖孽 submitted on 2019-12-01 22:45:51
Question: I've lately been using the SSE intrinsic int _mm_extract_epi8 (__m128i src, const int ndx) that, according to the reference, "extracts an integer byte from a packed integer array element selected by index". This is exactly what I want. However, I determine the index via a _mm_cmpestri on a __m128i, which performs a packed comparison of string data with explicit lengths and generates the index. The range of this index is 0..16, where 0..15 represents a valid index and 16 means that no index was

_umul128 on Windows 32 bits

只愿长相守 submitted on 2019-12-01 22:36:39
Question: In Visual C++, _umul128 is undefined when targeting 32-bit Windows. How can two unsigned 64-bit integers be multiplied when targeting Win32? The solution only needs to work on Visual C++ 2017 targeting 32-bit Windows.

Answer 1: This answer has a version of the xmrrig function from the other answer optimized for MSVC 32-bit mode. The original is fine with other compilers, especially clang. I looked at MSVC's output for @Augusto's function, and it's really bad. Using __emulu for 32x32 => 64b

_mm_extract_epi8(…) intrinsic that takes a non-literal integer as argument

别说谁变了你拦得住时间么 submitted on 2019-12-01 21:39:10
I've lately been using the SSE intrinsic int _mm_extract_epi8 (__m128i src, const int ndx) that, according to the reference, "extracts an integer byte from a packed integer array element selected by index". This is exactly what I want. However, I determine the index via a _mm_cmpestri on a __m128i, which performs a packed comparison of string data with explicit lengths and generates the index. The range of this index is 0..16, where 0..15 represents a valid index and 16 means that no index was found. Now to extract the integer at the index position I thought of doing the following: const int index
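
Because _mm_extract_epi8 takes its index as a compile-time constant, one common workaround for a runtime index is to spill the vector to memory and index the bytes there. The sketch below shows that substitute technique, not the approach from the truncated question; the name extract_byte and the -1 return value for an out-of-range index (such as the 16 that means "not found") are assumptions made for illustration.

```cpp
#include <emmintrin.h>  // SSE2
#include <cstdint>

// Hypothetical helper: returns byte `index` of `v` zero-extended to int,
// or -1 (an assumed sentinel) when `index` is outside 0..15.
int extract_byte(__m128i v, int index) {
    if (index < 0 || index > 15) {
        return -1;  // e.g. index == 16, meaning no match was found
    }
    alignas(16) std::uint8_t bytes[16];
    // Store all 16 bytes to the stack, then pick one with an ordinary
    // array access, which accepts a runtime index.
    _mm_store_si128(reinterpret_cast<__m128i*>(bytes), v);
    return bytes[index];
}
```

This needs only SSE2 and trades the single extract instruction for a store followed by a byte load.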

set individual bit in AVX register (__m256i), need “random access” operator

微笑、不失礼 submitted on 2019-12-01 21:19:56
So, I want to set an individual bit of a __m256i register. Say my __m256i contains [ 1 0 1 0 | 1 0 1 0 | ... | 1 0 1 0 ]; how do I set and unset the n-th bit?

ErmIg: This is an implementation of a function that can set an individual bit inside a vector:

    #include <immintrin.h>
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    void SetBit(__m256i & vector, size_t position, bool value)
    {
        assert(position <= 255);
        // Build a 256-bit mask with only the requested bit set.
        uint8_t lut[32] = { 0 };
        lut[position >> 3] = 1 << (position & 7);
        __m256i mask = _mm256_loadu_si256((__m256i*)lut);
        if (value)
            vector = _mm256_or_si256(mask, vector);      // set the bit
        else
            vector = _mm256_andnot_si256(mask, vector);  // clear the bit
    }

How to use RDRAND intrinsics?

谁都会走 submitted on 2019-12-01 20:23:33
I was looking at H.J. Lu's "PATCH: Update x86 rdrand intrinsics". I can't tell if I should be using _rdrand_u64, _rdrand64_step, or if there are other function(s). There do not appear to be test cases written for them. There also seems to be a lack of man pages (from Ubuntu 14, GCC 4.8.4):

    $ man -k rdrand
    rdrand: nothing appropriate.

How does one use the RDRAND intrinsics to generate, say, a block of 32 bytes? A related question is "RDRAND and RDSEED intrinsics GCC and Intel C++", but it does not tell me how to use them or how to generate a block. If you look at <immintrin.h> (mine is in `

_umul128 on Windows 32 bits

て烟熏妆下的殇ゞ submitted on 2019-12-01 20:21:09
In Visual C++, _umul128 is undefined when targeting 32-bit Windows. How can two unsigned 64-bit integers be multiplied when targeting Win32? The solution only needs to work on Visual C++ 2017 targeting 32-bit Windows.

This answer has a version of the xmrrig function from the other answer optimized for MSVC 32-bit mode. The original is fine with other compilers, especially clang. I looked at MSVC's output for @Augusto's function, and it's really bad. Using __emulu for 32x32 => 64b multiplication improved it significantly (because MSVC is dumb and doesn't optimize uint64_t * uint64_t = uint64
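
The underlying technique, building a 64x64 => 128-bit product from four 32x32 => 64-bit partial products, can be sketched in portable C++ as below. This is not the function from the answer: the name mul64x64_128 and its signature are hypothetical, and the partial products use ordinary multiplication where the answer reports that __emulu gives better code on 32-bit MSVC.

```cpp
#include <cstdint>

// Hypothetical helper: returns the low 64 bits of a * b and writes the
// high 64 bits to *hi, using only 32x32 => 64-bit partial products.
std::uint64_t mul64x64_128(std::uint64_t a, std::uint64_t b, std::uint64_t* hi) {
    const std::uint64_t a_lo = static_cast<std::uint32_t>(a);
    const std::uint64_t a_hi = a >> 32;
    const std::uint64_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint64_t b_hi = b >> 32;

    const std::uint64_t p0 = a_lo * b_lo;  // contributes to bits 0..63
    const std::uint64_t p1 = a_lo * b_hi;  // contributes to bits 32..95
    const std::uint64_t p2 = a_hi * b_lo;  // contributes to bits 32..95
    const std::uint64_t p3 = a_hi * b_hi;  // contributes to bits 64..127

    // Sum everything that lands in bits 32..63; this cannot overflow
    // because it is at most three 32-bit values added together.
    const std::uint64_t mid = (p0 >> 32)
                            + static_cast<std::uint32_t>(p1)
                            + static_cast<std::uint32_t>(p2);

    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return (mid << 32) | static_cast<std::uint32_t>(p0);
}
```

Each partial product only multiplies values that fit in 32 bits, which is the property that lets a 32-bit compiler map it onto a single widening multiply.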

AVX2 expand contiguous elements to a sparse vector based on a condition? (like AVX512 VPEXPANDD)

雨燕双飞 submitted on 2019-12-01 19:29:43
Question: Does anyone know how to vectorize the following code?

    uint32_t r[8];
    uint16_t* ptr;
    for (int j = 0; j < 8; ++j)
        if (r[j] < C)
            r[j] = *(ptr++);

It's basically a masked gather operation. The auto-vectorizer can't deal with this. If ptr were a uint32_t*, it should be directly realizable with _mm256_mask_i32gather_epi32. But even then, how do you generate the correct index vector? And wouldn't it be faster to just use a packed load and shuffle the result anyway (requiring a similar index vector)?

Translating SSE to Neon: How to pack and then extract 32bit result

帅比萌擦擦* submitted on 2019-12-01 18:19:12
I have to translate the following instruction from SSE to Neon:

    uint32_t a = _mm_cvtsi128_si32(_mm_shuffle_epi8(a, SHUFFLE_MASK));

where:

    static const __m128i SHUFFLE_MASK = _mm_setr_epi8(3, 7, 11, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1);

So basically I have to take the 4th, 8th, 12th and 16th bytes from the register and put them into a uint32_t. It looks like a packing instruction (in SSE I seem to remember I used shuffle because it saves one instruction compared to packing; this example shows the use of packing instructions). How does this operation translate to Neon? Should I use