AVX

How can I convert a vector of float to short int using avx instructions?

Submitted by ≯℡__Kan透↙ on 2019-12-22 09:09:17
Question: Basically, how can I write the equivalent of this with AVX2 intrinsics? Assume here that result_in_float is of type __m256, while result is of type short int* or short int[8]. for(i = 0; i < 8; i++) result[i] = (short int)result_in_float[i]; I know that floats can be converted to 32-bit integers using the __m256i _mm256_cvtps_epi32(__m256 m1) intrinsic, but I have no idea how to convert those 32-bit integers further to 16-bit integers. And I don't want just that, but also to store those

g++: No Such Instruction with AVX

Submitted by 白昼怎懂夜的黑 on 2019-12-22 07:07:17
Question: When I compiled a program I was writing in C++ (for the latest MacBook Pro, which of course supports the AVX instruction set), I got the following errors. I am using the latest release of g++ obtained from MacPorts. Do you have any ideas as to what I can do to fix the error without restricting the instruction sets available to the compiler? Is there any package in particular that I should try to update? g++-mp-4.7 -std=c++11 -Wall -Ofast -march=native -fno-rtti src/raw_to_json.cpp -o bin/raw

Efficient way of rotating a byte inside an AVX register

Submitted by 一曲冷凌霜 on 2019-12-22 04:49:16
Question: Summary/tl;dr: Is there any way to rotate a byte in a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8 bytes in a YMM register, I need to left-rotate 7 of them. Each byte needs to be rotated one bit more to the left than the previous one: the first byte should be rotated 0 bits and the seventh should be rotated 6 bits. Currently, I have an implementation that does this by [I use the 1-bit rotate as an example here] shifting the

Shift elements to the left of a SIMD register based on boolean mask

Submitted by 蓝咒 on 2019-12-22 00:36:03
Question: This question is related to this: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector. I would like to create an optimal function with this signature: __m256i PackLeft(__m256i inputVector, __m256i boolVector); The desired behaviour is that for an input of 64-bit ints like this: inputVector = {42, 17, 13, 3} boolVector = {true, false, true, false} it masks out all values that have false in the boolVector and then repacks the remaining values to the left. For the input above, the return value

Fastest 64-bit population count (Hamming weight)

Submitted by  ̄綄美尐妖づ on 2019-12-21 12:28:44
Question: I had to calculate the Hamming weight for a quite fast continuous flow of 64-bit data, and using the popcnt assembly instruction throws an exception on my Intel Core i7-4650U. I checked my bible, Hacker's Delight, and scanned the web for all kinds of algorithms (there have been a bunch out there since they started tackling this 'problem' at the birth of computing). I spent the weekend playing around with some ideas of my own and came up with these algorithms, where I'm almost at the speed I can move data

__m256d TRANSPOSE4 Equivalent?

Submitted by 自作多情 on 2019-12-21 05:48:04
Question: Intel provides _MM_TRANSPOSE4_PS to transpose a 4x4 matrix of vectors. I want to do the equivalent with __m256d. However, I can't seem to figure out how to use _mm256_shuffle_pd in the same manner. _MM_TRANSPOSE4_PS code: #define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) { \ __m128 tmp3, tmp2, tmp1, tmp0; \ \ tmp0 = _mm_shuffle_ps((row0), (row1), 0x44); \ tmp2 = _mm_shuffle_ps((row0), (row1), 0xEE); \ tmp1 = _mm_shuffle_ps((row2), (row3), 0x44); \ tmp3 = _mm_shuffle_ps((row2), (row3),

Half-precision floating-point arithmetic on Intel chips

Submitted by 假装没事ソ on 2019-12-21 04:57:32
Question: Is it possible to perform half-precision floating-point arithmetic on Intel chips? I know how to load/store/convert half-precision floating-point numbers [1], but I do not know how to add/multiply them without converting them to single-precision floating-point numbers. [1] https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats Answer 1: Is it possible to perform half-precision floating-point arithmetic on Intel chips? Yes, apparently the on-chip GPU in Skylake and later

Using AVX instructions disables exp() optimization?

Submitted by 主宰稳场 on 2019-12-21 04:51:40
Question: I am writing a feed-forward net in VC++ using AVX intrinsics. I am invoking this code via P/Invoke in C#. When I call a function that runs a large loop including the function exp(), my performance is ~1000 ms for a loop size of 160M. As soon as I call any function that uses AVX intrinsics, and then subsequently use exp(), my performance drops to about ~8000 ms for the same operation. Note that the function calculating exp() is standard C, and the call that uses the AVX intrinsics can be

Are older SIMD-versions available when using newer ones?

Submitted by 谁说胖子不能爱 on 2019-12-21 03:59:51
Question: When I can use SSE3 or AVX, are the older SSE versions such as SSE2 or MMX then available, or do I still need to check for them separately? Answer 1: In general, these have been additive, but keep in mind that there are differences between Intel and AMD support for these over the years. If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2 as well. Remember that to use AVX you also need to validate that the OSXSAVE CPUID bit is set, to ensure the OS you are using actually supports