AVX

How can I convert a vector of float to short int using avx instructions?

Submitted by ≯℡__Kan透↙ on 2019-12-22 09:09:17
Question: Basically, how can I write the equivalent of this with AVX2 intrinsics? Assume here that result_in_float is of type __m256, while result is of type short int* or short int[8]. for(i = 0; i < 8; i++) result[i] = (short int)result_in_float[i]; I know that floats can be converted to 32-bit integers using the __m256i _mm256_cvtps_epi32(__m256 m1) intrinsic, but I have no idea how to convert those 32-bit integers further to 16-bit integers. And I don't want just that, but also to store those

g++: No Such Instruction with AVX

Submitted by 白昼怎懂夜的黑 on 2019-12-22 07:07:17
Question: When I compiled a program I was writing in C++ (for the latest MacBook Pro, which of course supports the AVX instruction set), I got the following errors. I am using the latest release of g++ obtained from MacPorts. Do you have any ideas as to what I can do to fix the error without restricting the instruction sets available to the compiler? Is there any package in particular that I should try to update? g++-mp-4.7 -std=c++11 -Wall -Ofast -march=native -fno-rtti src/raw_to_json.cpp -o bin/raw

Efficient way of rotating a byte inside an AVX register

Submitted by 一曲冷凌霜 on 2019-12-22 04:49:16
Question: Summary/tl;dr: Is there any way to rotate a byte in a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8 bytes in a YMM register, I need to left-rotate 7 of them. Each byte needs to be rotated one bit more to the left than the previous one: the first byte should be rotated 0 bits and the seventh should be rotated 6 bits. Currently, I have an implementation that does this by [I use the 1-bit rotate as an example here] shifting the

Shift elements to the left of a SIMD register based on boolean mask

Submitted by 蓝咒 on 2019-12-22 00:36:03
Question: This question is related to this: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector. I would like to create an optimal function with this signature: __m256i PackLeft(__m256i inputVector, __m256i boolVector); The desired behaviour is that for an input of 64-bit ints like this: inputVector = {42, 17, 13, 3} boolVector = {true, false, true, false} it masks out all values that have false in the boolVector and then repacks the remaining values to the left. For the input above, the return value

Fastest 64-bit population count (Hamming weight)

Submitted by  ̄綄美尐妖づ on 2019-12-21 12:28:44
Question: I had to calculate the Hamming weight for a quite fast continuous flow of 64-bit data, and using the popcnt assembly instruction throws an exception on my Intel Core i7-4650U. I checked my bible, Hacker's Delight, and scanned the web for all kinds of algorithms (there have been a bunch out there since they started tackling this 'problem' at the birth of computing). I spent the weekend playing around with some ideas of my own and came up with these algorithms, where I'm almost at the speed I can move data

__m256d TRANSPOSE4 Equivalent?

Submitted by 自作多情 on 2019-12-21 05:48:04
Question: Intel provides _MM_TRANSPOSE4_PS to transpose a 4x4 matrix of vectors. I want to do the equivalent with __m256d. However, I can't seem to figure out how to use _mm256_shuffle_pd in the same manner. _MM_TRANSPOSE4_PS code: #define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) { \ __m128 tmp3, tmp2, tmp1, tmp0; \ \ tmp0 = _mm_shuffle_ps((row0), (row1), 0x44); \ tmp2 = _mm_shuffle_ps((row0), (row1), 0xEE); \ tmp1 = _mm_shuffle_ps((row2), (row3), 0x44); \ tmp3 = _mm_shuffle_ps((row2), (row3),

Half-precision floating-point arithmetic on Intel chips

Submitted by 假装没事ソ on 2019-12-21 04:57:32
Question: Is it possible to perform half-precision floating-point arithmetic on Intel chips? I know how to load/store/convert half-precision floating-point numbers [1], but I do not know how to add/multiply them without converting them to single-precision floating-point numbers. [1] https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats Answer 1: Is it possible to perform half-precision floating-point arithmetic on Intel chips? Yes, apparently the on-chip GPU in Skylake and later

Using AVX instructions disables exp() optimization?

Submitted by 主宰稳场 on 2019-12-21 04:51:40
Question: I am writing a feed-forward net in VC++ using AVX intrinsics. I am invoking this code via P/Invoke in C#. When I call a function that runs a large loop including the function exp(), my performance is ~1000 ms for a loop size of 160M. As soon as I call any function that uses AVX intrinsics, and then subsequently use exp(), my performance drops to about ~8000 ms for the same operation. Note that the function calculating exp() is standard C, and the call that uses the AVX intrinsics can be

Are older SIMD-versions available when using newer ones?

Submitted by 谁说胖子不能爱 on 2019-12-21 03:59:51
Question: When I can use SSE3 or AVX, are the older SSE versions such as SSE2 or MMX then available, or do I still need to check for them separately? Answer 1: In general, these have been additive, but keep in mind that there are differences between Intel and AMD support for these over the years. If you have AVX, then you can assume SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2 as well. Remember that to use AVX you also need to validate that the OSXSAVE CPUID bit is set, to ensure the OS you are using actually supports