simd

How to transpose a 16x16 matrix using SIMD instructions?

安稳与你 submitted on 2019-12-17 23:47:01
Question: I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations. Assuming there's a matrix represented by 16 SIMD registers, each holding 16 32-bit integers (corresponding to one row), how can I transpose the matrix with purely SIMD instructions? There are already solutions for transposing 4x4 or 8x8 matrices with SSE and AVX2 respectively, but I couldn't figure out how to extend them to 16x16 with AVX-512. Any ideas?
Answer 1: For two operand
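
For reference, the 4x4 SSE case mentioned in the question can be sketched as two rounds of unpack instructions. This is only an illustration of the small building block, not the 16x16 AVX-512 answer; the helper name `transpose4x4_epi32` and the row-major array layout are assumptions made for the example. A 16x16 version would need more rounds, and since the 512-bit unpack instructions work within 128-bit lanes, it would also need cross-lane shuffles that are not shown here.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Transpose a 4x4 matrix of 32-bit integers stored row-major in `in`
   and write the result to `out`. Hypothetical helper for illustration. */
void transpose4x4_epi32(const int32_t in[16], int32_t out[16])
{
    __m128i r0 = _mm_loadu_si128((const __m128i *)(in + 0));   /* a0 a1 a2 a3 */
    __m128i r1 = _mm_loadu_si128((const __m128i *)(in + 4));   /* b0 b1 b2 b3 */
    __m128i r2 = _mm_loadu_si128((const __m128i *)(in + 8));   /* c0 c1 c2 c3 */
    __m128i r3 = _mm_loadu_si128((const __m128i *)(in + 12));  /* d0 d1 d2 d3 */

    /* Round 1: interleave 32-bit elements of neighbouring rows. */
    __m128i t0 = _mm_unpacklo_epi32(r0, r1);  /* a0 b0 a1 b1 */
    __m128i t1 = _mm_unpackhi_epi32(r0, r1);  /* a2 b2 a3 b3 */
    __m128i t2 = _mm_unpacklo_epi32(r2, r3);  /* c0 d0 c1 d1 */
    __m128i t3 = _mm_unpackhi_epi32(r2, r3);  /* c2 d2 c3 d3 */

    /* Round 2: interleave 64-bit halves to finish the columns. */
    _mm_storeu_si128((__m128i *)(out + 0),  _mm_unpacklo_epi64(t0, t2));  /* a0 b0 c0 d0 */
    _mm_storeu_si128((__m128i *)(out + 4),  _mm_unpackhi_epi64(t0, t2));  /* a1 b1 c1 d1 */
    _mm_storeu_si128((__m128i *)(out + 8),  _mm_unpacklo_epi64(t1, t3));  /* a2 b2 c2 d2 */
    _mm_storeu_si128((__m128i *)(out + 12), _mm_unpackhi_epi64(t1, t3));  /* a3 b3 c3 d3 */
}
```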

How to compare __m128 types?

给你一囗甜甜゛ submitted on 2019-12-17 22:56:08
Question: Given __m128 a; __m128 b; how do I code a != b? Which should I use: _mm_cmpneq_ps or _mm_cmpneq_ss? How do I process the result? I can't find adequate docs.
Answer 1: You should probably use _mm_cmpneq_ps. However, the interpretation of comparisons is a little different in SIMD code than in scalar code. Do you want to test for any corresponding element not being equal, or for all corresponding elements not being equal? To test the results of the 4 comparisons from _mm_cmpneq_ps you can use _mm_movemask_epi8 .
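
The "any" versus "all" distinction from the answer can be sketched as below. Note that this sketch uses _mm_movemask_ps rather than the _mm_movemask_epi8 named in the answer, because it takes a __m128 directly and yields one bit per float lane; the function names are made up for the example. Also keep in mind that _mm_cmpneq_ps reports "not equal" for a lane if either input is NaN.

```c
#include <xmmintrin.h>  /* SSE */

/* Nonzero if at least one of the four lanes differs. */
int any_not_equal(__m128 a, __m128 b)
{
    /* Each lane of the compare result is all-ones (true) or all-zeros. */
    __m128 neq = _mm_cmpneq_ps(a, b);
    /* movemask collects the sign bit of each lane into bits 0..3. */
    return _mm_movemask_ps(neq) != 0;
}

/* Nonzero only if all four lanes differ. */
int all_not_equal(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpneq_ps(a, b)) == 0xF;
}
```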

The indices of non-zero bytes of an SSE/AVX register

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-17 20:45:59
Question: If an SSE/AVX register's value is such that all its bytes are either 0 or 1, is there any way to efficiently get the indices of all non-zero elements? For example, if the xmm value is | r0=0 | r1=1 | r2=0 | r3=1 | r4=0 | r5=1 | r6=0 |...| r14=0 | r15=1 | the result should be something like (1, 3, 5, ... , 15). The result should be placed in another __m128i variable or a char[16] array. If it helps, we can assume that the register's value is such that all bytes are either 0 or some constant nonzero value
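
As a baseline for the char[16] variant of the result, one simple approach is to turn the register into a 16-bit mask and walk the mask in scalar code. This is a sketch for clarity rather than a fully vectorized solution, and the name `nonzero_byte_indices` is an assumption made for the example.

```c
#include <emmintrin.h>  /* SSE2 */

/* Writes the indices of the non-zero bytes of v into out[] (capacity 16)
   and returns how many indices were written. */
int nonzero_byte_indices(__m128i v, unsigned char out[16])
{
    /* 0xFF in every byte that is zero, 0x00 elsewhere. */
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* One bit per byte; invert so that bit i set <=> byte i is non-zero. */
    unsigned mask = ~(unsigned)_mm_movemask_epi8(is_zero) & 0xFFFFu;

    int n = 0;
    for (int i = 0; i < 16; i++) {
        if (mask & (1u << i))
            out[n++] = (unsigned char)i;
    }
    return n;
}
```

With the example register from the question (bytes 0, 1, 0, 1, ...), this fills out[] with 1, 3, 5, ..., 15 and returns 8.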

Fast dot product using SSE/AVX intrinsics

ⅰ亾dé卋堺 submitted on 2019-12-17 19:42:57
Question: I am looking for a fast way to calculate the dot product of vectors with 3 or 4 components. I tried several things, but most examples online use an array of floats while our data structure is different. We use structs which are 16-byte aligned. Code excerpt (simplified):
    struct float3 { float x, y, z, w; }; // 4th component unused here
    struct float4 { float x, y, z, w; };
In previous tests (using the SSE4 dot product intrinsic or FMA) I could not get a speedup compared to using the following
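
For comparison, a plain SSE dot product over the 4-component struct from the question might look like the sketch below. It is not the code the question goes on to show (that part is cut off), and `dot4` is a name chosen for the example. It uses an unaligned load so that it does not depend on the 16-byte alignment the question mentions; for the 3-component struct, the unused w would have to be zero in at least one operand for the same routine to give the right answer.

```c
#include <xmmintrin.h>  /* SSE */

struct float4 { float x, y, z, w; };

/* Dot product of two 4-component vectors using one multiply and a
   two-step horizontal add. */
float dot4(const struct float4 *a, const struct float4 *b)
{
    __m128 va = _mm_loadu_ps((const float *)a);
    __m128 vb = _mm_loadu_ps((const float *)b);
    __m128 p  = _mm_mul_ps(va, vb);                       /* px py pz pw */

    /* Swap neighbours and add: lanes become px+py, px+py, pz+pw, pz+pw. */
    __m128 shuf = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(p, shuf);

    /* Bring the upper pair down to lane 0 and add it to the lower pair. */
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```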

Efficient SSE NxN matrix multiplication

孤人 submitted on 2019-12-17 19:36:49
Question: I'm trying to implement an SSE version of large matrix-by-matrix multiplication. I'm looking for an efficient algorithm based on SIMD implementations. My desired method looks like: A(n x m) * B(m x k) = C(n x k), and all matrices are considered to be 16-byte aligned float arrays. I searched the net and found some articles describing 8x8 multiplication and even smaller. I really need it to be as efficient as possible and I don't want to use the Eigen library or similar libraries. (Only SSE3 to be more
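
To make the shape of the problem concrete, here is a minimal SSE sketch of A(n x m) * B(m x k) = C(n x k) that computes four output columns at a time. It assumes row-major storage and that k is a multiple of 4, uses unaligned loads so it does not rely on the stated alignment, and does no cache blocking, so it should be read as a starting point and not as the efficient kernel the question is asking for. The name `matmul_sse` is made up for the example.

```c
#include <xmmintrin.h>  /* SSE */
#include <stddef.h>

/* C (n x k) = A (n x m) * B (m x k), all row-major.
   Assumption for this sketch: k is a multiple of 4. */
void matmul_sse(const float *A, const float *B, float *C,
                size_t n, size_t m, size_t k)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < k; j += 4) {
            __m128 acc = _mm_setzero_ps();
            for (size_t p = 0; p < m; p++) {
                /* Broadcast A[i][p] and multiply it with four
                   consecutive elements of row p of B. */
                __m128 a = _mm_set1_ps(A[i * m + p]);
                __m128 b = _mm_loadu_ps(&B[p * k + j]);
                acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
            }
            _mm_storeu_ps(&C[i * k + j], acc);
        }
    }
}
```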

How to quickly count bits into separate bins in a series of ints on Sandy Bridge? [duplicate]

一世执手 submitted on 2019-12-17 19:18:53
Question: This question already has answers here: Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2 (5 answers). Closed last month. Update: Please read the code; it is NOT about counting bits in one int. Is it possible to improve the performance of the following code with some clever assembler?
    uint bit_counter[64];
    void Count(uint64 bits) {
        bit_counter[0] += (bits >> 0) & 1;
        bit_counter[1] += (bits >> 1) & 1;
        // ..
        bit_counter[63] += (bits >> 63) & 1;
    }
Count is in the
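
For readers who want something to benchmark against, the unrolled function in the question can be written as a loop, assuming the elided lines follow the same pattern as the three that are shown. This is only the scalar baseline, not an optimized answer; `count_bits` and the fixed-width types are substitutions for the question's `Count`, `uint` and `uint64`.

```c
#include <stdint.h>

/* One counter per bit position, as in the question. */
uint32_t bit_counter[64];

/* Adds bit i of `bits` to bit_counter[i] for every position i.
   Assumes the lines elided in the question repeat the visible pattern. */
void count_bits(uint64_t bits)
{
    for (int i = 0; i < 64; i++)
        bit_counter[i] += (uint32_t)((bits >> i) & 1);
}
```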

SSE: convert short integer to float

99封情书 submitted on 2019-12-17 18:55:33
Question: I want to convert an array of unsigned short numbers to float using SSE. Let's say
    __m128i xVal;     // Has 8 16-bit unsigned integers
    __m128 y1, y2;    // 2 xmm registers for 8 float values
I want the first 4 uint16 values in y1 and the next 4 in y2. I need to know which SSE intrinsics to use.
Answer 1: You need to first unpack your vector of 8 x 16-bit unsigned shorts into two vectors of 32-bit unsigned ints, then convert each of these vectors to float: __m128i xlo = _mm_unpacklo_epi16(x, _mm_set1_epi16(0)); _
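
The two steps described in the answer (zero-extend by unpacking against a zero vector, then convert) could be put together roughly as follows. The answer's own code is cut off after its first line, so this is a reconstruction of the described idea and not the answer's text; the wrapper `u16x8_to_float` and its array interface are assumptions for the example. The signed conversion _mm_cvtepi32_ps is safe here because zero-extended 16-bit values never exceed 65535.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Convert 8 unsigned 16-bit integers to 8 floats. */
void u16x8_to_float(const uint16_t in[8], float out[8])
{
    __m128i x    = _mm_loadu_si128((const __m128i *)in);
    __m128i zero = _mm_setzero_si128();

    /* Interleaving with zero widens each 16-bit value to 32 bits. */
    __m128i lo = _mm_unpacklo_epi16(x, zero);  /* elements 0..3 */
    __m128i hi = _mm_unpackhi_epi16(x, zero);  /* elements 4..7 */

    _mm_storeu_ps(out,     _mm_cvtepi32_ps(lo));
    _mm_storeu_ps(out + 4, _mm_cvtepi32_ps(hi));
}
```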

Fast 24-bit array -> 32-bit array conversion?

喜欢而已 submitted on 2019-12-17 18:20:04
Question: Quick summary: I have an array of 24-bit values. Any suggestions on how to quickly expand the individual 24-bit array elements into 32-bit elements? Details: I'm processing incoming video frames in real time using pixel shaders in DirectX 10. A stumbling block is that my frames come in from the capture hardware with 24-bit pixels (either as YUV or RGB images), but DX10 takes 32-bit pixel textures. So I have to expand the 24-bit values to 32 bits before I can load them into the GPU. I
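
The expansion itself is simple to state in scalar code, which can serve as a correctness reference for any faster version. In this sketch the function name, the byte order within a pixel, and the `fill` value written to the fourth byte are all assumptions, since the question does not say what the padding byte should contain.

```c
#include <stdint.h>
#include <stddef.h>

/* Expand `count` packed 3-byte pixels in `src` to 4-byte pixels in `dst`.
   The extra byte of every output pixel is set to `fill` (an assumption;
   for example 0xFF for an opaque alpha channel). */
void expand24to32(const uint8_t *src, uint8_t *dst, size_t count, uint8_t fill)
{
    for (size_t i = 0; i < count; i++) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = fill;
    }
}
```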

Fast counting the number of set bits in __m128i register

社会主义新天地 submitted on 2019-12-17 18:09:29
Question: I need to count the number of set bits in a __m128i register. In particular, I need to write two functions that count the bits of the register in the following ways: the total number of set bits in the register, and the number of set bits in each byte of the register. Are there intrinsic functions that can perform, wholly or partially, the above operations?
Answer 1: Here is some code I used in an old project (there is a research paper about it). The function popcnt8 below
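
Both requested counts can be sketched with SSE2 alone. This is not the popcnt8 code from the answer (which is cut off here); it is a generic bit-parallel reduction for the per-byte count and a GCC/Clang builtin for the total, and the function names are chosen for the example.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Total number of set bits in the register.
   Relies on the GCC/Clang builtin __builtin_popcountll. */
int popcount128(__m128i v)
{
    uint64_t h[2];
    _mm_storeu_si128((__m128i *)h, v);
    return __builtin_popcountll(h[0]) + __builtin_popcountll(h[1]);
}

/* Number of set bits in each byte, returned as 16 byte-sized counts. */
__m128i popcount_per_byte(__m128i v)
{
    const __m128i m1 = _mm_set1_epi8(0x55);
    const __m128i m2 = _mm_set1_epi8(0x33);
    const __m128i m4 = _mm_set1_epi8(0x0F);

    /* There is no 8-bit shift, so shift 16-bit lanes and mask away the
       bits that crossed over from the neighbouring byte. */
    v = _mm_sub_epi8(v, _mm_and_si128(_mm_srli_epi16(v, 1), m1));  /* 2-bit sums */
    v = _mm_add_epi8(_mm_and_si128(v, m2),
                     _mm_and_si128(_mm_srli_epi16(v, 2), m2));     /* 4-bit sums */
    v = _mm_and_si128(_mm_add_epi8(v, _mm_srli_epi16(v, 4)), m4);  /* 8-bit sums */
    return v;
}
```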

GNU C native vectors: how to broadcast a scalar, like x86's _mm_set1_epi16

空扰寡人 submitted on 2019-12-17 16:59:28
Question: How do I write a portable GNU C builtin vectors version of this, which doesn't depend on the x86 set1 intrinsic?
    typedef uint16_t v8su __attribute__((vector_size(16)));
    v8su set1_u16_x86(uint16_t scalar) {
        return (v8su)_mm_set1_epi16(scalar); // cast needed for gcc
    }
Surely there must be a better way than
    v8su set1_u16(uint16_t s) { return (v8su){s,s,s,s, s,s,s,s}; }
I don't want to write an AVX2 version of that for broadcasting a single byte! Even a gcc-only or clang-only answer to this part
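
One way to avoid spelling out every element, offered here only as a sketch, is to fill the vector with a loop over its subscripts, which both GCC and Clang accept for vector_size types. Whether the compiler turns the loop into a single broadcast instruction depends on the compiler version and optimization flags, so this is worth checking in the generated assembly. The name `set1_u16_loop` is made up for the example. GCC's vector extension documentation also describes binary operations between a vector and a scalar in which the scalar is broadcast, which may be another route.

```c
#include <stdint.h>

typedef uint16_t v8su __attribute__((vector_size(16)));

/* Broadcast a scalar to all eight 16-bit elements without any
   x86-specific intrinsic, using vector subscripting. */
v8su set1_u16_loop(uint16_t s)
{
    v8su v;
    /* Element count derived from the types, so the same pattern works
       for other element widths or vector sizes. */
    for (unsigned i = 0; i < sizeof(v) / sizeof(v[0]); i++)
        v[i] = s;
    return v;
}
```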