simd

OpenCL distribution

Submitted by 我与影子孤独终老i on 2019-12-18 16:55:10
Question: I'm currently developing an OpenCL application for a very heterogeneous set of computers (using JavaCL, to be specific). To maximize performance I want to use a GPU when one is available, and otherwise fall back to the CPU and use SIMD instructions. My plan is to implement the OpenCL code using vector types, because my understanding is that this allows CPUs to vectorize the instructions and use SIMD. My question, however, is which OpenCL implementation to use. E.g. …
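As a rough illustration of the vector-type approach the question describes, a minimal OpenCL C kernel using float4 might look like the sketch below (the kernel and argument names are hypothetical; whether a CPU driver actually maps float4 operations to SIMD instructions depends on the implementation):

```c
// Hypothetical kernel: fused multiply-add over float4 lanes.
// On a CPU OpenCL implementation, each float4 operation can in
// principle map to a single 128-bit SIMD instruction.
__kernel void madd4(__global const float4 *a,
                    __global const float4 *b,
                    __global float4 *out)
{
    size_t i = get_global_id(0);
    out[i] = a[i] * b[i] + (float4)(1.0f);
}
```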

Testing equality between two __m128i variables

Submitted by ↘锁芯ラ on 2019-12-18 13:13:04
Question: If I want to do a bitwise equality test between two __m128i variables, am I required to use an SSE instruction or can I use ==? If not, which SSE instruction should I use? Answer 1: Although using _mm_movemask_epi8 is one solution, if you have a processor with SSE4.1 I think a better solution is to use an instruction that sets the zero or carry flag in the FLAGS register. This saves a test or cmp instruction. To do this you could write: if (_mm_test_all_ones(_mm_cmpeq_epi8(v1, v2))) { // v1 == v2 …

Branch and predicated instructions

Submitted by [亡魂溺海] on 2019-12-18 12:58:10
Question: Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructions". I don't understand the difference between the two, and why one leads to better performance than the other. This comment suggests that branch instructions lead to a greater number of executed instructions, stalling due to "branch address resolution and fetch", and overhead due to "the branch itself" and "book-keeping for …

Reverse an AVX register containing doubles using a single AVX intrinsic

Submitted by 大憨熊 on 2019-12-18 09:08:15
Question: If I have an AVX register with 4 doubles in it and I want to store its reverse in another register, is it possible to do this with a single intrinsic? For example, with 4 floats in an SSE register I could use: _mm_shuffle_ps(A,A,_MM_SHUFFLE(0,1,2,3)); Can I do this using, say, _mm256_permute2f128_pd()? I don't think you can address each individual double using that intrinsic. Answer 1: You actually need 2 permutes to do this: _mm256_permute2f128_pd() only permutes in …
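The two-permute AVX answer, plus the single-instruction alternative that AVX2 later added, can be sketched as follows (an untested sketch: it requires compiling with -mavx / -mavx2, and the function names are mine):

```c
#include <immintrin.h>

/* AVX only: first swap the two 128-bit halves, then swap the two
   doubles inside each half -- two permutes total. */
static __m256d reverse_pd_avx(__m256d v)          /* v = [a b c d] */
{
    __m256d h = _mm256_permute2f128_pd(v, v, 1);  /* [c d a b] */
    return _mm256_permute_pd(h, 0x5);             /* [d c b a] */
}

/* AVX2: one cross-lane permute does the whole reversal. */
static __m256d reverse_pd_avx2(__m256d v)
{
    return _mm256_permute4x64_pd(v, _MM_SHUFFLE(0, 1, 2, 3));
}
```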

Shifting 4 integers right by different values SIMD

Submitted by 半城伤御伤魂 on 2019-12-18 06:50:14
Question: SSE does not provide a way to shift packed integers by per-element variable amounts (I can use any instructions from AVX and older). You can only do uniform shifts. The result I'm trying to achieve for each integer in the vector is this: i[0] = i[0] & 0b111111; i[1] = (i[1]>>6) & 0b111111; i[2] = (i[2]>>12) & 0b111111; i[3] = (i[3]>>18) & 0b111111; Essentially I'm trying to isolate a different group of 6 bits in each integer. So what is the optimal solution? Things I thought about: you can simulate a variable …

Sparse array compression using SIMD (AVX2)

Submitted by 时光怂恿深爱的人放手 on 2019-12-18 04:55:18
Question: I have a sparse array a (mostly zeroes): unsigned char a[1000000]; and I would like to create an array b of indexes to the non-zero elements of a, using SIMD instructions on the Intel x64 architecture with AVX2. I'm looking for tips on how to do it efficiently. Specifically, is there a SIMD instruction (or sequence of instructions) to get the positions of consecutive non-zero elements of a SIMD register, arranged contiguously? Answer 1: Five methods to compute the indices of the nonzeros are: Semi-vectorized loop: load a SIMD vector with …

Compilation of a simple C++ program using SSE intrinsics

Submitted by 前提是你 on 2019-12-18 04:19:10
Question: I am new to SSE instructions and was trying to learn them from this site: http://www.codeproject.com/Articles/4522/Introduction-to-SSE-Programming I am using the GCC compiler on Ubuntu 10.10 with an Intel Core i7 960 CPU. Here is a code based on the article which I attempted: for two arrays of length ARRAY_SIZE it calculates fResult[i] = sqrt( fSource1[i]*fSource1[i] + fSource2[i]*fSource2[i] ) + 0.5. Here is the code: #include <iostream> #include <iomanip> #include <ctime> #include …

Fast vectorized conversion from RGB to BGRA

Submitted by 允我心安 on 2019-12-18 04:18:11
Question: In a follow-up to some previous questions on converting RGB to RGBA, and ARGB to BGR, I would like to speed up an RGB to BGRA conversion with SSE. Assume a 32-bit machine, and that I would like to use intrinsics. I'm having difficulty aligning both source and destination buffers to work with 128-bit registers, and I'm looking for other savvy vectorization solutions. The routine to be vectorized is as follows: void RGB8ToBGRX8(int w, const void *in, void *out) { int i; int width = w; const unsigned char …

Fast counting the number of equal bytes between two arrays [duplicate]

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-18 02:43:51
Question: This question already has answers here: Can counting byte matches between two strings be optimized using SIMD? (3 answers) Closed 8 months ago. I wrote the function int compare_16bytes(__m128i lhs, __m128i rhs) in order to compare two 16-byte values using SSE instructions: this function returns how many bytes are equal after performing the comparison. Now I would like to use the above function to compare two byte arrays of arbitrary length: the length may not be a multiple of 16 …

How to transpose a 16x16 matrix using SIMD instructions?

Submitted by 假装没事ソ on 2019-12-17 23:47:18
Question: I'm currently writing some code targeting Intel's forthcoming AVX-512 SIMD instructions, which support 512-bit operations. Now, assuming the matrix is represented by 16 SIMD registers, each holding 16 32-bit integers (one per row), how can I transpose the matrix with purely SIMD instructions? There are already solutions for transposing 4x4 and 8x8 matrices with SSE and AVX2, respectively, but I couldn't figure out how to extend them to 16x16 with AVX-512. Any ideas? Answer 1: For two-operand …