SIMD

SSE2 8x8 byte-matrix transpose code twice as slow on Haswell+ than on Ivy Bridge

大城市里の小女人 submitted on 2019-12-22 10:53:59
Question: I have code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling. I profiled it with IACA to see if it was worth writing an AVX2 routine, and surprisingly the code is almost twice as slow on Haswell/Skylake than on IVB (IVB: 19.8, HSW/SKL: 36 cycles). (IVB+HSW using IACA 2.1, SKL using 3.0, but HSW gives the same number with 3.0.) From the IACA output I guess the difference is that IVB uses ports 1 and 5 for …
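For reference (this is not the asker's code), the operation being tuned is a plain 8x8 byte transpose; a scalar version like the one below is useful for validating a punpckl/punpckh-based SIMD routine, which does the same reordering in three rounds of byte interleaves:

```c
#include <stdint.h>

/* Reference (scalar) 8x8 byte-matrix transpose: out[c][r] = in[r][c].
 * An SSE2 implementation reaches the same result with log2(8) = 3
 * stages of punpcklbw/punpckhbw interleaves instead of this loop. */
void transpose8x8(const uint8_t in[8][8], uint8_t out[8][8]) {
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            out[c][r] = in[r][c];
}
```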

Help me improve some more SSE2 code

别说谁变了你拦得住时间么 submitted on 2019-12-22 10:38:45
Question: I am looking for some help to improve this bilinear scaling SSE2 code on Core 2 CPUs. On my Atom N270 and on an i7 this code is about 2x faster than the MMX code, but on Core 2 CPUs it is only on par with the MMX code. The code follows: void ConversionProcess::convert_SSE2(BBitmap *from, BBitmap *to) { uint32 fromBPR, toBPR, fromBPRDIV4, x, y, yr, xr; ULLint start = rdtsc(); ULLint stop; if (from && to) { uint32 width, height; width = from->Bounds().IntegerWidth() + 1; height = from->Bounds() …
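The kernel in question is a weighted average of a 2x2 pixel block. As a point of reference (names and fixed-point layout are my own, not taken from the question), the per-channel scalar version looks like this; the SSE2 code performs the same arithmetic on four pixels at a time:

```c
#include <stdint.h>

/* Scalar bilinear blend of a 2x2 block for one 8-bit channel.
 * fx, fy are 8.8-style fixed-point fractions in 0..255. */
uint8_t bilerp8(uint8_t p00, uint8_t p10, uint8_t p01, uint8_t p11,
                unsigned fx, unsigned fy) {
    unsigned top = p00 * (256 - fx) + p10 * fx;  /* horizontal blend, row 0 */
    unsigned bot = p01 * (256 - fx) + p11 * fx;  /* horizontal blend, row 1 */
    return (uint8_t)((top * (256 - fy) + bot * fy) >> 16);  /* vertical blend */
}
```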

Fast in-register sort of bytes?

橙三吉。 submitted on 2019-12-22 05:18:19
Question: Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions. Thanks in advance. Answer 1: Look up an efficient sorting network for N = the number of bytes you care about (4 or 16). Convert that to a sequence of compare-and-exchange instructions. (For N = 16 that will be more than "a few", though.) Answer 2: Found it! It's in the 2007 paper "Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting …
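To make Answer 1 concrete, here is the standard 5-comparator network for N = 4, written scalar for illustration; in SIMD each cswap becomes an unsigned min/max pair (pminub/pmaxub) on shuffled copies of the register:

```c
#include <stdint.h>

/* Compare-exchange primitive: orders a pair ascending. */
static void cswap(uint8_t *a, uint8_t *b) {
    uint8_t lo = *a < *b ? *a : *b;
    uint8_t hi = *a < *b ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Optimal sorting network for 4 elements: 5 compare-exchanges. */
void sort4(uint8_t v[4]) {
    cswap(&v[0], &v[1]);
    cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]);
    cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);
}
```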

Efficient way of rotating a byte inside an AVX register

一曲冷凌霜 submitted on 2019-12-22 04:49:16
Question: Summary/tl;dr: Is there any way to rotate a byte in a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8 bytes in a YMM register, I need to left-rotate 7 of them. Each byte needs to be rotated one bit more to the left than the previous one: the first byte should be rotated 0 bits and the seventh should be rotated 6 bits. Currently I have an implementation that does this by [using the 1-bit rotate as an example here] shifting the …
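A scalar sketch of the desired semantics (my reading of the question, not the asker's code): byte i of each 64-bit lane is bit-rotated left by i. The two-shift-and-combine expression in rotl8 is exactly the idiom the question wants to avoid, since AVX2 has no byte-granularity rotate:

```c
#include <stdint.h>

/* Left-rotate an 8-bit value by r bits: the shift/shift/combine idiom. */
uint8_t rotl8(uint8_t x, unsigned r) {
    r &= 7;
    return (uint8_t)((x << r) | (x >> ((8 - r) & 7)));
}

/* Apply the per-byte rotation pattern to one 64-bit lane:
 * byte 0 rotated 0 bits, byte 1 rotated 1 bit, ..., byte 7 rotated 7. */
uint64_t rotate_lane(uint64_t lane) {
    uint64_t out = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint8_t b = (uint8_t)(lane >> (8 * i));
        out |= (uint64_t)rotl8(b, i) << (8 * i);
    }
    return out;
}
```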

SIMD/SSE: How to check that all vector elements are non-zero

巧了我就是萌 submitted on 2019-12-22 03:45:59
Question: I need to check that all vector elements are non-zero. So far I have found the following solution. Is there a better way to do this? I am using gcc 4.8.2 on Linux/x86_64, with instructions up to SSE4.2. typedef char ChrVect __attribute__((vector_size(16), aligned(16))); inline bool testNonzero(ChrVect vect) { const ChrVect vzero = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; return (0 == (__int128_t)(vzero == vect)); } Update: the code above compiles to the following assembly (when compiled as a non-inline function): …
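The usual SSE answer to this is pcmpeqb against zero followed by pmovmskb, then a single scalar test of the mask. A portable scalar model of that idiom (illustrative, not the asker's gcc-vector-extension code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of _mm_cmpeq_epi8 + _mm_movemask_epi8: set bit i of the mask
 * where byte i equals zero. All elements are non-zero exactly when
 * the mask is zero. */
bool all_nonzero(const uint8_t v[16]) {
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(v[i] == 0) << i;
    return mask == 0;
}
```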

Vectorize a function in clang

情到浓时终转凉″ submitted on 2019-12-22 03:40:54
Question: I am trying to vectorize the following function with clang according to this clang reference. It takes a vector of bytes and applies a mask to it according to this RFC. static void apply_mask(vector<uint8_t> &payload, uint8_t (&masking_key)[4]) { #pragma clang loop vectorize(enable) interleave(enable) for (size_t i = 0; i < payload.size(); i++) { payload[i] = payload[i] ^ masking_key[i % 4]; } } The following flags are passed to clang: -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize …
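A plain-C restatement of the loop, useful for checking the semantics independently of the vectorizer. The i % 4 access pattern is typically what the vectorizer report complains about; a common workaround (an assumption on my part, not from the question) is to widen the 4-byte key into a 32-bit word and XOR word-at-a-time:

```c
#include <stddef.h>
#include <stdint.h>

/* WebSocket-style masking: XOR each payload byte with the key byte
 * at position i mod 4 (the literal loop from the question, in C). */
void apply_mask_scalar(uint8_t *payload, size_t len,
                       const uint8_t masking_key[4]) {
    for (size_t i = 0; i < len; i++)
        payload[i] ^= masking_key[i % 4];
}
```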

Fastest way to compute distance squared

时间秒杀一切 submitted on 2019-12-22 01:52:39
Question: My code relies heavily on computing distances between two points in 3D space. To avoid the expensive square root I use the squared distance throughout, but it still takes up a major fraction of the computing time and I would like to replace my simple function with something even faster. I now have: double distance_squared(double *a, double *b) { double dx = a[0] - b[0]; double dy = a[1] - b[1]; double dz = a[2] - b[2]; return dx*dx + dy*dy + dz*dz; } I also tried using a macro to avoid the …
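A single squared-distance call is already near-minimal; the common answer is to batch the computation instead. With a structure-of-arrays layout (my suggested restructuring, not the asker's code), compilers can auto-vectorize the loop at -O3 and produce several results per iteration:

```c
#include <stddef.h>

/* Batched squared distances over structure-of-arrays inputs:
 * out[i] = |(ax,ay,az)[i] - (bx,by,bz)[i]|^2. Contiguous per-axis
 * arrays let the compiler emit SIMD subtract/multiply/add directly. */
void distance_squared_batch(const double *ax, const double *ay, const double *az,
                            const double *bx, const double *by, const double *bz,
                            double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double dx = ax[i] - bx[i];
        double dy = ay[i] - by[i];
        double dz = az[i] - bz[i];
        out[i] = dx * dx + dy * dy + dz * dz;
    }
}
```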

Is there any guarantee that all threads in a WaveFront (OpenCL) are always synchronized?

夙愿已清 submitted on 2019-12-22 01:36:42
Question: As is known, there are warps (in CUDA) and wavefronts (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf Warps in CUDA: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture 4.1. SIMT Architecture ... A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially …

Shift elements to the left of a SIMD register based on boolean mask

蓝咒 submitted on 2019-12-22 00:36:03
Question: This question is related to this one: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector. I would like to create an optimal function with this signature: __m256i PackLeft(__m256i inputVector, __m256i boolVector); The desired behaviour is that on an input of 64-bit ints like this: inputVector = {42, 17, 13, 3} boolVector = {true, false, true, false} it masks out all values that have false in the boolVector and then repacks the remaining values to the left. For the input above, the return value …
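A scalar reference for the left-pack operation, handy for testing a SIMD candidate (this is the specification, not an implementation of PackLeft itself; SIMD versions typically build a shuffle control from the mask, e.g. via a small lookup table or BMI2 pext, and apply one permute):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Keep the elements whose mask entry is true, packed to the front of
 * out; returns the number of elements kept. Trailing slots of out are
 * left untouched (a SIMD version would fill them with zeros or junk). */
size_t pack_left(const uint64_t *in, const bool *keep,
                 uint64_t *out, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (keep[i])
            out[k++] = in[i];
    return k;
}
```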

Realistic deadlock example in CUDA/OpenCL

拈花ヽ惹草 submitted on 2019-12-21 22:31:26
Question: For a tutorial I'm writing, I'm looking for a "realistic" and simple example of a deadlock caused by not accounting for SIMT/SIMD execution. I came up with this snippet, which seems to be a good example. Any input would be appreciated. … int x = threadID / 2; if (threadID > x) { value[threadID] = 42; barrier(); } else { value2[threadID/2] = 13; barrier(); } result = value[threadID/2] + value2[threadID/2]; I know it is neither proper CUDA C nor OpenCL C. Answer 1: A simple deadlock that is actually easy to catch …