SIMD

SSE2 8x8 byte-matrix transpose code twice as slow on Haswell+ than on Ivy Bridge

大城市里の小女人 submitted on 2019-12-22 10:53:59
Question: I have code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling. I profiled it with IACA to see if it was worth writing an AVX2 routine, and surprisingly the code is almost twice as slow on Haswell/Skylake than on IVB (IVB: 19.8, HSW/SKL: 36 cycles). (IVB+HSW using IACA 2.1, SKL using 3.0, but HSW gives the same number with 3.0.) From the IACA output I guess the difference is that IVB uses ports 1 and 5 for …
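For reference (this is not the asker's code), the operation being tuned is a plain 8x8 byte transpose; a scalar version like the one below is useful for validating a punpckl/punpckh-based SIMD routine, which does the same reordering in three rounds of byte interleaves:

```c
#include <stdint.h>

/* Reference (scalar) 8x8 byte-matrix transpose: out[c][r] = in[r][c].
 * An SSE2 implementation reaches the same result with log2(8) = 3
 * stages of punpcklbw/punpckhbw interleaves instead of this loop. */
void transpose8x8(const uint8_t in[8][8], uint8_t out[8][8]) {
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++)
            out[c][r] = in[r][c];
}
```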

Help me improve some more SSE2 code

别说谁变了你拦得住时间么 submitted on 2019-12-22 10:38:45
Question: I am looking for some help to improve this bilinear scaling SSE2 code on Core 2 CPUs. On my Atom N270 and on an i7 this code is about 2x faster than the MMX code, but on Core 2 CPUs it is only on par with the MMX code. The code follows: void ConversionProcess::convert_SSE2(BBitmap *from, BBitmap *to) { uint32 fromBPR, toBPR, fromBPRDIV4, x, y, yr, xr; ULLint start = rdtsc(); ULLint stop; if (from && to) { uint32 width, height; width = from->Bounds().IntegerWidth() + 1; height = from->Bounds() …
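The kernel in question is a weighted average of a 2x2 pixel block. As a point of reference (names and fixed-point layout are my own, not taken from the question), the per-channel scalar version looks like this; the SSE2 code performs the same arithmetic on four pixels at a time:

```c
#include <stdint.h>

/* Scalar bilinear blend of a 2x2 block for one 8-bit channel.
 * fx, fy are 8.8-style fixed-point fractions in 0..255. */
uint8_t bilerp8(uint8_t p00, uint8_t p10, uint8_t p01, uint8_t p11,
                unsigned fx, unsigned fy) {
    unsigned top = p00 * (256 - fx) + p10 * fx;  /* horizontal blend, row 0 */
    unsigned bot = p01 * (256 - fx) + p11 * fx;  /* horizontal blend, row 1 */
    return (uint8_t)((top * (256 - fy) + bot * fy) >> 16);  /* vertical blend */
}
```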

Fast in-register sort of bytes?

橙三吉。 submitted on 2019-12-22 05:18:19
Question: Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions. Thanks in advance. Answer 1: Look up an efficient sorting network for N = the number of bytes you care about (4 or 16). Convert that to a sequence of compare-and-exchange instructions. (For N = 16 that will be more than "a few", though.) Answer 2: Found it! It's in the 2007 paper "Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting …
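To make Answer 1 concrete, here is the standard 5-comparator network for N = 4, written scalar for illustration; in SIMD each cswap becomes an unsigned min/max pair (pminub/pmaxub) on shuffled copies of the register:

```c
#include <stdint.h>

/* Compare-exchange primitive: orders a pair ascending. */
static void cswap(uint8_t *a, uint8_t *b) {
    uint8_t lo = *a < *b ? *a : *b;
    uint8_t hi = *a < *b ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Optimal sorting network for 4 elements: 5 compare-exchanges. */
void sort4(uint8_t v[4]) {
    cswap(&v[0], &v[1]);
    cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]);
    cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);
}
```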

Efficient way of rotating a byte inside an AVX register

一曲冷凌霜 submitted on 2019-12-22 04:49:16
Question: Summary/tl;dr: Is there any way to rotate a byte in a YMM register bitwise (using AVX), other than doing 2x shifts and blending the results together? For each 8 bytes in a YMM register, I need to left-rotate 7 of them. Each byte needs to be rotated one bit more to the left than the previous one: the first byte should be rotated 0 bits and the seventh should be rotated 6 bits. Currently I have an implementation that does this by [using the 1-bit rotate as an example here] shifting the …
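A scalar sketch of the desired semantics (my reading of the question, not the asker's code): byte i of each 64-bit lane is bit-rotated left by i. The two-shift-and-combine expression in rotl8 is exactly the idiom the question wants to avoid, since AVX2 has no byte-granularity rotate:

```c
#include <stdint.h>

/* Left-rotate an 8-bit value by r bits: the shift/shift/combine idiom. */
uint8_t rotl8(uint8_t x, unsigned r) {
    r &= 7;
    return (uint8_t)((x << r) | (x >> ((8 - r) & 7)));
}

/* Apply the per-byte rotation pattern to one 64-bit lane:
 * byte 0 rotated 0 bits, byte 1 rotated 1 bit, ..., byte 7 rotated 7. */
uint64_t rotate_lane(uint64_t lane) {
    uint64_t out = 0;
    for (unsigned i = 0; i < 8; i++) {
        uint8_t b = (uint8_t)(lane >> (8 * i));
        out |= (uint64_t)rotl8(b, i) << (8 * i);
    }
    return out;
}
```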

SIMD/SSE: How to check that all vector elements are non-zero

巧了我就是萌 submitted on 2019-12-22 03:45:59
Question: I need to check that all vector elements are non-zero. So far I have found the following solution. Is there a better way to do this? I am using gcc 4.8.2 on Linux/x86_64, with instructions up to SSE4.2. typedef char ChrVect __attribute__((vector_size(16), aligned(16))); inline bool testNonzero(ChrVect vect) { const ChrVect vzero = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}; return (0 == (__int128_t)(vzero == vect)); } Update: the code above compiles to the following assembly (when compiled as a non-inline function): …
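The usual SSE answer to this is pcmpeqb against zero followed by pmovmskb, then a single scalar test of the mask. A portable scalar model of that idiom (illustrative, not the asker's gcc-vector-extension code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of _mm_cmpeq_epi8 + _mm_movemask_epi8: set bit i of the mask
 * where byte i equals zero. All elements are non-zero exactly when
 * the mask is zero. */
bool all_nonzero(const uint8_t v[16]) {
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        mask |= (unsigned)(v[i] == 0) << i;
    return mask == 0;
}
```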

Vectorize a function in clang

情到浓时终转凉″ submitted on 2019-12-22 03:40:54
Question: I am trying to vectorize the following function with clang according to this clang reference. It takes a vector of bytes and applies a mask to it according to this RFC. static void apply_mask(vector<uint8_t> &payload, uint8_t (&masking_key)[4]) { #pragma clang loop vectorize(enable) interleave(enable) for (size_t i = 0; i < payload.size(); i++) { payload[i] = payload[i] ^ masking_key[i % 4]; } } The following flags are passed to clang: -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize …
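A plain-C restatement of the loop, useful for checking the semantics independently of the vectorizer. The i % 4 access pattern is typically what the vectorizer report complains about; a common workaround (an assumption on my part, not from the question) is to widen the 4-byte key into a 32-bit word and XOR word-at-a-time:

```c
#include <stddef.h>
#include <stdint.h>

/* WebSocket-style masking: XOR each payload byte with the key byte
 * at position i mod 4 (the literal loop from the question, in C). */
void apply_mask_scalar(uint8_t *payload, size_t len,
                       const uint8_t masking_key[4]) {
    for (size_t i = 0; i < len; i++)
        payload[i] ^= masking_key[i % 4];
}
```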

Fastest way to compute distance squared

时间秒杀一切 submitted on 2019-12-22 01:52:39
Question: My code relies heavily on computing distances between two points in 3D space. To avoid the expensive square root I use the squared distance throughout, but it still takes up a major fraction of the computing time and I would like to replace my simple function with something even faster. I now have: double distance_squared(double *a, double *b) { double dx = a[0] - b[0]; double dy = a[1] - b[1]; double dz = a[2] - b[2]; return dx*dx + dy*dy + dz*dz; } I also tried using a macro to avoid the …
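A single squared-distance call is already near-minimal; the common answer is to batch the computation instead. With a structure-of-arrays layout (my suggested restructuring, not the asker's code), compilers can auto-vectorize the loop at -O3 and produce several results per iteration:

```c
#include <stddef.h>

/* Batched squared distances over structure-of-arrays inputs:
 * out[i] = |(ax,ay,az)[i] - (bx,by,bz)[i]|^2. Contiguous per-axis
 * arrays let the compiler emit SIMD subtract/multiply/add directly. */
void distance_squared_batch(const double *ax, const double *ay, const double *az,
                            const double *bx, const double *by, const double *bz,
                            double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double dx = ax[i] - bx[i];
        double dy = ay[i] - by[i];
        double dz = az[i] - bz[i];
        out[i] = dx * dx + dy * dy + dz * dz;
    }
}
```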

Is there any guarantee that all threads in a WaveFront (OpenCL) are always synchronized?

夙愿已清 submitted on 2019-12-22 01:36:42
Question: As is known, there are warps (in CUDA) and wavefronts (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf Warps in CUDA: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#simt-architecture 4.1. SIMT Architecture ... A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially …

Shift elements to the left of a SIMD register based on boolean mask

蓝咒 submitted on 2019-12-22 00:36:03
Question: This question is related to this one: Optimal uint8_t bitmap into a 8 x 32bit SIMD "bool" vector. I would like to create an optimal function with this signature: __m256i PackLeft(__m256i inputVector, __m256i boolVector); The desired behaviour is that on an input of 64-bit ints like this: inputVector = {42, 17, 13, 3} boolVector = {true, false, true, false} it masks out all values that have false in the boolVector and then repacks the remaining values to the left. For the input above, the return value …
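A scalar reference for the left-pack operation, handy for testing a SIMD candidate (this is the specification, not an implementation of PackLeft itself; SIMD versions typically build a shuffle control from the mask, e.g. via a small lookup table or BMI2 pext, and apply one permute):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Keep the elements whose mask entry is true, packed to the front of
 * out; returns the number of elements kept. Trailing slots of out are
 * left untouched (a SIMD version would fill them with zeros or junk). */
size_t pack_left(const uint64_t *in, const bool *keep,
                 uint64_t *out, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (keep[i])
            out[k++] = in[i];
    return k;
}
```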

Realistic deadlock example in CUDA/OpenCL

拈花ヽ惹草 submitted on 2019-12-21 22:31:26
Question: For a tutorial I'm writing, I'm looking for a "realistic" and simple example of a deadlock caused by not accounting for SIMT/SIMD execution. I came up with this snippet, which seems to be a good example. Any input would be appreciated. … int x = threadID / 2; if (threadID > x) { value[threadID] = 42; barrier(); } else { value2[threadID/2] = 13; barrier(); } result = value[threadID/2] + value2[threadID/2]; I know it is neither proper CUDA C nor OpenCL C. Answer 1: A simple deadlock that is actually easy to catch …