simd

memcpy moving 128 bits in Linux

倾然丶 夕夏残阳落幕 submitted on 2019-12-06 07:04:19
Question: I'm writing a device driver in Linux for a PCIe device. This device driver performs several reads and writes to test the throughput. When I use memcpy, the maximum payload for a TLP is 8 bytes (on 64-bit architectures). In my opinion the only way to get a payload of 16 bytes is to use the SSE instruction set. I've already seen this but the code doesn't compile (AT&T/Intel syntax issue). Is there a way to use that code inside Linux? Does anyone know where I can find an implementation
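For reference, the usual shape of such a copy (my own sketch, not from the thread): kernel C code is compiled with SSE disabled, so the 16-byte store has to be written as inline assembly and bracketed by kernel_fpu_begin()/kernel_fpu_end(). The function name is hypothetical; it assumes dst is a 16-byte-aligned pointer into a BAR mapped with ioremap_wc(), and the header location of the FPU API varies by kernel version:

    #include <linux/kernel.h>
    #include <asm/fpu/api.h>   /* kernel_fpu_begin()/kernel_fpu_end() */

    /* Write 16 bytes to device memory with a single SSE store. */
    static void pcie_write_128(void __iomem *dst, const void *src)
    {
        kernel_fpu_begin();
        asm volatile("movdqu (%0), %%xmm0\n\t"
                     "movntdq %%xmm0, (%1)"   /* non-temporal 16-byte store */
                     : : "r"(src), "r"(dst)
                     : "xmm0", "memory");
        kernel_fpu_end();
    }

Whether this actually reaches the device as a single 16-byte TLP still depends on the mapping type and the root complex, so it is worth verifying with a bus analyzer or the device's own counters.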

Existence of “simd reduction(:)” In GCC and MSVC?

不想你离开。 submitted on 2019-12-06 06:43:58
Question: The simd pragma (#pragma simd, #pragma simd reduction(+:acc), #pragma ivdep) can be used with the icc compiler to perform a reduction:

    #pragma simd reduction(+:acc)
    for (int i(0); i < N; ++i) {
        acc += x[i];
    }

Is there any equivalent solution in MSVC and/or GCC? Ref (p. 28): http://d3f8ykwhia686p.cloudfront.net/1live/intel/CompilerAutovectorizationGuide.pdf

Answer 1: GCC definitely can vectorize. Suppose you have a file reduc.c with contents:

    int foo(int *x, int N)
    {
        int acc, i;
        for (i = 0; i < N; ++i) {
            acc += x[i];
        }
        return acc;
    }
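For completeness, the portable spelling of the same reduction today is OpenMP 4.0's simd pragma, which GCC accepts with -fopenmp or -fopenmp-simd and recent MSVC accepts under /openmp:experimental (a sketch of mine, not taken from the answer):

    int sum_simd(const int *x, int N)
    {
        int acc = 0;
        #pragma omp simd reduction(+:acc)
        for (int i = 0; i < N; ++i)
            acc += x[i];
        return acc;
    }

With -fopenmp-simd the pragma only affects vectorization and pulls in no OpenMP runtime.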

How do I Perform Integer SIMD operations on the iPad A4 Processor?

跟風遠走 submitted on 2019-12-06 05:07:45
Question: I feel the need for speed. Double for loops are killing my iPad app's performance. I need SIMD. How do I perform integer SIMD operations on the iPad A4 processor? Thanks, Doug

Answer 1: To get the fastest speed, you will have to write ARM assembly language code that uses NEON SIMD operations, because C compilers generally don't generate very good SIMD code, so hand-written assembly will make a big difference. I have a brief intro here: http://www.shervinemami.co.cc/iphoneAssembly.html Note that the
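As a middle ground between plain C and hand-written assembly, NEON is also reachable from C through the arm_neon.h intrinsics. A minimal sketch (mine, not from the answer; the function name is made up and n is assumed to be a multiple of 16):

    #include <arm_neon.h>
    #include <stdint.h>

    /* dst[i] = a[i] + b[i], 16 bytes per iteration. */
    void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
    {
        for (int i = 0; i < n; i += 16) {
            uint8x16_t va = vld1q_u8(a + i);
            uint8x16_t vb = vld1q_u8(b + i);
            vst1q_u8(dst + i, vaddq_u8(va, vb));
        }
    }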

SSE2 8x8 byte-matrix transpose code twice as slow on Haswell+ than on Ivy Bridge

南楼画角 submitted on 2019-12-06 04:54:48
I have code with a lot of punpckl, pextrd and pinsrd that rotates an 8x8 byte matrix as part of a larger routine that rotates a B/W image with loop tiling. I profiled it with IACA to see if it was worth writing an AVX2 routine, and surprisingly the code is almost twice as slow on Haswell/Skylake as on IVB (IVB: 19.8 cycles, HSW/SKL: 36 cycles). (IVB and HSW measured with IACA 2.1, SKL with 3.0, but HSW gives the same number with 3.0.) From the IACA output I guess the difference is that IVB uses ports 1 and 5 for the above instructions, while Haswell only uses port 5. I googled a bit, but couldn't find an explanation. Is
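The asker's code isn't shown; for context, a generic punpck-based 8x8 byte transpose looks roughly like this (my reconstruction, three unpack stages, every one of which needs the shuffle port under discussion):

    #include <emmintrin.h>
    #include <stdint.h>

    /* Transpose an 8x8 byte matrix (row-major, stride 8). */
    void transpose8x8(const uint8_t *in, uint8_t *out)
    {
        __m128i r0 = _mm_loadl_epi64((const __m128i *)(in + 0));
        __m128i r1 = _mm_loadl_epi64((const __m128i *)(in + 8));
        __m128i r2 = _mm_loadl_epi64((const __m128i *)(in + 16));
        __m128i r3 = _mm_loadl_epi64((const __m128i *)(in + 24));
        __m128i r4 = _mm_loadl_epi64((const __m128i *)(in + 32));
        __m128i r5 = _mm_loadl_epi64((const __m128i *)(in + 40));
        __m128i r6 = _mm_loadl_epi64((const __m128i *)(in + 48));
        __m128i r7 = _mm_loadl_epi64((const __m128i *)(in + 56));

        __m128i t0 = _mm_unpacklo_epi8(r0, r1);   /* a0 b0 a1 b1 ... */
        __m128i t1 = _mm_unpacklo_epi8(r2, r3);
        __m128i t2 = _mm_unpacklo_epi8(r4, r5);
        __m128i t3 = _mm_unpacklo_epi8(r6, r7);

        __m128i u0 = _mm_unpacklo_epi16(t0, t1);  /* a0 b0 c0 d0 ... */
        __m128i u1 = _mm_unpackhi_epi16(t0, t1);
        __m128i u2 = _mm_unpacklo_epi16(t2, t3);
        __m128i u3 = _mm_unpackhi_epi16(t2, t3);

        /* Each final unpack yields two transposed rows per register. */
        _mm_storeu_si128((__m128i *)(out + 0),  _mm_unpacklo_epi32(u0, u2));
        _mm_storeu_si128((__m128i *)(out + 16), _mm_unpackhi_epi32(u0, u2));
        _mm_storeu_si128((__m128i *)(out + 32), _mm_unpacklo_epi32(u1, u3));
        _mm_storeu_si128((__m128i *)(out + 48), _mm_unpackhi_epi32(u1, u3));
    }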

How to speed up calculation of integral image?

烈酒焚心 submitted on 2019-12-06 03:22:21
Question: I often need to calculate an integral image. This is a simple algorithm:

    void integral_sum(const uint8_t *src, size_t src_stride, size_t width, size_t height, uint32_t *sum, size_t sum_stride)
    {
        memset(sum, 0, (width + 1) * sizeof(uint32_t));
        sum += sum_stride + 1;
        for (size_t row = 0; row < height; row++)
        {
            uint32_t row_sum = 0;
            sum[-1] = 0;
            for (size_t col = 0; col < width; col++)
            {
                row_sum += src[col];
                sum[col] = row_sum + sum[col - sum_stride];
            }
            src += src_stride;
            sum += sum_stride;
        }
    }
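The hard part to vectorize is the serial row_sum dependency. The standard building block is an in-register inclusive prefix sum; a minimal SSE2 sketch (mine, not from the thread):

    #include <emmintrin.h>

    /* Inclusive prefix sum of four 32-bit lanes:
       {a, b, c, d} -> {a, a+b, a+b+c, a+b+c+d}. */
    static inline __m128i prefix_sum_u32(__m128i x)
    {
        x = _mm_add_epi32(x, _mm_slli_si128(x, 4));
        x = _mm_add_epi32(x, _mm_slli_si128(x, 8));
        return x;
    }

Between groups of four pixels, the running sum is carried by broadcasting the last lane (e.g. with _mm_shuffle_epi32(v, 0xFF)) and adding it to the next group's prefix sum.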

QWORD shuffle sequential 7-bits to byte-alignment with SIMD SSE…AVX

泪湿孤枕 submitted on 2019-12-06 02:09:45
I would like to know if the following is possible in any of the SIMD families of instructions. I have a qword input with 63 significant bits (never negative). Each sequential group of 7 bits starting from the LSB is shuffle-aligned to a byte, with a left-padding of 1 (except for the most significant non-zero byte). To illustrate, I'll use letters for clarity's sake. The result is only the significant bytes, thus 0-9 in size, which is converted to a byte array.

    In:  0|kjihgfe|dcbaZYX|WVUTSRQ|PONMLKJ|IHGFEDC|BAzyxwv|utsrqpo|nmlkjih|gfedcba
    Out: 0kjihgfe|1dcbaZYX|1WVUTSRQ|1PONMLKJ|1IHGFEDC|1BAzyxwv
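Strictly speaking this is not SIMD, but on x86 the 7-bit-to-byte spread itself is exactly what BMI2's PDEP instruction does (Haswell and later, -mbmi2). A sketch of just that core step (mine, not from an answer), leaving out the continuation-bit padding and the trim to significant bytes, and assuming a little-endian byte order:

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    /* Spread the low 56 bits of x into 8 bytes of 7 bits each;
       the ninth output byte holds the remaining top 7 bits. */
    void spread7(uint64_t x, uint8_t out[9])
    {
        uint64_t lo = _pdep_u64(x, 0x7f7f7f7f7f7f7f7fULL);
        memcpy(out, &lo, 8);
        out[8] = (uint8_t)(x >> 56);
    }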

Count leading zeros in __m256i word

白昼怎懂夜的黑 submitted on 2019-12-06 00:53:03
Question: I'm tinkering around with AVX2 instructions and I'm looking for a fast way to count the number of leading zeros in a __m256i word (which has 256 bits). So far, I have figured out the following way:

    // Computes the number of leading zero bits.
    // Here, avx_word is of type __m256i.
    if (!_mm256_testz_si256(avx_word, avx_word)) {
        uint64_t word = _mm256_extract_epi64(avx_word, 0);
        if (word > 0)
            return (__builtin_clzll(word));
        word = _mm256_extract_epi64(avx_word, 1);
        if (word > 0)
            return (_
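For reference, a compact scalar version (my sketch, not an answer from the thread). Note that it scans from element 3 downward, since on little-endian x86 the high qword of a __m256i is element 3, which is where leading zeros actually begin:

    #include <immintrin.h>
    #include <stdint.h>

    int clz_256(__m256i v)
    {
        uint64_t w[4];
        _mm256_storeu_si256((__m256i *)w, v);
        for (int i = 3; i >= 0; --i)   /* w[3] holds the most significant bits */
            if (w[i])
                return (3 - i) * 64 + __builtin_clzll(w[i]);
        return 256;                    /* all bits zero */
    }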

AVX2, How to Efficiently Load Four Integers to Even Indices of a 256 Bit Register and Copy to Odd Indices?

蹲街弑〆低调 submitted on 2019-12-05 23:43:49
Question: I have an aligned array of integers in memory containing indices I0, I1, I2, I3. My goal is to get them into a __m256i register containing I0, I0 + 1, I1, I1 + 1, I2, I2 + 1, I3, I3 + 1. The hard part is getting them into the 256-bit register as I0, I0, I1, I1, I2, I2, I3, I3, after which I can just add a register containing 0, 1, 0, 1, 0, 1, 0, 1. I found the intrinsic _mm256_castsi128_si256, which lets me load the 4 integers into the lower 128 bits of the 256-bit register, but I'm
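One way to finish the job from there (my sketch, not necessarily what the answers suggest) is a cross-lane permute that duplicates each 32-bit element:

    #include <immintrin.h>

    /* idx points at 4 aligned ints I0..I3; returns {I0, I0+1, I1, I1+1, ...}. */
    __m256i expand_indices(const int *idx)
    {
        __m128i v4  = _mm_load_si128((const __m128i *)idx);
        __m256i v8  = _mm256_castsi128_si256(v4);
        __m256i dup = _mm256_permutevar8x32_epi32(
                          v8, _mm256_setr_epi32(0, 0, 1, 1, 2, 2, 3, 3));
        return _mm256_add_epi32(dup, _mm256_setr_epi32(0, 1, 0, 1, 0, 1, 0, 1));
    }

The cast leaves the upper 128 bits undefined, but that is harmless here because the permute only reads elements 0-3.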

Array Error - Access violation reading location 0xffffffff

风格不统一 submitted on 2019-12-05 23:15:47
I have previously used SIMD operators to improve the efficiency of my code; however, I am now facing a new error which I cannot resolve. For this task, speed is paramount. The size of the array will not be known until the data is imported, and may be very small (100 values) or enormous (10 million values). For the latter case the code works fine, but I encounter an error when I use fewer than 130036 array values. Does anyone know what is causing this issue and how to resolve it? I have attached the (tested) code involved, which will be used later in a more complicated function. The
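The attached code is cut off above, but a crash at 0xffffffff in SIMD loops is very often an out-of-bounds read when the array length is not a multiple of the vector width. The usual guard looks like this (a generic sketch, not the poster's code):

    #include <xmmintrin.h>
    #include <stddef.h>

    void scale(float *a, size_t n, float k)
    {
        __m128 vk = _mm_set1_ps(k);
        size_t i = 0;
        for (; i + 4 <= n; i += 4)   /* only complete 4-float groups */
            _mm_storeu_ps(a + i, _mm_mul_ps(_mm_loadu_ps(a + i), vk));
        for (; i < n; ++i)           /* scalar remainder */
            a[i] *= k;
    }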

Help me improve some more SSE2 code

烈酒焚心 submitted on 2019-12-05 22:01:08
I am looking for some help to improve this bilinear scaling SSE2 code on Core 2 CPUs. On my Atom N270 and on an i7 this code is about 2x faster than the MMX code, but on Core 2 CPUs it is only equal to the MMX code. Code follows:

    void ConversionProcess::convert_SSE2(BBitmap *from, BBitmap *to)
    {
        uint32 fromBPR, toBPR, fromBPRDIV4, x, y, yr, xr;
        ULLint start = rdtsc();
        ULLint stop;
        if (from && to) {
            uint32 width, height;
            width = from->Bounds().IntegerWidth() + 1;
            height = from->Bounds().IntegerHeight() + 1;
            uint32 toWidth, toHeight;
            toWidth = to->Bounds().IntegerWidth() + 1;
            toHeight = to-
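The listing is truncated above. For context, the arithmetic heart of bilinear scaling is a weighted average of neighboring pixels; a minimal, self-contained SSE2 sketch of one such blend (my example, unrelated to the poster's BBitmap code), operating on 8-bit channels unpacked to 16-bit lanes:

    #include <emmintrin.h>

    /* result = (a*(256-w) + b*w) / 256 per lane, with w in [0, 256].
       Lanes must hold 8-bit channel values (0..255) so the products
       fit in 16 bits. */
    static inline __m128i blend_u16(__m128i a, __m128i b, int w)
    {
        __m128i vw  = _mm_set1_epi16((short)w);
        __m128i viw = _mm_set1_epi16((short)(256 - w));
        __m128i s   = _mm_add_epi16(_mm_mullo_epi16(a, viw),
                                    _mm_mullo_epi16(b, vw));
        return _mm_srli_epi16(s, 8);
    }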