SIMD

How do I efficiently look up 16 bits in a 128-bit SIMD vector? [duplicate]

Submitted by 耗尽温柔 on 2020-04-30 06:29:30
Question: This question already has answers here: SSE/SIMD shift with one-byte element size / granularity? (2 answers); How do I vectorize data_i16[0 to 15]? (1 answer). Closed 3 days ago. I'm trying to implement the strategy described in an answer to How do I vectorize data_i16[0 to 15]?; code below. The spot I'd like to fix is the for(int i=0; i<ALIGN; i++) loop. I'm new to SIMD. From what I can tell, I'd load the high/low nibble tables by writing const auto HI_TBL = _mm_load_si128((__m128i*)HighNibble)…
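The high/low-nibble table trick the excerpt refers to can be sketched in scalar C (a hedged illustration, not the asker's code; the table payload, here nibble popcounts, and the function name are hypothetical). `_mm_shuffle_epi8` can only index 16 entries, so an 8-bit lookup is split into two 16-entry tables, one indexed by the low nibble and one by the high nibble, whose partial results are then combined:

```c
#include <stdint.h>

/* Two 16-entry tables, as _mm_shuffle_epi8 would hold in one XMM register each.
   Here each table maps a nibble to its popcount (a hypothetical payload). */
static const uint8_t LO_TBL[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
static const uint8_t HI_TBL[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};

/* Scalar model of the double-pshufb per-byte lookup: index one table with
   the low nibble, the other with the high nibble, then combine (here: add). */
uint8_t nibble_lookup(uint8_t x)
{
    return LO_TBL[x & 0x0F] + HI_TBL[x >> 4];
}
```

In the SIMD version each table sits in an XMM register loaded once (e.g. with `_mm_load_si128`), and all 16 bytes of a vector are looked up in parallel with two `_mm_shuffle_epi8` calls.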

Micro Optimization of a 4-bucket histogram of a large array or list

Submitted by 旧城冷巷雨未停 on 2020-04-25 11:30:27
Question: I have a special question. I will try to describe this as accurately as possible. I am doing a very important "micro-optimization": a loop that runs for days at a time, so if I can cut the time this loop takes in half, 10 days would decrease to only 5 days, etc. The loop I have now is in the function "testbenchmark1". I have 4 indexes that I need to increase in a loop like this, but accessing an index from a list takes some extra time, as I have noticed. This is what I…
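One common fix for this pattern, sketched here in C as a hedged illustration (the function name and data layout are hypothetical, not the asker's code), is to keep the four bucket counts in separate local variables instead of indexing into a list on every iteration, so the hot loop touches only registers:

```c
#include <stddef.h>
#include <stdint.h>

/* 4-bucket histogram: values are 0..3; each bucket gets its own local
   counter, so the loop body avoids an indexed store to memory per element. */
void hist4(const uint8_t *data, size_t n, uint64_t out[4])
{
    uint64_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t b = data[i] & 3;
        c0 += (b == 0);
        c1 += (b == 1);
        c2 += (b == 2);
        c3 += (b == 3);
    }
    out[0] = c0; out[1] = c1; out[2] = c2; out[3] = c3;
}
```

A related, commonly suggested trick is to keep several replicated count arrays and sum them at the end, which avoids store-forwarding stalls when consecutive elements hit the same bucket.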

WebAssembly Series 2: Why We Need WebAssembly - An Interview with Brendan Eich

Submitted by 坚强是说给别人听的谎言 on 2020-04-13 17:07:28
On 2015-06-17, Brendan Eich, the creator of JavaScript, announced a new project: bringing new low-level primitives to the web [1], which will make it easier to compile projects written in languages such as C and C++ to run in the browser or other JavaScript environments. If this is the first time you have heard of the idea, see the basic description in "What is WebAssembly" [3]. The WebAssembly team includes members from Google, Microsoft, Mozilla, Apple, and the W3C WebAssembly Community Group. The announcement led the web development community to speculate about how WebAssembly might affect the future of JavaScript, and Eric Elliott (EE below) interviewed Brendan Eich (BE below). EE: You recently announced the new WebAssembly project on your blog: essentially an assembly language for the web, a low-level compilation target. Can you explain what it does and the motivation behind it? BE: In a sense this is a continuation of a process that started with asm.js. asm.js is a subset of JavaScript with no objects, no garbage collection, and no just-in-time compiler pauses. Its target is C/C++…

Why both? vperm2f128 (avx) vs vperm2i128 (avx2)

Submitted by 北城以北 on 2020-04-09 17:57:16
Question: AVX introduced the instruction vperm2f128 (exposed via _mm256_permute2f128_si256), while AVX2 introduced vperm2i128 (exposed via _mm256_permute2x128_si256). They both seem to be doing exactly the same thing, and their respective latencies and throughputs also seem to be identical. So why do both instructions exist? There has to be some reasoning behind that. Is there maybe something I have overlooked? Given that AVX2 operates on data structures introduced with AVX, I cannot imagine that a…
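For reference, the selector semantics the two instructions share can be modeled in scalar C (a hedged sketch; the struct and function names are mine, and 128-bit lanes are modeled as pairs of 64-bit words). Bits [1:0] of the immediate pick the low 128-bit result lane from {a.lo, a.hi, b.lo, b.hi}, bits [5:4] pick the high lane, and bits 3 and 7 zero the corresponding lane:

```c
#include <stdint.h>

/* 256-bit value as four 64-bit words: w[0..1] = low 128-bit lane,
   w[2..3] = high 128-bit lane. */
typedef struct { uint64_t w[4]; } v256;

/* Scalar model of vperm2f128/vperm2i128: select each result lane from the
   four source lanes by immediate fields, with optional zeroing. */
v256 perm2x128(v256 a, v256 b, int imm)
{
    const uint64_t *src[4] = { &a.w[0], &a.w[2], &b.w[0], &b.w[2] };
    v256 r = {{0, 0, 0, 0}};
    if (!(imm & 0x08)) {            /* imm bit 3 zeroes the low lane  */
        int s = imm & 3;
        r.w[0] = src[s][0]; r.w[1] = src[s][1];
    }
    if (!(imm & 0x80)) {            /* imm bit 7 zeroes the high lane */
        int s = (imm >> 4) & 3;
        r.w[2] = src[s][0]; r.w[3] = src[s][1];
    }
    return r;
}
```

For example, the immediate 0x21 yields {a.hi, b.lo}, the usual cross-input lane swap.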

What do you do without fast gather and scatter in AVX2 instructions?

Submitted by 一笑奈何 on 2020-04-08 09:52:11
Question: I'm writing a program to detect prime numbers. One part sieves possible candidates out with bit operations. I've written a fairly fast program, but I thought I'd see if anyone has better ideas. My program could use some fast gather and scatter instructions, but I'm limited to AVX2 hardware on the x86 architecture (I know AVX-512 has these, though I'm not sure how fast they are). #include <stdint.h> #include <immintrin.h> #define USE_AVX2 // Sieve the bits in array sieveX for later use void sieveFactors…
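The scalar form of the bit "scatter" such a sieve performs looks like the following hedged sketch (my own names, not the asker's sieveFactors): clearing every p-th bit in a packed bitmap, which is exactly the strided access pattern a hardware scatter would vectorize:

```c
#include <stdint.h>
#include <stddef.h>

/* Clear every p-th bit of a packed bitmap, starting at bit `start`.
   sieve[] holds nbits bits, 64 per word; bit i lives at sieve[i>>6], bit (i&63). */
void sieve_factor(uint64_t *sieve, size_t nbits, size_t p, size_t start)
{
    for (size_t i = start; i < nbits; i += p)
        sieve[i >> 6] &= ~(1ULL << (i & 63));
}
```

Without gather/scatter, AVX2 versions of this typically precompute per-lane clear masks and merge several strides' worth of updates into one wide store, trading extra ALU work for fewer read-modify-write memory operations.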

Matrix transpose and population count

Submitted by 点点圈 on 2020-03-16 07:27:31
Question: I have a square boolean matrix M of size N, stored by rows, and I want to count the number of bits set to 1 in each column. For instance, for N=4: 1101 0101 0001 1001. M is stored as { {1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} }; result = { 2, 2, 0, 4 }. I can obviously transpose the matrix M into a matrix M' and popcount each row of M'. Good algorithms exist for matrix transposition and popcounting through bit manipulation. My question is: would it be possible to "merge" such algorithms into a…
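One way such a merge can work, sketched here in C as a hedged illustration (this is the carry-save-adder / "vertical counter" idea; names and the 8-column limit are my choices), is to keep one bit-plane per binary digit of the per-column count and update the planes with full-adder logic, so the transpose is never materialized:

```c
#include <stdint.h>

/* Per-column popcount of up to 7 rows of an up-to-8-column bit matrix.
   c0,c1,c2 are bit-planes: bit j of (4*c2 + 2*c1 + c0) is column j's count,
   where column j is bit position j (counted from the right). */
void col_popcount(const uint8_t *rows, int nrows, uint8_t counts[8])
{
    uint8_t c0 = 0, c1 = 0, c2 = 0;
    for (int i = 0; i < nrows; i++) {
        uint8_t r = rows[i];
        uint8_t carry  = c0 & r;     c0 ^= r;      /* full adder, plane 0 */
        uint8_t carry2 = c1 & carry; c1 ^= carry;  /* plane 1 */
        c2 ^= carry2;                /* plane 2; fine for fewer than 8 rows */
    }
    for (int j = 0; j < 8; j++)
        counts[j] = (uint8_t)(((c0 >> j) & 1) + 2 * ((c1 >> j) & 1)
                              + 4 * ((c2 >> j) & 1));
}
```

For the example above, rows {0b1101, 0b0101, 0b0001, 0b1001} give counts {4, 0, 2, 2} in bit order 0..3, i.e. {2, 2, 0, 4} read left-to-right as in the question. The same full-adder update vectorizes directly over 128- or 256-bit rows.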

BMI for generating masks with AVX512

Submitted by 一世执手 on 2020-02-15 07:41:03
Question: I was inspired by this link https://www.sigarch.org/simd-instructions-considered-harmful/ to look into how AVX-512 performs. My idea was that the clean-up loop after the main loop could be removed using AVX-512 mask operations. Here is the code I am using: void daxpy2(int n, double a, const double x[], double y[]) { __m512d av = _mm512_set1_pd(a); int r = n&7, n2 = n - r; for(int i=-n2; i<0; i+=8) { __m512d yv = _mm512_loadu_pd(&y[i+n2]); __m512d xv = _mm512_loadu_pd(&x[i+n2]); yv = _mm512_fmadd…
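The masked-tail idea behind this can be modeled in plain C (a hedged sketch of the semantics, not the AVX-512 code itself; the function name is mine): every block of 8 lanes uses a bitmask that is all-ones except in the final, partial block, so no scalar clean-up loop is needed:

```c
/* daxpy with an AVX-512-style tail mask, modeled in scalar C:
   y[i] = a*x[i] + y[i], processed in blocks of 8 "lanes". */
void daxpy_masked(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i += 8) {
        /* full mask 0xFF for whole blocks, partial mask for the tail */
        unsigned mask = (i + 8 <= n) ? 0xFFu : ((1u << (n - i)) - 1u);
        for (int k = 0; k < 8; k++)
            if (mask & (1u << k))
                y[i + k] = a * x[i + k] + y[i + k];
    }
}
```

In real AVX-512 code the tail `__mmask8` can be built with plain integer ops (e.g. `(1u << (n & 7)) - 1`) and applied via `_mm512_mask_loadu_pd` / `_mm512_mask_storeu_pd`, which is what makes dropping the clean-up loop possible.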