avx2

Is it really efficient to use Karatsuba algorithm in 64-bit x 64-bit multiplication?

送分小仙女 submitted on 2019-12-01 18:15:40
Question: I work with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part as fast as possible. Since AVX2 has no such instruction, is it reasonable for me to use the Karatsuba algorithm for efficiency and speed?

Answer 1: No. On modern architectures the crossover at which Karatsuba beats schoolbook multiplication is usually somewhere between 8 and 24 machine words (e.g. between 512 and 1536 bits on x86_64). For fixed sizes, the threshold is at the smaller end of that range, and the new ADCX/ADOX instructions likely bring it in somewhat further for scalar
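
For the 64 x 64 -> high-64 case itself, the schoolbook route maps directly onto AVX2: split each operand into 32-bit halves and combine the 32 x 32 -> 64 partial products that _mm256_mul_epu32 produces. A minimal sketch, assuming unsigned operands; the helper name mulhi_epu64 is made up for this example:

#include <immintrin.h>

/* High 64 bits of an unsigned 64x64 multiply, per 64-bit lane. */
static inline __m256i mulhi_epu64(__m256i a, __m256i b) {
    const __m256i lo32 = _mm256_set1_epi64x(0xFFFFFFFF);
    __m256i a_hi = _mm256_srli_epi64(a, 32);
    __m256i b_hi = _mm256_srli_epi64(b, 32);
    __m256i lo_lo = _mm256_mul_epu32(a, b);        /* a_lo * b_lo */
    __m256i lo_hi = _mm256_mul_epu32(a, b_hi);     /* a_lo * b_hi */
    __m256i hi_lo = _mm256_mul_epu32(a_hi, b);     /* a_hi * b_lo */
    __m256i hi_hi = _mm256_mul_epu32(a_hi, b_hi);  /* a_hi * b_hi */
    /* Carry out of bits [63:32] of the full product; each term below is
       less than 2^32, so this sum cannot overflow 64 bits. */
    __m256i mid = _mm256_add_epi64(_mm256_srli_epi64(lo_lo, 32),
                                   _mm256_and_si256(lo_hi, lo32));
    mid = _mm256_add_epi64(mid, _mm256_and_si256(hi_lo, lo32));
    /* High part = hi_hi + high halves of the cross products + carry. */
    return _mm256_add_epi64(
        _mm256_add_epi64(hi_hi, _mm256_srli_epi64(mid, 32)),
        _mm256_add_epi64(_mm256_srli_epi64(lo_hi, 32),
                         _mm256_srli_epi64(hi_lo, 32)));
}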

Why are some Haswell AVX latencies advertised by Intel as 3x slower than Sandy Bridge?

狂风中的少年 submitted on 2019-12-01 15:16:23
Question: In the Intel intrinsics webapp, several operations seem to have worsened from Sandy Bridge to Haswell. For example, many insert operations like _mm256_insertf128_si256 show a cost table like the following:

Architecture   Latency   Throughput
Haswell        3         -
Ivy Bridge     1         -
Sandy Bridge   1         -

I found this difference puzzling. Is this difference because there are new instructions that replace these ones, or something that compensates for it (which ones)? Does anyone know if Skylake changes this model further?

Answer 1 (Peter Cordes): TL;DR: all lane-crossing shuffles / inserts / extracts have 3c latency
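
One way to see such a number for yourself is to time a serial dependency chain of a lane-crossing shuffle, where each result feeds the next. A rough sketch of that idea (an illustration only, not the methodology behind Intel's tables), assuming GCC/Clang with -mavx2 and POSIX clock_gettime; verify in the generated asm that the loop isn't folded away:

#include <immintrin.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    __m256i v = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const long iters = 100000000;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; ++i)
        v = _mm256_permute2x128_si256(v, v, 1);  /* each shuffle waits on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    /* Printing v keeps the chain live; ns/iter * core GHz ~ latency in cycles. */
    printf("lane0=%d  %.2f ns/iter\n", _mm256_extract_epi32(v, 0), ns / iters);
    return 0;
}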

What is the minimum version of OS X for use with AVX/AVX2?

坚强是说给别人听的谎言 submitted on 2019-12-01 11:26:16
I have an image drawing routine which is compiled multiple times for SSE, SSE2, SSE3, SSE4.1, SSE4.2, AVX and AVX2. My program dynamically dispatches one of these binary variations by checking CPUID flags. On Windows, I check the version of Windows and disable AVX/AVX2 dispatch if the OS doesn't support them. (For example, only Windows 7 SP1 or later supports AVX/AVX2.) I want to do the same thing on Mac OS X, but I'm not sure which version of OS X supports AVX/AVX2. Note that what I want to know is the minimum version of OS X for use with AVX/AVX2, not machine models which are capable of AVX
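
Independent of pinning down an OS X version, the property the Windows version check stands in for can be queried directly: the OS advertises that it saves YMM register state via the OSXSAVE flag and XCR0. A minimal sketch, assuming GCC/Clang with <cpuid.h>; availability of the _xgetbv intrinsic varies by compiler version:

#include <cpuid.h>
#include <immintrin.h>
#include <stdint.h>

/* Returns nonzero if the CPU has AVX and the OS saves XMM+YMM state. */
static int os_supports_avx(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (!(ecx & (1u << 27)) || !(ecx & (1u << 28)))  /* OSXSAVE, AVX */
        return 0;
    uint64_t xcr0 = _xgetbv(0);                      /* read XCR0 */
    return (xcr0 & 0x6) == 0x6;                      /* XMM and YMM state enabled */
}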

Efficient way to set first N or last N bits of __m256i to 1, the rest to 0

本小妞迷上赌 submitted on 2019-12-01 03:53:38
Question: How can I efficiently set the first N bits or the last N bits of an __m256i to 1 with AVX2, setting the rest to 0? These are 2 separate operations, for the tail and head of a bit range, when the range may start and end in the middle of an __m256i value. The parts of the range occupying full __m256i values are processed with all-0 or all-1 masks.

Answer 1: The AVX2 shift instructions vpsllvd and vpsrlvd have the nice property that shift counts greater than or equal to 32 lead to zeroed integers within the ymm register. In other words, the shift counts are not masked, in contrast to the shift counts for the x86 scalar shift
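
Building on that saturation property, here is a minimal sketch for the first-N-bits case (where "first" means starting at bit 0 of the lowest 32-bit lane; the last-N variant mirrors this with vpsllvd; the helper name first_n_bits_mask is made up):

#include <immintrin.h>

/* Mask with the lowest n bits set, 0 <= n <= 256. */
static inline __m256i first_n_bits_mask(int n) {
    const __m256i lane_base = _mm256_setr_epi32(0, 32, 64, 96, 128, 160, 192, 224);
    __m256i t = _mm256_sub_epi32(_mm256_set1_epi32(n), lane_base); /* bits wanted per lane */
    t = _mm256_max_epi32(t, _mm256_setzero_si256());  /* clamp negatives to 0 */
    t = _mm256_min_epi32(t, _mm256_set1_epi32(32));   /* clamp to at most 32 */
    /* all-ones >> (32 - t): a count of 0 leaves a lane all-ones, and a
       count of 32 zeroes it, thanks to vpsrlvd's unmasked shift counts. */
    __m256i shift = _mm256_sub_epi32(_mm256_set1_epi32(32), t);
    return _mm256_srlv_epi32(_mm256_set1_epi32(-1), shift);
}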

Fallback implementation for conflict detection in AVX2

喜你入骨 submitted on 2019-12-01 03:31:29
AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector where, for every element in a, a bit is set for each earlier element with the same value. Is there a way to do something similar in AVX2? I'm not interested in the exact bits, I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know if a scatter would conflict. Basically I need an AVX2 equivalent for __m256i detect_conflict(__m256i a) { __m256i cd = _mm256_conflict_epi32(a); return _mm256_cmpgt_epi32(cd, _mm256_set1_epi32(0)); } The only way I could think of is to use
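
For the narrower question of whether a scatter would conflict at all, here is a sketch that compares a against rotations of itself: rotating by 1 through 4 lanes tests every unordered pair of the 8 elements, since rotations by k and by 8-k test the same pairs. The helper name has_conflict_epi32 is made up:

#include <immintrin.h>

/* Returns nonzero if any two 32-bit elements of a are equal. */
static inline int has_conflict_epi32(__m256i a) {
    const __m256i rot1 = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0);
    __m256i any = _mm256_setzero_si256();
    __m256i r = a;
    for (int k = 0; k < 4; ++k) {
        r = _mm256_permutevar8x32_epi32(r, rot1);        /* rotate lanes by one */
        any = _mm256_or_si256(any, _mm256_cmpeq_epi32(a, r));
    }
    return !_mm256_testz_si256(any, any);                /* any equal pair? */
}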

How can I add together two SSE registers

南笙酒味 submitted on 2019-12-01 00:48:01
Question: I have two SSE registers (one register is 128 bits) and I want to add them up. I know how I can add the corresponding words in them; for example, I can do it with _mm_add_epi16 if I use 16-bit words in the registers. But what I want is something like _mm_add_epi128 (which does not exist), which would treat the register as one big word. Is there any way to perform this operation, even if multiple instructions are needed? I was thinking about using _mm_add_epi64, detecting overflow in the right word and then
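
That 64-bit-add-plus-carry idea can be written out as follows; a minimal sketch, assuming SSE4.2 (for _mm_cmpgt_epi64) and that the register holds one little-endian 128-bit integer with the low qword in lane 0; the name add_epi128 is made up:

#include <immintrin.h>

static inline __m128i add_epi128(__m128i a, __m128i b) {
    __m128i sum = _mm_add_epi64(a, b);
    /* Carry out of the low qword iff (uint64_t)sum_low < (uint64_t)a_low.
       There is no unsigned 64-bit compare, so bias by 2^63, compare signed. */
    const __m128i bias = _mm_set1_epi64x((long long)0x8000000000000000ULL);
    __m128i carry = _mm_cmpgt_epi64(_mm_xor_si128(a, bias),
                                    _mm_xor_si128(sum, bias));
    /* Turn the low lane's all-ones carry into +1 in the high lane only. */
    carry = _mm_slli_si128(_mm_srli_epi64(carry, 63), 8);
    return _mm_add_epi64(sum, carry);
}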

8 bit shift operation in AVX2 with shifting in zeros

爱⌒轻易说出口 submitted on 2019-11-30 18:20:59
Is there any way to rebuild the _mm_slli_si128 instruction in AVX2 to shift an __m256i register by x bytes? _mm256_slli_si256 seems to just execute two _mm_slli_si128 operations, on a[127:0] and a[255:128] separately. The left shift should work on a __m256i like this: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ..., 32] -> [2, 3, 4, 5, 6, 7, 8, 9, ..., 0]. I saw in a thread that it is possible to create a shift with _mm256_permutevar8x32_ps for 32-bit granularity, but I need a more generic solution to shift by x bytes. Has anybody already got a solution for this problem? OK, I implemented a function that can shift left by up to 16 bytes.
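
A common pattern for the whole-register byte shift (the count must be a compile-time constant, since _mm256_alignr_epi8 takes an immediate) first builds [ 0 | a_low ] with vperm2i128, then stitches the lanes together with valignr. A sketch for shifting the 256-bit value left by N bytes toward higher byte addresses, with zeros shifted in; the macro name is made up, and the mirrored right shift uses permute control 0x81 with the alignr operands swapped:

#include <immintrin.h>

/* Shift the whole 256-bit value left by N bytes, 0 < N <= 16, N an
   immediate. _mm256_permute2x128_si256 with control 0x08 yields
   [ 0 | a_low ]; alignr then joins the lanes per 128-bit half. */
#define MM256_SLLI_SI256_BYTES(a, N) \
    _mm256_alignr_epi8((a), _mm256_permute2x128_si256((a), (a), 0x08), 16 - (N))

For N between 16 and 32, applying _mm256_slli_si256 with a count of N - 16 to the same [ 0 | a_low ] permute result covers the rest of the range.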