intrinsics | 易学教程

How to emulate _mm256_loadu_epi32 with gcc or clang?

阅读更多关于 How to emulate _mm256_loadu_epi32 with gcc or clang?

问题 Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32: _m256i _mm256_loadu_epi32 (void const* mem_addr); /* Instruction: vmovdqu32 ymm, m256 CPUID Flags: AVX512VL + AVX512F Description Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst. mem_addr does not need to be aligned on any particular boundary. Operation a[255:0] := MEM[mem_addr+255:mem_addr] dst[MAX:256] := 0 */ But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin

Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

阅读更多关于 Is there a more efficient way to broadcast 4 contiguous doubles into 4 YMM registers?

问题 In a piece of C++ code that does something similar to (but not exactly) matrix multiplication, I load 4 contiguous doubles into 4 YMM registers like this: # a is a 64-byte aligned array of double __m256d b0 = _mm256_broadcast_sd(&b[4*k+0]); __m256d b1 = _mm256_broadcast_sd(&b[4*k+1]); __m256d b2 = _mm256_broadcast_sd(&b[4*k+2]); __m256d b3 = _mm256_broadcast_sd(&b[4*k+3]); I compiled the code with gcc-4.8.2 on a Sandy Bridge machine. Hardware event counters (Intel PMU) suggests that the CPU

How to OR all lane of a NEON vector

阅读更多关于 How to OR all lane of a NEON vector

问题 I want to use NEON intrinsics to optimize the following code. uint32x4_t c1; // 4 elements, each element is 0 or 1 uint32x4_t c2; // 4 elements, each element is 0 or 1 uint8_t pack = 0; // unsigned char, for result /* some code /* // need optimizing pack |= (vgetq_lane_u32(c1, 0); pack |= (vgetq_lane_u32(c1, 1) << 1; pack |= (vgetq_lane_u32(c1, 2) << 2; pack |= (vgetq_lane_u32(c1, 3) << 3; pack |= (vgetq_lane_u32(c2, 0) << 4; pack |= (vgetq_lane_u32(c2, 1) << 5; pack |= (vgetq_lane_u32(c2, 2)

Segmentation fault (core dumped) when using avx on an array allocated with new[]

阅读更多关于 Segmentation fault (core dumped) when using avx on an array allocated with new[]

问题 When I run this code in visual studio 2015, the code works correctly.But the code generates the following error in codeblocks : Segmentation fault(core dumped). I also ran the code in ubuntu with same error. #include <iostream> #include <immintrin.h> struct INFO { unsigned int id = 0; __m256i temp[8]; }; int main() { std::cout<<"Start AVX..."<<std::endl; int _size = 100; INFO *info = new INFO[_size]; for (int i = 0; i<_size; i++) { for (int k = 0; k < 8; k++) { info[i].temp[k] = _mm256_setr

Using % with SSE2?

阅读更多关于 Using % with SSE2?

问题 Here's the code I'm trying to convert to SSE2: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; double *left = audioLeft; double *right = audioRight; double phase = 0.0; double bp0 = mNoteFrequency * mHostPitch; for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) { // some other code (that will use phase) phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI); while (phase >= TWOPI) { phase -= TWOPI; } } Here's what I

Using % with SSE2?

阅读更多关于 Using % with SSE2?

Using % with SSE2?

阅读更多关于 Using % with SSE2?

_mm_alignr_epi8 (PALIGNR) equivalent in AVX2

阅读更多关于 _mm_alignr_epi8 (PALIGNR) equivalent in AVX2

问题 In SSE3, the PALIGNR instruction performs the following: PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination. I'm currently in the midst of porting my SSE4 code to use AVX2 instructions and working on 256bit registers instead of 128bit. Naively, I believed that the

Intel SIMD - How can I check if an __m256* contains any non-zero values

阅读更多关于 Intel SIMD - How can I check if an __m256* contains any non-zero values

问题 I am using the Microsoft Visual Studio compiler. I am trying to find out if a 256 bit vector contains any non-zero values. I have tried res_simd = ! _mm256_testz_ps(*pSrc1, *pSrc1); but it does not work. 回答1: _mm256_testz_ps just tests the sign bits - in order to test the values you'll need to compare against 0 and then extract the resulting mask, e.g. __m256 vcmp = _mm256_cmp_ps(*pSrc1, _mm256_set1_ps(0.0f), _CMP_EQ_OQ); int mask = _mm256_movemask_ps(vcmp); bool any_nz = mask != 0xff; 来源：

Intrinsic to count trailing zero bits in 64-bit integers?

阅读更多关于 Intrinsic to count trailing zero bits in 64-bit integers?

问题 this is sort of a follow up on some previous questions on bit manipulation. I modified the code from this site to enumerate strings with K of N bits set (x is the current int64_t with K bits set, and at the end of this code it is the lexicographically next integer with K bits set): int64_t b, t, c, m, r,z; b = x & -x; t = x + b; c = x^t; // was m = (c >> 2)/b per link z = __builtin_ctz(x); m = c >> 2+z; x = t|m; The modification using __builtin_ctz() works fine as long as the least