intrinsics

Count leading zeros in __m256i word

白昼怎懂夜的黑 submitted on 2019-12-06 00:53:03
Question: I'm tinkering around with AVX2 instructions and I'm looking for a fast way to count the number of leading zeros in a __m256i word (which has 256 bits). So far, I have figured out the following way: // Computes the number of leading zero bits. // Here, avx_word is of type __m256i. if (!_mm256_testz_si256(avx_word, avx_word)) { uint64_t word = _mm256_extract_epi64(avx_word, 0); if (word > 0) return (__builtin_clzll(word)); word = _mm256_extract_epi64(avx_word, 1); if (word > 0) return (_
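
The lane-by-lane idea in the question can be sketched in scalar form. This is only an illustration under stated assumptions: `count_leading_zeros_256` is a hypothetical name, the four 64-bit words are assumed to have been extracted from the vector already, and `words[0]` is assumed to hold the most significant bits, which matches the order in which the question tests its lanes. `__builtin_clzll` is the GCC/Clang builtin the question already uses; its result is undefined for a zero argument, hence the explicit non-zero check.

```cpp
#include <cstdint>

// Scalar sketch: counts leading zero bits of a 256-bit value held as four
// 64-bit words. Assumption: words[0] is the most significant word.
int count_leading_zeros_256(const std::uint64_t words[4]) {
    for (int i = 0; i < 4; ++i) {
        if (words[i] != 0) {
            // Each fully zero word before this one contributes 64 zero bits.
            return 64 * i +
                   __builtin_clzll(static_cast<unsigned long long>(words[i]));
        }
    }
    return 256;  // every bit is zero
}
```

A vectorized variant would replace the loop with lane compares, but this version is handy as a reference to test one against.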

Using % with SSE2?

懵懂的女人 submitted on 2019-12-05 20:52:15
Here's the code I'm trying to convert to SSE2: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; double *left = audioLeft; double *right = audioRight; double phase = 0.0; double bp0 = mNoteFrequency * mHostPitch; for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) { // some other code (that will use phase) phase += std::clamp(mRadiansPerSample * (bp0 * pB[sampleIndex] + pC[sampleIndex]), 0.0, PI); while (phase >= TWOPI) { phase -= TWOPI; } } Here's what I've achieved: double *pA = a; double *pB = b[voiceIndex]; double *pC = c[voiceIndex]; double *left =
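
One observation about the loop above: each increment is clamped to [0, PI], so a phase that starts below TWOPI ends up below 3*PI after one step, and the `while` loop can run at most once per sample. That means it can be replaced by a single compare-and-subtract, which is the shape that maps onto SSE2 compare/mask/subtract intrinsics such as `_mm_cmpge_pd`, `_mm_and_pd` and `_mm_sub_pd`. The sketch below stays scalar and assumes the phase starts in [0, TWOPI); `advance_phase`, `kPi` and `kTwoPi` are hypothetical names, and `std::min`/`std::max` stand in for `std::clamp` so the code also builds before C++17.

```cpp
#include <algorithm>

constexpr double kPi = 3.14159265358979323846;
constexpr double kTwoPi = 2.0 * kPi;

// Advances the phase by an increment clamped to [0, pi] and wraps it once.
// Assumption: phase is in [0, 2*pi) on entry, so one subtraction is enough.
inline double advance_phase(double phase, double increment) {
    phase += std::min(std::max(increment, 0.0), kPi);
    // Select-style wrap: subtract 2*pi only when the phase reached it.
    phase -= (phase >= kTwoPi) ? kTwoPi : 0.0;
    return phase;
}
```

Note that the phase is still carried from one sample to the next, so this only removes the modulo-style loop; it does not by itself make the samples independent.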

Determine the minimum across SIMD lanes of __m256 value

做~自己de王妃 submitted on 2019-12-05 17:03:53
I understand that operations across SIMD lanes should generally be avoided. However, sometimes it has to be done. I am using AVX2 intrinsics, and have 8 floating point values in an __m256. I want to know the lowest value in this vector, and to complicate matters: also in which slot this was. My current solution makes a round trip to memory, which I don't like: float closestvals[8]; _mm256_store_ps( closestvals, closest8 ); float closest = closestvals[0]; int closestidx = 0; for ( int k=1; k<8; ++k ) { if ( closestvals[k] < closest ) { closest = closestvals[ k ]; closestidx = k; } } What would

Temporary/“non-addressable” fixed-size array?

社会主义新天地 submitted on 2019-12-05 16:04:06
The title is for lack of a better name, and I am not sure I managed to explain myself clearly enough. I am looking for a way to access a "data type" via an index, but not force the compiler to keep it in an array. The problem occurs when writing low-level code based on SSE/AVX intrinsics. For ease of programming I would like to write code like the following, with fixed-length loops over "registers" (data type __m512): inline void load(__m512 *vector, const float *in) { for(int i=0; i<24; i++) vector[i] = _mm512_load_ps((in + i*SIMD_WIDTH)); } // similarly: inline add(...) and inline store(...)

Efficiently compute max of an array of 8 elements in ARM NEON

时光毁灭记忆、已成空白 submitted on 2019-12-05 14:26:13
How do I find the max element in an array of 8 bytes, 8 shorts or 8 ints? I may need just the position of the max element, the value of the max element, or both of them. For example: unsigned FindMax8(const uint32_t src[8]) // returns position of max element { unsigned ret = 0; for (unsigned i=0; i<8; ++i) { if (src[i] > src[ret]) ret = i; } return ret; } At -O2 clang unrolls the loop but it does not use NEON, which should give a decent perf boost (because it eliminates many data-dependent branches?) For 8 bytes and 8 shorts the approach should be simpler, as the entire array can be loaded into a single q-register

Generate call to intrinsic using LLVM C API

霸气de小男生 submitted on 2019-12-05 14:13:38
I'm working on some code that uses the LLVM C API. How do I use intrinsics, such as llvm.cos.f64 or llvm.sadd.with.overflow.i32? Whenever I try to do it by generating a global with LLVMAddGlobal (with the correct type signature), I just get this error message during the JIT linking stage: LLVM ERROR: Could not resolve external global address: llvm.cos.f64 I'm not using the LLVM C++ interface, so the advice in LLVM insert intrinsic function Cos does not seem to apply. I presume I need something like Intrinsic::getDeclaration, but I can't seem to find it. Am I missing something obvious? No

Matrix accessing and multiplication optimization for cpu

别来无恙 submitted on 2019-12-05 14:06:23
I'm making an intrinsic-optimized matrix wrapper in Java (with the help of JNI). I need confirmation of this approach; can you give some hints about matrix optimizations? What I'm going to implement is: a matrix can be represented as four sets of buffers/arrays, one for horizontal access, one for vertical access, one for diagonal access, and a command buffer to compute elements of the matrix only when needed. Here is an illustration. Matrix signature: 0 1 2 3 4 5 6 7 8 9 1 3 3 5 2 9 First (horizontal) set: horSet[0]={0,1,2,3} horSet[1]={4,5,6,7} horSet[2]={8,9,1,3} horSet[3]={3,5,2,9} Second (vertical) set:

Are built-in intrinsic functions available in Swift 3?

不问归期 submitted on 2019-12-05 13:01:32
I can see various built-in functions (like __builtin_popcount, __builtin_clz, etc.) in the Xcode auto-completion pop-up. I'm not sure where these are getting picked up from though. Command-clicking doesn't lead to a Swift definition or any documentation. Are any __builtin_* or equivalent intrinsic functions available in Swift 3 and if so, what modules do I need to include and how can I call them? Source: https://stackoverflow.com/questions/41353482/are-built-in-intrinsic-functions-available-in-swift-3

Which is the most efficient way to extract an arbitrary range of bits from a contiguous sequence of words?

依然范特西╮ submitted on 2019-12-05 12:39:39
Suppose we have a std::vector, or any other sequence container (sometimes it will be a deque), which stores uint64_t elements. Now, let's see this vector as a sequence of size() * 64 contiguous bits. I need to find the word formed by the bits in a given [begin, end) range, given that end - begin <= 64 so it fits in a word. The solution I have right now finds the two words whose parts will form the result, and separately masks and combines them. Since I need this to be as efficient as possible, I've tried to code everything without any if branch to not cause branch mispredictions, so for
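
The "find two words, mask and combine" approach described above can be sketched as a plain reference version. This is not the branch-free code the question is after: it keeps one branch for the case where the range straddles two words, which also avoids reading past the end of the container and shifting by 64 (undefined in C++). `extract_bits` is a hypothetical name, and the bit numbering is an assumption: bit `i` of the sequence is taken to be bit `i % 64` of `words[i / 64]`, least significant bit first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference sketch: returns the bits in [begin, end) as one 64-bit word.
// Assumption: bit i of the sequence is bit (i % 64) of words[i / 64].
std::uint64_t extract_bits(const std::vector<std::uint64_t>& words,
                           std::size_t begin, std::size_t end) {
    assert(begin <= end && end - begin <= 64);
    assert(end <= words.size() * 64);
    const std::size_t len = end - begin;
    if (len == 0) return 0;

    const std::size_t index = begin / 64;   // word holding the first bit
    const std::size_t offset = begin % 64;  // position of that bit in the word

    std::uint64_t result = words[index] >> offset;
    if (offset + len > 64) {
        // The range spills into the next word. offset is non-zero here,
        // so the shift amount stays within 1..63.
        result |= words[index + 1] << (64 - offset);
    }
    // Keep only the low len bits; a full-width mask needs special handling.
    const std::uint64_t mask =
        (len == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << len) - 1);
    return result & mask;
}
```

A version like this is mainly useful as a baseline for checking an optimized, branch-free variant against.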

What does “vperm v0,v0,v0,v17” with unused v0 do?

主宰稳场 submitted on 2019-12-05 11:17:42
I'm working on an SHA-256 implementation using Power8 built-ins. The performance is off a bit. I estimate it is off by about 2 cycles per byte (cpb). The C/C++ code to perform SHA on a block looks like so: // Schedule 64-byte message SHA256_SCHEDULE(W, data); uint32x4_p8 a = abcd, e = efgh; uint32x4_p8 b = VectorShiftLeft<4>(a); uint32x4_p8 f = VectorShiftLeft<4>(e); uint32x4_p8 c = VectorShiftLeft<4>(b); uint32x4_p8 g = VectorShiftLeft<4>(f); uint32x4_p8 d = VectorShiftLeft<4>(c); uint32x4_p8 h = VectorShiftLeft<4>(g); for (unsigned int i=0; i<64; i+=4) { const uint32x4_p8 k =