intrinsics

Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap

与世无争的帅哥 submitted on 2019-12-08 20:22:57
Question: (Editor's note: this question was originally "How should one access the m128i_i8 member, or members in general, of the __m128i object?", trying to use an MSVC-specific method on GCC's definition of __m128i. But this was an XY problem, and the accepted answer addresses the XY problem here. Another answer does answer this question.) I realize that Microsoft suggests against directly accessing the members of these objects, but I need to set them and the documentation is sorely lacking. I continue
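The task named in the title (low bit of each bool byte to a packed bitmap) can be sketched with plain SSE2. This is only an illustration, not the accepted answer's code: the function name `pack_bools16` is hypothetical, and it assumes 16 input bytes and that bit i of the result should correspond to byte i.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Pack the low bit of 16 bool bytes into a 16-bit bitmap
 * (bit i of the result is bools[i] & 1).
 * pack_bools16 is a hypothetical name used for illustration. */
static inline uint16_t pack_bools16(const uint8_t *bools)
{
    __m128i v = _mm_loadu_si128((const __m128i *)bools);
    /* There is no 8-bit shift, so shift 16-bit lanes left by 7.
     * Bit 0 of every byte lands in bit 7 of the same byte; bits that
     * cross into the neighbouring byte end up below its bit 7. */
    v = _mm_slli_epi16(v, 7);
    /* movemask collects bit 7 of each byte into an integer. */
    return (uint16_t)_mm_movemask_epi8(v);
}
```

Because only bit 7 of each byte is read after the shift, bytes holding values other than 0 or 1 are reduced to their low bit rather than treated as "non-zero means true".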

SSE3 intrinsics: How to find the maximum of a large array of floats

不羁岁月 submitted on 2019-12-08 19:08:56
Question: I have the following code to find the maximum value:

    int length = 2000;
    float *data; // data is allocated and initialized
    float max = 0.0;
    for(int i = 0; i < length; i++) {
        if(data[i] > max) {
            max = data[i];
        }
    }

I tried vectorizing it by using SSE3 intrinsics, but I am kind of stuck on how I should do the comparison.

    int length = 2000;
    float *data; // data is allocated and initialized
    float max = 0.0;
    // for time being just assume that length is always mod 4
    for(int i = 0; i < length; i+=4) { _
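One common way to approach the comparison is to avoid an explicit compare and keep a vector of four running maxima with `_mm_max_ps`, reducing it to a scalar at the end. The sketch below is an assumption about what the asker wants, not the question's own code: `max_sse` is a hypothetical name, the length is assumed to be a non-zero multiple of 4 (as in the question), and the running maximum starts from the first four elements rather than from 0.0 so that all-negative inputs work.

```c
#include <xmmintrin.h>  /* SSE */
#include <stddef.h>

/* Maximum of a float array. Assumes length > 0 and length % 4 == 0.
 * max_sse is a hypothetical name used for illustration. */
static inline float max_sse(const float *data, size_t length)
{
    /* Four running maxima, seeded with the first four elements. */
    __m128 vmax = _mm_loadu_ps(data);
    for (size_t i = 4; i < length; i += 4)
        vmax = _mm_max_ps(vmax, _mm_loadu_ps(data + i));

    /* Horizontal reduction: swap the 64-bit halves, take the max ... */
    __m128 shuf = _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(1, 0, 3, 2));
    vmax = _mm_max_ps(vmax, shuf);
    /* ... then swap adjacent lanes and take the max again. */
    shuf = _mm_shuffle_ps(vmax, vmax, _MM_SHUFFLE(2, 3, 0, 1));
    vmax = _mm_max_ps(vmax, shuf);

    /* Every lane now holds the overall maximum; return lane 0. */
    return _mm_cvtss_f32(vmax);
}
```

Only SSE instructions are used here; nothing in this sketch requires SSE3.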

Computing 8 horizontal sums of eight AVX single-precision floating-point vectors

大城市里の小女人 submitted on 2019-12-08 17:43:34
Question: I have 8 AVX vectors containing 8 floats each (64 floats in total) and I want to sum the elements in each vector together (basically perform eight horizontal sums). For now, I'm using the following code:

    __m256 HorizontalSums(__m256 v0, __m256 v1, __m256 v2, __m256 v3, __m256 v4, __m256 v5, __m256 v6, __m256 v7) {
        // transpose
        const __m256 t0 = _mm256_unpacklo_ps(v0, v1);
        const __m256 t1 = _mm256_unpackhi_ps(v0, v1);
        const __m256 t2 = _mm256_unpacklo_ps(v2, v3);
        const __m256 t3 = _mm256_unpackhi

Converting from __m128 to __m128i results in wrong value

ぃ、小莉子 submitted on 2019-12-08 11:37:16
Question: I need to convert a float vector (__m128) to an integer vector (__m128i), and I am using _mm_cvtps_epi32, but I am not getting the expected value. Here is a very simple example:

    __m128 test = _mm_set1_ps(4.5f);
    __m128i test_i = _mm_cvtps_epi32(test);

The debugger output I get:

    (lldb) po test
    ([0] = 4.5, [1] = 4.5, [2] = 4.5, [3] = 4.5)
    (lldb) po test_i
    ([0] = 17179869188, [1] = 17179869188)
    (lldb)

As you can see, the resulting integer is... 17179869188? From 4.5? And why are there only two
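A likely reading of that output is that the debugger shows the __m128i as two 64-bit lanes: 17179869188 equals (4 << 32) | 4, which is two 32-bit values of 4 side by side. Storing the vector to an array of 32-bit integers makes the four lanes visible. The sketch below assumes the default MXCSR rounding mode (round to nearest, ties to even), under which 4.5 converts to 4; `float4_to_int4` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Convert four floats to four 32-bit integers and store them so that
 * each lane can be inspected individually.
 * float4_to_int4 is a hypothetical name used for illustration. */
static inline void float4_to_int4(const float *in, int32_t *out)
{
    __m128 f = _mm_loadu_ps(in);
    /* Rounds according to the current MXCSR rounding mode
     * (assumed here to be the default, round-to-nearest-even). */
    __m128i i = _mm_cvtps_epi32(f);
    _mm_storeu_si128((__m128i *)out, i);
}
```

If truncation toward zero is wanted instead, `_mm_cvttps_epi32` is the variant that truncates regardless of the rounding mode.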

SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation

六眼飞鱼酱① submitted on 2019-12-08 07:47:19
Question: I have started optimising my code using SSE. Essentially it is a ray tracer that processes 4 rays at a time by storing the coordinates in __m128 data types x, y, z (the coordinates for the four rays are grouped by axis). However, I have a branched statement that protects against divide by zero, which I can't seem to convert to SSE. In serial this is:

    const float d = wZ == -1.0f ? 1.0f/( 1.0f-wZ) : 1.0f/(1.0f+wZ);

Where wZ is the z-coordinate and this calculation needs to be done for all four rays.
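A branch like this is usually converted by computing both candidates for all four lanes and selecting per lane with the mask that `_mm_cmpeq_ps` produces. The following is a minimal sketch of that pattern for the serial expression above, using only SSE bitwise operations; `guarded_reciprocal` is a hypothetical name, and the selection is applied to the denominator so that a single division suffices.

```c
#include <xmmintrin.h>  /* SSE */

/* Per lane: d = (wZ == -1) ? 1/(1 - wZ) : 1/(1 + wZ).
 * guarded_reciprocal is a hypothetical name used for illustration. */
static inline __m128 guarded_reciprocal(__m128 wZ)
{
    const __m128 one = _mm_set1_ps(1.0f);
    /* All-ones in lanes where wZ == -1.0f, all-zeros elsewhere. */
    __m128 mask  = _mm_cmpeq_ps(wZ, _mm_set1_ps(-1.0f));
    __m128 minus = _mm_sub_ps(one, wZ);   /* 1 - wZ */
    __m128 plus  = _mm_add_ps(one, wZ);   /* 1 + wZ */
    /* Branchless select: denom = mask ? minus : plus. */
    __m128 denom = _mm_or_ps(_mm_and_ps(mask, minus),
                             _mm_andnot_ps(mask, plus));
    return _mm_div_ps(one, denom);
}
```

On targets with SSE4.1, the and/andnot/or sequence could be replaced by `_mm_blendv_ps(plus, minus, mask)`.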

Why is there no floating point intrinsic for `PSHUFD` instruction?

☆樱花仙子☆ submitted on 2019-12-08 01:03:43
Question: The task I'm facing is to shuffle one __m128 vector and store the result in another one. The way I see it, there are two basic ways to shuffle a packed floating point __m128 vector:
- _mm_shuffle_ps, which uses the SHUFPS instruction and is not necessarily the best option if you want the values from one vector only: it takes two values from the destination operand, which implies an extra move.
- _mm_shuffle_epi32, which uses the PSHUFD instruction that seems to do exactly what is expected here and can
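Using PSHUFD on float data is possible from intrinsics by reinterpreting the vector with the cast intrinsics, which only change the type and are not expected to generate instructions. The sketch below reverses the four lanes as an arbitrary example shuffle; `reverse_ps` is a hypothetical name. Whether this is faster than `_mm_shuffle_ps` is not something this sketch establishes, since some CPUs may add latency when float data passes through an integer-domain instruction.

```c
#include <emmintrin.h>  /* SSE2 */

/* Reverse the four float lanes with PSHUFD via type-punning casts.
 * reverse_ps is a hypothetical name used for illustration. */
static inline __m128 reverse_ps(__m128 v)
{
    __m128i vi = _mm_castps_si128(v);  /* reinterpret bits as integers */
    /* Result lane 0 = source lane 3, lane 1 = lane 2, and so on. */
    vi = _mm_shuffle_epi32(vi, _MM_SHUFFLE(0, 1, 2, 3));
    return _mm_castsi128_ps(vi);       /* reinterpret bits as floats */
}
```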

Add saturate 32-bit signed ints intrinsics?

落爺英雄遲暮 submitted on 2019-12-07 16:31:14
Question: Can someone recommend a fast way to do a saturating add of 32-bit signed integers using Intel intrinsics (AVX, SSE4 ...)? I looked at the intrinsics guide and found _mm256_adds_epi16, but this seems to only add 16-bit ints. I don't see anything similar for 32 bits. The other calls seem to wrap around. Answer 1: A signed overflow will happen if (and only if) the signs of both inputs are the same, and the sign of the sum (when added with wrap-around) is different from the sign of the inputs. Using C operators: overflow =
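The overflow condition stated in the answer can be turned into a vector routine in several ways. The following is one possible SSE2 sketch written for this page, not the answer's elided code: `adds_epi32_sse2` is a hypothetical name, and the saturated value is derived from the sign of the first input (INT32_MAX for non-negative, INT32_MIN for negative).

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Saturating add of four signed 32-bit integers.
 * adds_epi32_sse2 is a hypothetical name used for illustration. */
static inline __m128i adds_epi32_sse2(__m128i a, __m128i b)
{
    __m128i sum = _mm_add_epi32(a, b);  /* wrap-around sum */
    /* Sign bit set where a and b share a sign and the sum's sign
     * differs from a's: ~(a ^ b) & (a ^ sum). */
    __m128i ovf = _mm_andnot_si128(_mm_xor_si128(a, b),
                                   _mm_xor_si128(a, sum));
    /* Broadcast that sign bit across each 32-bit lane. */
    __m128i mask = _mm_srai_epi32(ovf, 31);
    /* (a >> 31) ^ 0x7FFFFFFF is INT32_MAX for a >= 0, INT32_MIN for a < 0. */
    __m128i sat = _mm_xor_si128(_mm_srai_epi32(a, 31),
                                _mm_set1_epi32(0x7FFFFFFF));
    /* Select the saturated value in overflowing lanes, the sum elsewhere. */
    return _mm_or_si128(_mm_and_si128(mask, sat),
                        _mm_andnot_si128(mask, sum));
}
```

The same sequence should carry over to 256-bit vectors with the corresponding `_mm256_*` AVX2 integer intrinsics.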

Which is the most efficient way to extract an arbitrary range of bits from a contiguous sequence of words?

天涯浪子 submitted on 2019-12-07 11:27:43
Question: Suppose we have a std::vector, or any other sequence container (sometimes it will be a deque), which stores uint64_t elements. Now, let's see this vector as a sequence of size() * 64 contiguous bits. I need to find the word formed by the bits in a given [begin, end) range, given that end - begin <= 64 so it fits in a word. The solution I have right now finds the two words whose parts will form the result, and separately masks and combines them. Since I need this to be as efficient as
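The two-word approach described above can be sketched in portable C as follows. The bit ordering is an assumption, since the excerpt does not specify it: bit 0 of the sequence is taken to be the least significant bit of the first word. `extract_bits` is a hypothetical name, and the caller is assumed to guarantee 0 < end - begin <= 64 and that end does not exceed the total bit count.

```c
#include <stdint.h>
#include <stddef.h>

/* Return bits [begin, end) of a word array viewed as one bit string,
 * with bit 0 being the LSB of words[0] (an assumed convention).
 * Requires 0 < end - begin <= 64.
 * extract_bits is a hypothetical name used for illustration. */
static inline uint64_t extract_bits(const uint64_t *words,
                                    size_t begin, size_t end)
{
    size_t   len = end - begin;
    size_t   idx = begin / 64;             /* word holding the first bit */
    unsigned off = (unsigned)(begin % 64); /* offset inside that word */

    uint64_t result = words[idx] >> off;
    /* If the range spills into the next word, bring in its low bits.
     * off is non-zero here, so the shift count stays below 64. */
    if (off + len > 64)
        result |= words[idx + 1] << (64 - off);

    /* Mask off everything above len bits; len == 64 needs a special
     * case because shifting a 64-bit value by 64 is undefined. */
    uint64_t mask = (len == 64) ? ~UINT64_C(0)
                                : ((UINT64_C(1) << len) - 1);
    return result & mask;
}
```

With a std::vector the pointer argument would come from `data()`; a deque is not contiguous, so the two words would have to be fetched by index instead.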

Are built-in intrinsic functions available in Swift 3?

喜欢而已 submitted on 2019-12-07 11:19:40
Question: I can see various built-in functions (like __builtin_popcount, __builtin_clz, etc.) in the Xcode auto-completion pop-up. I'm not sure where these are getting picked up from, though. Command-clicking doesn't lead to a Swift definition or any documentation. Are any __builtin_* or equivalent intrinsic functions available in Swift 3, and if so, what modules do I need to include and how can I call them? Source: https://stackoverflow.com/questions/41353482/are-built-in-intrinsic-functions-available-in

Generate call to intrinsic using LLVM C API

我只是一个虾纸丫 submitted on 2019-12-07 11:12:50
Question: I'm working on some code that uses the LLVM C API. How do I use intrinsics, such as llvm.cos.f64 or llvm.sadd.with.overflow.i32? Whenever I try to do it by generating a global with LLVMAddGlobal (with the correct type signature), I just get this error message during the JIT linking stage: LLVM ERROR: Could not resolve external global address: llvm.cos.f64 I'm not using the LLVM C++ interface, so the advice in "LLVM insert intrinsic function Cos" does not seem to apply. I presume I need