intrinsics

Equivalent of InterlockedIncrement in Linux/gcc

懵懂的女人 posted on 2019-12-18 11:04:10
Question: This is probably a very simple question (and possibly a duplicate), but I was unable to find an answer. The Win32 API provides a very handy set of atomic operations (as intrinsics), such as InterlockedIncrement, which emits lock add x86 code. Likewise, InterlockedCompareExchange is mapped to lock cmpxchg. I want to do the same on Linux with gcc. Since I'm working in 64-bit, it's impossible to use inline assembly. Are there intrinsics for gcc?

Answer 1: GCC Atomic Built-ins

Source: https://stackoverflow.com/questions/2125937
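To make the pointer to the GCC atomic built-ins a little more concrete, here is a minimal sketch using the legacy __sync family. The wrapper names atomic_increment and atomic_compare_exchange are made up for this example; only __sync_add_and_fetch and __sync_val_compare_and_swap come from GCC. The argument order of the second wrapper is chosen to mirror the Win32 call, which is an assumption about what the asker wants.

```c
#include <stdint.h>

/* Hypothetical wrapper: atomically adds 1 and returns the NEW value,
 * which is what InterlockedIncrement returns on Win32. */
static inline int32_t atomic_increment(volatile int32_t *p) {
    return __sync_add_and_fetch(p, 1);
}

/* Hypothetical wrapper: if *p == comparand, stores exchange into *p.
 * Always returns the value *p held before the operation, in the same
 * (destination, exchange, comparand) order as InterlockedCompareExchange. */
static inline int32_t atomic_compare_exchange(volatile int32_t *p,
                                              int32_t exchange,
                                              int32_t comparand) {
    /* Note the GCC built-in takes (ptr, oldval, newval). */
    return __sync_val_compare_and_swap(p, comparand, exchange);
}
```

Newer code may prefer the __atomic built-ins or C11 <stdatomic.h>, which let the caller pick a memory order; the __sync versions act as full barriers.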

How to use if condition in intrinsics

╄→гoц情女王★ posted on 2019-12-18 05:23:15
Question: I want to compare two floating-point variables using intrinsics. If the comparison is true, do one thing; otherwise do something else, just like a normal if..else condition. Is there any way to do this using intrinsics?

    // normal code
    vector<float> v1, v2;
    for (int i = 0; i < v1.size(); ++i)
        if (v1[i] < v2[i]) {
            // do something
        } else {
            // do something
        }

How can I do this using SSE2 or AVX?

Answer 1: SIMD conditional operations are done with branchless techniques. You use a packed-compare instruction to get a vector
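One common branchless pattern, sketched here under assumptions, is to compute both outcomes for every lane and then blend them with the compare mask. The function name select_lt and the two placeholder operations (a + b when a < b, a - b otherwise) are invented for illustration; the question does not say what "do something" is.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE float intrinsics */
#include <stddef.h>

/* Hypothetical example: out[i] = (a[i] < b[i]) ? a[i] + b[i] : a[i] - b[i] */
void select_lt(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        /* Each lane becomes all-ones where a < b, all-zeros otherwise. */
        __m128 mask = _mm_cmplt_ps(va, vb);
        __m128 then_v = _mm_add_ps(va, vb);  /* result for the "if" side   */
        __m128 else_v = _mm_sub_ps(va, vb);  /* result for the "else" side */
        /* Blend: (mask & then_v) | (~mask & else_v) */
        __m128 r = _mm_or_ps(_mm_and_ps(mask, then_v),
                             _mm_andnot_ps(mask, else_v));
        _mm_storeu_ps(out + i, r);
    }
    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        out[i] = (a[i] < b[i]) ? a[i] + b[i] : a[i] - b[i];
}
```

Both sides are evaluated for every element, so this only pays off when each side is cheap and free of side effects.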

Is there a good reference for ARM Neon intrinsics?

限于喜欢 posted on 2019-12-17 21:58:17
Question: The ARM reference manual doesn't go into much detail on the individual instructions ( http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/BABIIBBG.html ). Is there something a little more detailed?

Answer 1: For more information on the instructions themselves, you need the Assembler Guide. The list you found there just shows the mapping from compiler intrinsics to assembly instructions.

Answer 2: There's also the ARM C Language Extensions which provides details on the

VS: unexpected optimization behavior with _BitScanReverse64 intrinsic

半世苍凉 posted on 2019-12-17 20:25:24
Question: The following code works fine in debug mode, since _BitScanReverse64 is defined to return 0 if no bit is set. Citing MSDN: (The return value is) "Nonzero if Index was set, or 0 if no set bits were found." If I compile this code in release mode it still works, but if I enable compiler optimizations such as /O1 or /O2, the index is not zero and the assert() fails.

    #include <iostream>
    #include <cassert>
    using namespace std;
    int main() {
        unsigned long index = 0;
        _BitScanReverse64(&index, 0x0ull);
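The MSDN sentence quoted in the question describes the return value, not the index, so one defensive pattern is to branch on the return value and never read the index when it is 0. The sketch below assumes that reading; highest_set_bit is a hypothetical wrapper name, and the non-MSVC branch uses GCC's __builtin_clzll only so the example also builds outside Visual Studio.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Hypothetical wrapper. Returns 1 and stores the position of the highest
 * set bit in *index when value != 0. Returns 0 and leaves *index untouched
 * when value == 0, so callers never see a meaningless index. */
static int highest_set_bit(uint64_t value, unsigned long *index) {
#if defined(_MSC_VER)
    unsigned long idx;
    if (!_BitScanReverse64(&idx, value))
        return 0;               /* no bit set: do not trust idx */
    *index = idx;
    return 1;
#else
    if (value == 0)
        return 0;               /* __builtin_clzll(0) is undefined */
    *index = 63ul - (unsigned long)__builtin_clzll(value);
    return 1;
#endif
}
```

A caller would then assert on the return value (0 for an empty mask) rather than on the contents of the index variable.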

Visual C++ x64 add with carry

ぃ、小莉子 posted on 2019-12-17 18:57:55
Question: Since there doesn't seem to be an intrinsic for ADC, and I can't use inline assembler for the x64 architecture with Visual C++, what should I do if I want to write a function using add with carry but include it in a C++ namespace? (Emulating with comparison operators is not an option. This 256 megabit add is performance critical.)

Answer 1: There is now an intrinsic for ADC in MSVC: _addcarry_u64. The following code

    #include <inttypes.h>
    #include <intrin.h>
    #include <stdio.h>
    typedef struct { uint64
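As a separate, self-contained illustration of how _addcarry_u64 chains a carry across limbs (this is not a reconstruction of the truncated answer code), here is a sketch of a 256-bit add. The u256 type and add_u256 name are hypothetical. It assumes an x86-64 target; with GCC or Clang the intrinsic is reached through <immintrin.h> rather than <intrin.h>.

```c
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <immintrin.h>
#endif

/* Hypothetical 256-bit integer: four 64-bit limbs, least significant first. */
typedef struct { unsigned long long limb[4]; } u256;

/* Computes *sum = *a + *b and returns the carry out of the top limb. */
static unsigned char add_u256(const u256 *a, const u256 *b, u256 *sum) {
    unsigned char carry = 0;
    for (int i = 0; i < 4; ++i) {
        /* Each call adds two limbs plus the incoming carry and
         * returns the outgoing carry for the next limb. */
        carry = _addcarry_u64(carry, a->limb[i], b->limb[i], &sum->limb[i]);
    }
    return carry;
}
```

Whether the compiler turns the loop into a clean add/adc/adc/adc chain depends on the compiler and version, so it is worth checking the generated assembly.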

GNU C native vectors: how to broadcast a scalar, like x86's _mm_set1_epi16

空扰寡人 posted on 2019-12-17 16:59:28
Question: How do I write a portable GNU C builtin-vectors version of this, which doesn't depend on the x86 set1 intrinsic?

    typedef uint16_t v8su __attribute__((vector_size(16)));
    v8su set1_u16_x86(uint16_t scalar) {
        return (v8su)_mm_set1_epi16(scalar); // cast needed for gcc
    }

Surely there must be a better way than

    v8su set1_u16(uint16_t s) {
        return (v8su){s,s,s,s, s,s,s,s};
    }

I don't want to write an AVX2 version of that for broadcasting a single byte! Even a gcc-only or clang-only answer to this part
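One option, sketched here rather than taken from the truncated thread, relies on GCC's documented rule that a binary vector operation with one scalar operand converts the scalar into a vector with that value in every element. Adding the scalar to an all-zero vector therefore yields a broadcast. Support in other compilers is an assumption that should be checked for the versions in use.

```c
#include <stdint.h>

typedef uint16_t v8su __attribute__((vector_size(16)));

/* Broadcast s to all eight lanes without any x86 intrinsic.
 * scalar + vector makes the compiler splat the scalar first;
 * adding to a zero vector then leaves just the splatted value. */
static inline v8su set1_u16(uint16_t s) {
    return s + (v8su){0};
}
```

The same expression works for other element types and vector widths by changing the typedef, which avoids writing out a long initializer list for byte vectors.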

Questions about the performance of different implementations of strlen [closed]

孤街醉人 posted on 2019-12-17 16:53:52
Question (closed as off-topic): I have implemented the strlen() function in different ways, including SSE2 assembly, SSE4.2 assembly and SSE2 intrinsics. I also ran some experiments on them, together with strlen() in <string.h> and strlen() in glibc. However, their timings in milliseconds are unexpected. My experiment

Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]

风流意气都作罢 posted on 2019-12-17 16:44:40
Question: I am using SSE intrinsics to determine if a rectangle (defined by four int32 values) has changed:

    __m128i oldRect; // contains old left, top, right, bottom packed to 128 bits
    __m128i newRect; // contains new left, top, right, bottom packed to 128 bits
    __m128i xor = _mm_xor_si128(oldRect, newRect);

At this point, the resulting xor value will be all zeros if the rectangle hasn't changed. What is then the most efficient way of determining that? Currently I am doing so:

    if (xor.m128i_u64[0] | xor
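A widely used SSE2-only way to test a __m128i for all zeros is to compare every byte against zero and collapse the result into a 16-bit mask. The sketch below shows that idea; rect_changed is a hypothetical function name, and no claim is made that this is the fastest option on every CPU. With SSE4.1 available, _mm_testz_si128 is an alternative.

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: returns nonzero if the two packed rectangles differ. */
static int rect_changed(__m128i old_rect, __m128i new_rect) {
    /* All-zero when the rectangles are identical. */
    __m128i diff = _mm_xor_si128(old_rect, new_rect);
    /* Each byte becomes 0xFF where diff's byte is zero. */
    __m128i is_zero = _mm_cmpeq_epi8(diff, _mm_setzero_si128());
    /* One mask bit per byte; 0xFFFF means all 16 bytes were zero. */
    return _mm_movemask_epi8(is_zero) != 0xFFFF;
}
```

The variable is called diff rather than xor here because xor is an alternative operator token in C++.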

What's the difference between logical SSE intrinsics?

南笙酒味 posted on 2019-12-17 16:06:31
Question: Is there any difference between logical SSE intrinsics for different types? For example, for the OR operation there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions: Is there any difference between using one intrinsic or another (with appropriate type casting)? Are there any hidden costs, such as longer execution in some specific situation? These intrinsics map to three different x86
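The claim that all three intrinsics compute the same bits can be checked directly with the cast intrinsics, which reinterpret a register without converting values. The sketch below does only that; or_variants_agree is a hypothetical name, and the example says nothing about the performance side of the question.

```c
#include <emmintrin.h>  /* SSE2: integer, float and double variants */
#include <string.h>

/* Hypothetical check: returns 1 if _mm_or_si128, _mm_or_ps and _mm_or_pd
 * produce bit-identical results for the same 128-bit inputs. */
static int or_variants_agree(__m128i a, __m128i b) {
    __m128i r_int = _mm_or_si128(a, b);
    /* The casts only change the static type, not the bits. */
    __m128i r_ps = _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
    __m128i r_pd = _mm_castpd_si128(
        _mm_or_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));
    return memcmp(&r_int, &r_ps, sizeof r_int) == 0 &&
           memcmp(&r_int, &r_pd, sizeof r_int) == 0;
}
```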

_addcarry_u64 and _addcarryx_u64 with MSVC and ICC

牧云@^-^@ posted on 2019-12-17 07:52:11
Question: MSVC and ICC both support the intrinsics _addcarry_u64 and _addcarryx_u64. According to Intel's Intrinsics Guide and white paper, these should map to adcx and adox respectively. However, looking at the generated assembly, it's clear they map to adc and adcx respectively, and there is no intrinsic which maps to adox. Additionally, telling the compiler to enable AVX2 with /arch:AVX2 in MSVC or -march=core-avx2 with ICC on Linux makes no difference. I'm not sure how to enable ADX with MSVC and