intrinsics | 易学教程

Equivalent of InterlockedIncrement in Linux/gcc

Question: This is probably a very simple question (and possibly a duplicate), but I was unable to find an answer. The Win32 API provides a very handy set of atomic operations (as intrinsics) such as InterlockedIncrement, which emits lock add x86 code. Also, InterlockedCompareExchange is mapped to lock cmpxchg. I want to do the same on Linux with gcc. Since I'm working in 64-bit, it's impossible to use inline assembly. Are there intrinsics for gcc?

Answer 1: GCC Atomic Built-ins

Source: https://stackoverflow.com/questions/2125937
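As an illustration of the GCC atomic built-ins the answer points to, here is a minimal sketch in C. The wrapper names `atomic_increment` and `atomic_compare_exchange` are made up for this example; only the `__sync_*` built-ins come from GCC. The argument order of the second wrapper is chosen to resemble InterlockedCompareExchange (destination, exchange, comparand), which is an assumption about how one might want to port existing code.

```c
#include <stdint.h>

/* Hypothetical wrapper: atomically add 1 and return the NEW value,
   roughly what InterlockedIncrement does on Win32. */
static inline int32_t atomic_increment(volatile int32_t *value)
{
    return __sync_add_and_fetch(value, 1);
}

/* Hypothetical wrapper: store `exchange` only if *destination equals
   `comparand`, and return the value that was there before the call,
   roughly what InterlockedCompareExchange does on Win32. */
static inline int32_t atomic_compare_exchange(volatile int32_t *destination,
                                              int32_t exchange,
                                              int32_t comparand)
{
    /* GCC's argument order is (ptr, expected old value, new value). */
    return __sync_val_compare_and_swap(destination, comparand, exchange);
}
```

Newer GCC versions also offer the `__atomic_*` built-ins and C11 `<stdatomic.h>`, which take an explicit memory order; the `__sync_*` family shown here acts as a full barrier.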

How to use if condition in intrinsics

Question: I want to compare two floating-point variables using intrinsics. If the comparison is true, do one thing; otherwise do something else, as in a normal if..else condition. Is there any way to do this using intrinsics? /* normal code */ vector<float> v1, v2; for(int i = 0; i < v1.size(); ++i) if(v1[i]<v2[i]) { /* do something */ } else { /* do something else */ } How can I do this using SSE2 or AVX?

Answer 1: SIMD conditional operations are done with branchless techniques. You use a packed-compare instruction to get a vector
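A minimal sketch of the branchless idea, assuming an x86 target with SSE: a packed compare produces an all-ones or all-zeros mask per element, and AND/ANDNOT/OR then pick one of two candidate results. The helper name `select_less_than`, its parameters, and the requirement that `n` be a multiple of 4 are assumptions made for this example, not part of the original answer.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_cmplt_ps, _mm_and_ps, ... */

/* Hypothetical helper: out[i] = (a[i] < b[i]) ? if_true[i] : if_false[i].
   Processes four floats per iteration; n is assumed to be a multiple of 4. */
static void select_less_than(const float *a, const float *b,
                             const float *if_true, const float *if_false,
                             float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);

        /* Each lane becomes all ones where a < b, all zeros otherwise. */
        __m128 mask = _mm_cmplt_ps(va, vb);

        /* Keep the "true" value where the mask is set ... */
        __m128 t = _mm_and_ps(mask, _mm_loadu_ps(if_true + i));
        /* ... and the "false" value where it is clear: (~mask) & if_false. */
        __m128 f = _mm_andnot_ps(mask, _mm_loadu_ps(if_false + i));

        _mm_storeu_ps(out + i, _mm_or_ps(t, f));
    }
}
```

Both candidate results are computed for every lane and the mask chooses between them, so this only fits when evaluating both sides has no side effects.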

Is there a good reference for ARM Neon intrinsics?

Question: The ARM reference manual doesn't go into much detail about the individual instructions ( http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/BABIIBBG.html ). Is there something a little more detailed?

Answer 1: For more information on the instructions themselves, you need the Assembler Guide. The list you found there just shows the mapping from compiler intrinsics to assembly instructions.

Answer 2: There's also the ARM C Language Extensions which provides details on the

VS: unexpected optimization behavior with _BitScanReverse64 intrinsic

Question: The following code works fine in debug mode, since _BitScanReverse64 is defined to return 0 if no bit is set. Citing MSDN: (The return value is) "Nonzero if Index was set, or 0 if no set bits were found." If I compile this code in release mode it still works, but if I enable compiler optimizations such as /O1 or /O2, the index is not zero and the assert() fails. #include <iostream> #include <cassert> using namespace std; int main() { unsigned long index = 0; _BitScanReverse64(&index, 0x0ull);
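The quoted MSDN sentence describes the return value, not the index, which suggests the index should only be read when the function reports success. The sketch below illustrates that calling pattern with a hypothetical GCC/Clang stand-in, `bit_scan_reverse64`, built on `__builtin_clzll`; it is not the MSVC intrinsic, and unlike the intrinsic it deliberately leaves `*index` untouched when the mask is zero.

```c
#include <stdint.h>

/* Hypothetical stand-in mirroring the documented return-value contract of
   _BitScanReverse64: returns nonzero and writes the position of the highest
   set bit, or returns 0 when mask has no set bits. */
static unsigned char bit_scan_reverse64(unsigned long *index, uint64_t mask)
{
    if (mask == 0)
        return 0;  /* __builtin_clzll(0) is undefined, so bail out first */
    *index = 63u - (unsigned long)__builtin_clzll(mask);
    return 1;
}

/* Hypothetical caller: decide based on the return value, never on the index
   alone. Returns -1 when no bit is set. */
static int highest_set_bit(uint64_t mask)
{
    unsigned long index;
    if (!bit_scan_reverse64(&index, mask))
        return -1;
    return (int)index;
}
```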

Visual C++ x64 add with carry

Question: Since there doesn't seem to be an intrinsic for ADC and I can't use inline assembler for the x64 architecture with Visual C++, what should I do if I want to write a function using add with carry but include it in a C++ namespace? (Emulating with comparison operators is not an option. This 256 megabit add is performance critical.)

Answer 1: There is now an intrinsic for ADC in MSVC: _addcarry_u64. The following code #include <inttypes.h> #include <intrin.h> #include <stdio.h> typedef struct { uint64
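A minimal sketch of chaining `_addcarry_u64` across several 64-bit limbs, written for GCC or Clang on x86-64, where the intrinsic comes from `<immintrin.h>` rather than MSVC's `<intrin.h>`. The `u256` type and the `add256` function are hypothetical names chosen for this example and are not the code from the truncated answer.

```c
#include <immintrin.h>  /* _addcarry_u64 on GCC/Clang; MSVC uses <intrin.h> */

/* Hypothetical wide integer: four 64-bit limbs, least significant first. */
typedef struct {
    unsigned long long limb[4];
} u256;

/* r = a + b, limb by limb. Each call consumes the carry from the previous
   limb and produces the carry for the next one. Returns the carry out of
   the most significant limb. */
static unsigned char add256(u256 *r, const u256 *a, const u256 *b)
{
    unsigned char carry = 0;
    for (int i = 0; i < 4; ++i)
        carry = _addcarry_u64(carry, a->limb[i], b->limb[i], &r->limb[i]);
    return carry;
}
```

Whether the compiler turns the loop into a straight add/adc chain depends on the compiler and optimization level, so the generated assembly is worth checking when performance matters.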

GNU C native vectors: how to broadcast a scalar, like x86's _mm_set1_epi16

Question: How do I write a portable GNU C builtin vectors version of this, which doesn't depend on the x86 set1 intrinsic? typedef uint16_t v8su __attribute__((vector_size(16))); v8su set1_u16_x86(uint16_t scalar) { return (v8su)_mm_set1_epi16(scalar); /* cast needed for gcc */ } Surely there must be a better way than v8su set1_u16(uint16_t s) { return (v8su){s,s,s,s, s,s,s,s}; } I don't want to write an AVX2 version of that for broadcasting a single byte! Even a gcc-only or clang-only answer to this part
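One possible approach, sketched here under the assumption of a reasonably recent GCC: GNU C allows a binary vector operation in which one operand is a scalar, and the compiler then converts the scalar into a vector with every element equal to it. Adding the scalar to an all-zero vector therefore yields a broadcast without listing each element. The function name `set1_u16_generic` is made up for this example, and support for this form in other compilers is not guaranteed.

```c
#include <stdint.h>

typedef uint16_t v8su __attribute__((vector_size(16)));

/* Broadcast s to all eight lanes without naming any x86 intrinsic.
   The scalar operand is implicitly widened to a vector of eight copies,
   and adding the zero vector leaves those copies unchanged. */
static v8su set1_u16_generic(uint16_t s)
{
    return s + (v8su){0};
}
```

The same expression works unchanged for other element types and vector widths, since the lane count comes from the vector type rather than from the initializer.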

Questions about the performance of different implementations of strlen [closed]

Question: I have implemented the strlen() function in different ways, including SSE2 assembly, SSE4.2 assembly and SSE2 intrinsics. I also ran some experiments on them, together with strlen() in <string.h> and strlen() in glibc. However, their performance in terms of time (milliseconds) is unexpected. My experiment

Most efficient way to check if all __m128i components are 0 [using <= SSE4.1 intrinsics]

Question: I am using SSE intrinsics to determine if a rectangle (defined by four int32 values) has changed: __m128i oldRect; /* contains old left, top, right, bottom packed to 128 bits */ __m128i newRect; /* contains new left, top, right, bottom packed to 128 bits */ __m128i xor = _mm_xor_si128(oldRect, newRect); At this point, the resulting xor value will be all zeros if the rectangle hasn't changed. What is then the most efficient way of determining that? Currently I am doing so: if (xor.m128i_u64[0] | xor
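One commonly used SSE2-only way to test a 128-bit value for zero, sketched here as an illustration rather than as the accepted answer: compare every byte against zero, collect the 16 comparison results into a bitmask, and check that all 16 bits are set. The helper names `is_all_zero` and `rect_changed` are assumptions made for this example. On SSE4.1 hardware, `_mm_testz_si128(v, v)` is an alternative that needs no comparison step.

```c
#include <emmintrin.h>  /* SSE2 */

/* Returns 1 if all 128 bits of v are zero, 0 otherwise. */
static int is_all_zero(__m128i v)
{
    /* Bytes equal to zero become 0xFF, all other bytes become 0x00. */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    /* One mask bit per byte; 0xFFFF means every byte compared equal. */
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

/* Hypothetical helper: has the packed rectangle
   (left, top, right, bottom) changed? */
static int rect_changed(__m128i old_rect, __m128i new_rect)
{
    return !is_all_zero(_mm_xor_si128(old_rect, new_rect));
}
```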

What's the difference between logical SSE intrinsics?

Question: Is there any difference between logical SSE intrinsics for different types? For example, if we take the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions: Is there any difference between using one intrinsic or another (with appropriate type casting)? Won't there be any hidden costs, like longer execution in some specific situation? These intrinsics map to three different x86
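To make "with appropriate type casting" concrete, here is a small sketch, assuming an x86 target with SSE2. It ORs the same 128 bits once through `_mm_or_si128` and once through `_mm_or_ps` using the cast intrinsics, which only reinterpret the register, and then checks that the results are bit-identical. The function name `or_variants_agree` is made up for this example. It demonstrates only that the results match; it says nothing about timing, which is what the question is really asking.

```c
#include <emmintrin.h>  /* SSE2: integer ops and the cast intrinsics */

/* Returns 1 if _mm_or_si128 and _mm_or_ps give the same 128 bits
   for the inputs a and b, 0 otherwise. */
static int or_variants_agree(__m128i a, __m128i b)
{
    __m128i via_int = _mm_or_si128(a, b);

    /* The casts generate no conversion; they only change the C type. */
    __m128 via_ps = _mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b));

    /* XOR is all zeros exactly when the two results are identical. */
    __m128i diff = _mm_xor_si128(via_int, _mm_castps_si128(via_ps));
    __m128i eq = _mm_cmpeq_epi8(diff, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```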

_addcarry_u64 and _addcarryx_u64 with MSVC and ICC

Question: MSVC and ICC both support the intrinsics _addcarry_u64 and _addcarryx_u64. According to Intel's Intrinsics Guide and white paper, these should map to adcx and adox respectively. However, by looking at the generated assembly it's clear they map to adc and adcx respectively, and there is no intrinsic which maps to adox. Additionally, telling the compiler to enable AVX2 with /arch:AVX2 in MSVC or -march=core-avx2 with ICC on Linux makes no difference. I'm not sure how to enable ADX with MSVC and