micro-optimization

Is CMOVcc considered a branching instruction?

笑着哭i Submitted on 2020-08-20 07:27:40
Question: I have this memchr code that I'm trying to make non-branching:

    .globl memchr
    memchr:
        mov     %rdx, %rcx
        mov     %sil, %al
        cld
        repne scasb
        lea     -1(%rdi), %rax
        test    %rcx, %rcx
        cmove   %rcx, %rax
        ret

I'm unsure whether or not cmove is a branching instruction. Is it? If so, how do I rearrange my code so it doesn't branch?

Answer 1: No, it's not a branch; that's the whole point of cmovcc. It's an ALU select that has a data dependency on both inputs, not a control dependency. (With a memory source, it
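
As an aside, the difference between a select and a branch is easy to see from C++ as well. Below is a minimal illustrative sketch (my own, not the asker's memchr): a ternary select of two already-computed values, which mainstream x86-64 compilers typically lower to test + cmov rather than a conditional jump, so the result carries a data dependency on both values while control flow stays straight-line.

    #include <cstddef>

    // Illustration only: pick one of two already-computed values.
    // Compilers usually emit cmovcc here, not a branch, so there is nothing
    // for the branch predictor to mispredict -- but the result cannot be
    // produced until both inputs (and the flags) are ready.
    std::size_t pick(bool found, std::size_t hit_pos, std::size_t miss_value) {
        return found ? hit_pos : miss_value;
    }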

What do multiple values or ranges mean as the latency for a single instruction?

只谈情不闲聊 Submitted on 2020-08-19 10:55:45
Question: I have a question about instruction latency on https://uops.info/. For some instructions, like PCMPEQB (XMM, M128), the latency in the table entry for Skylake is listed as [1;≤8]. I know a little about latency, but as far as I knew it was always a single number, for example 1, 2, or 3. So what is this [1;≤8]? Does it mean the latency depends on memory and is somewhere between 1 and 8? If so, when is it 1, when is it 3, and so on? For example, what is the latency for this: pcmpeqb xmm0,
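
For intuition, here is a hypothetical microbenchmark sketch (my own illustration, not taken from uops.info) of how the register-input part of such a latency can be observed: a loop whose only loop-carried dependency runs through the XMM operand of pcmpeqb. The loads do not depend on the accumulator and can run ahead, so cycles per iteration reflect the 1-cycle register-input latency; the memory input's latency cannot be isolated the same way, which is why only an upper bound is reported for it.

    #include <immintrin.h>
    #include <cstdint>

    uint64_t pcmpeqb_chain(const void* buf, uint64_t iters) {
        const __m128i* p = static_cast<const __m128i*>(buf);
        __m128i acc = _mm_setzero_si128();
        for (uint64_t i = 0; i < iters; ++i) {
            // The load from *p does not depend on acc, so it can start early;
            // only the register-input latency chains iterations together.
            acc = _mm_cmpeq_epi8(acc, _mm_loadu_si128(p));
        }
        return static_cast<uint32_t>(_mm_cvtsi128_si32(acc));
    }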

Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock?

二次信任 Submitted on 2020-08-08 06:19:28
Question: For the purposes of this question, assume a simple spinlock that does not go to the OS to wait. I see that a simple spinlock is often implemented using lock xchg or lock bts instead of lock cmpxchg. But doesn't cmpxchg avoid writing the value if the expectation does not match? So aren't failed attempts cheaper with cmpxchg? Or does cmpxchg write the data and invalidate the cache line of other cores even on failure? This question is similar to What specifically marks an x86 cache line as dirty - any write,
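
For context, here is a minimal sketch (my own, assuming C++11 std::atomic) of the usual test-and-test-and-set pattern this trade-off feeds into: the lock-prefixed RMW (exchange here, i.e. lock xchg; compare_exchange would be the cmpxchg variant) is only attempted once a plain read-only spin has seen the lock free, so the contended cache line can mostly stay in a shared state instead of being written on every attempt.

    #include <atomic>

    class Spinlock {
        std::atomic<int> locked{0};
    public:
        void lock() {
            for (;;) {
                // RMW attempt (lock xchg); succeeds only if the lock was free.
                if (locked.exchange(1, std::memory_order_acquire) == 0)
                    return;
                // Read-only spin: no writes, so the line can stay Shared
                // until the owner stores 0.
                while (locked.load(std::memory_order_relaxed) != 0) {
                    // a pause/yield hint could go here
                }
            }
        }
        void unlock() { locked.store(0, std::memory_order_release); }
    };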

Performance penalty: denormalized numbers versus branch mis-predictions

拜拜、爱过 Submitted on 2020-07-09 15:01:46
Question: For those who have already measured this or have deep knowledge about these kinds of considerations: assume you have to compute the following floating-point operation (just picking one for the example):

    float calc(float y, float z) { return sqrt(y * y + z * z) / 100; }

where y and z could be denormal numbers. Let's assume two possible situations in which just y, just z, or maybe both, in a totally random manner, can be denormal numbers: 50% of the time, or <1% of the time. And now assume I want to avoid the
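
One non-branching alternative worth keeping in mind (a sketch, under the assumption that treating subnormals as zero is numerically acceptable for the application) is to set the SSE FTZ/DAZ bits in MXCSR once per thread, so the hardware never takes the slow microcoded path for denormal inputs or results:

    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
    #include <cmath>

    // Typically done once at thread start-up, not per call.
    void enable_ftz_daz() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // flush denormal results to 0
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // treat denormal inputs as 0
    }

    float calc(float y, float z) { return std::sqrt(y * y + z * z) / 100; }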

Why does .NET Native compile the loop in reverse order?

廉价感情. Submitted on 2020-07-06 11:15:27
Question: I'm working on the optimization techniques performed by the .NET Native compiler. I've created a sample loop:

    for (int i = 0; i < 100; i++) { Function(); }

and compiled it with .NET Native. Then I disassembled the resulting .dll file (which contains the machine code) in IDA. As a result, I have: (I've removed a few unnecessary lines, so don't worry that the address lines are inconsistent.) I understand that add esi, 0FFFFFFFFh really means "subtract one from esi and set the Zero Flag if needed", so we can jump to
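
The transformation itself is not .NET-specific; a hedged C++ illustration of the same idea (my own, not the .NET Native compiler's actual output) is shown below. When the counter isn't otherwise used in the body, counting down lets the decrement set the flags itself, so the loop branch needs no separate compare against 100, which is exactly what add esi, 0FFFFFFFFh (i.e. subtract 1) followed by a conditional jump achieves.

    void Function();  // assumed external call, as in the question

    // Forward form: needs "increment, compare with 100, branch".
    void run_up()   { for (int i = 0; i < 100; ++i) Function(); }

    // Reversed form with the same trip count: the decrement both updates the
    // counter and sets ZF, so the branch can test the flags directly.
    void run_down() { for (int i = 100; i != 0; --i) Function(); }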

Very fast approximate Logarithm (natural log) function in C++?

痞子三分冷 Submitted on 2020-07-05 02:52:07
Question: We can find various tricks to replace std::sqrt (Timing Square Root) and some for std::exp (Using Faster Exponential Approximation), but I find nothing to replace std::log. It's part of loops in my program and it's called multiple times, and while exp and sqrt have been optimized, Intel VTune now suggests I optimize std::log; after that, it seems that only my design choices will be limiting. For now I use a 3rd-order Taylor approximation of ln(1+x) with x between -0.5 and +0.5 (90% of the cases for
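
For comparison, here is one common shape such a replacement takes (a rough sketch of my own, not the asker's code; the cubic's coefficients are only an illustrative fit with absolute error on the order of 1e-3): decompose x into 2^e * m with m in [1, 2) straight from the IEEE-754 bit pattern, approximate ln(m) with a small polynomial, and add e*ln(2).

    #include <cstdint>
    #include <cstring>

    // Assumes x is positive, finite and not denormal.
    float fast_log(float x) {
        std::uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        int e = static_cast<int>(bits >> 23) - 127;           // unbiased exponent
        bits = (bits & 0x007FFFFFu) | 0x3F800000u;            // remap mantissa to [1, 2)
        float m;
        std::memcpy(&m, &bits, sizeof m);
        float t = m - 1.0f;                                   // t in [0, 1)
        // Illustrative cubic fit of ln(1 + t) on [0, 1); refit for tighter accuracy.
        float ln_m = t * (0.98939f + t * (-0.41760f + t * 0.12136f));
        return 0.69314718f * e + ln_m;                        // ln(x) = e*ln2 + ln(m)
    }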

An implementation of std::atomic_thread_fence(std::memory_order_seq_cst) on x86 without extra performance penalties

二次信任 Submitted on 2020-06-16 19:09:40
Question: A follow-up question to Why does this `std::atomic_thread_fence` work. Since a dummy interlocked operation is better than _mm_mfence, and there are quite a few ways to implement it, which interlocked operation, and on what data, should be used? Assume inline assembly that is not aware of the surrounding context but can tell the compiler which registers it clobbers.

Answer 1: Short answer for now, without going into too much detail about why. See specifically the discussion in comments on that
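
To show only the shape of the idea, here is a hedged GNU-style inline-asm sketch (my own; the exact choice of operand and address is precisely what the question is asking about): a lock-prefixed OR of 0 into stack memory leaves the data unchanged but, because of the lock prefix, acts as a full barrier like mfence, usually at lower cost.

    #include <atomic>

    static inline void full_fence_via_locked_op() {
    #if defined(__x86_64__)
        // OR of 0 is a data no-op; the lock prefix still drains the store
        // buffer, giving seq_cst fence semantics. "memory" stops compiler
        // reordering; "cc" covers the flags the OR writes.
        asm volatile("lock orl $0, -8(%%rsp)" ::: "memory", "cc");
    #else
        std::atomic_thread_fence(std::memory_order_seq_cst);  // portable fallback
    #endif
    }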
