micro-optimization

On the use and abuse of alloca

Submitted by 安稳与你 on 2019-11-30 06:54:39
Question: I am working on a soft real-time event processing system. I would like to minimise the number of calls in my code that have non-deterministic timing. I need to construct a message that consists of strings, numbers, timestamps and GUIDs; probably a std::vector of boost::variant's. I have always wanted to use alloca in past code of a similar nature. However, when one looks into the systems programming literature, there are always strong cautions against this function call. Personally I can't think of a …

PHP: What is the fastest and easiest way to get the last item of an array?

Submitted by 筅森魡賤 on 2019-11-30 05:47:47
Question: What is the fastest and easiest way to get the last item of an array, whether it is an indexed array, an associative array or a multi-dimensional array?

Answer 1: $myArray = array( 5, 4, 3, 2, 1 ); echo end($myArray); prints "1"

Answer 2: array_pop() removes the element from the end of the array. If you need to keep the array intact, you can use it and then append the value back to the end of the array: $array[] = $popped_val

Answer 3: Try this: $arrayname[count($arrayname)-1]

Answer 4: I would say array_pop. In the …

Using SIMD/AVX/SSE for tree traversal

Submitted by 浪尽此生 on 2019-11-30 02:29:36
I am currently researching whether it would be possible to speed up a van Emde Boas (or any other) tree traversal. Given a single search query as input, and with multiple tree nodes already in the cache line (van Emde Boas layout), tree traversal seems to be instruction-bottlenecked. Being fairly new to SIMD/AVX/SSE instructions, I would like to know from experts in the topic whether it would be possible to compare multiple nodes at once against a value and then work out which tree path to follow. My research led to the following question: how many CPU cycles/instructions are wasted on …

Passing null pointer to placement new

Submitted by 风流意气都作罢 on 2019-11-29 22:50:53
The default placement new operator is declared in 18.6 [support.dynamic] ¶1 with a non-throwing exception-specification: void* operator new (std::size_t size, void* ptr) noexcept; This function does nothing except return ptr; so it is reasonable for it to be noexcept. However, according to 5.3.4 [expr.new] ¶15, this means that the compiler must check that it doesn't return null before invoking the object's constructor: -15- [ Note: unless an allocation function is declared with a non-throwing exception-specification (15.4), it indicates failure to allocate storage by throwing a std::bad_alloc …

Why swap doesn't use Xor operation in C++

Submitted by ☆樱花仙子☆ on 2019-11-29 18:01:04
Question: I've learned that the XOR operation can be used to implement an efficient swap function, like this: template<class T> void swap(T& a, T& b) { a = a^b; b = a^b; a = a^b; } But every implementation of swap I can find on the internet is essentially like this: template<class T> void swap(T& a, T& b) { T temp(a); a = b; b = temp; } It seems that the compiler doesn't generate the same code for the two forms above, because I tested them on VC++ 2010 and the first one did the job more quickly than std: …

Loading an xmm from GP regs

Submitted by 我们两清 on 2019-11-29 15:12:24
Let's say you have values in rax and rdx that you want to load into an xmm register. One way would be: movq xmm0, rax / pinsrq xmm0, rdx, 1. It's pretty slow though! Is there a better way?

Answer: You're not going to do better for latency or uop count on recent Intel or AMD (I mostly looked at Agner Fog's tables for Ryzen / Skylake). movq + movq + punpcklqdq is also 3 uops, for the same port(s). On Intel / AMD, storing the GP registers to a temporary location and reloading them with a 16-byte read may be worth considering for throughput if the surrounding code bottlenecks on the ALU port for integer->vector, which is …

Why jnz requires 2 cycles to complete in an inner loop

Submitted by 末鹿安然 on 2019-11-29 14:22:05
I'm on an IvyBridge. I found the performance behavior of jnz inconsistent between an inner loop and an outer loop. The following simple program has an inner loop with a fixed size of 16:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax, 16
.loop_inner:
    dec rax
    jnz .loop_inner
    dec rcx
    jnz .loop_outer
    xor edi, edi
    mov eax, 60
    syscall

The perf tool shows the outer loop runs at 32c/iter, which suggests that jnz requires 2 cycles to complete. I then searched Agner's instruction tables; a conditional jump has 1-2 "reciprocal throughput", with the comment "fast if no jump". At this point I started to believe the above …

Can modern x86 implementations store-forward from more than one prior store?

Submitted by ≡放荡痞女 on 2019-11-29 13:57:41
In the case that a load overlaps two earlier stores (and the load is not fully contained in the oldest store), can modern Intel or AMD x86 implementations forward from both stores to satisfy the load? For example, consider the following sequence:

mov [rdx + 0], eax
mov [rdx + 2], eax
mov ax, [rdx + 1]

The final 2-byte load takes its second byte from the immediately preceding store, but its first byte from the store before that. Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1? Note that by store-forwarding here I'm including any mechanism that can …

How to force NASM to encode [1 + rax*2] as disp32 + index*2 instead of disp8 + base + index?

Submitted by 为君一笑 on 2019-11-29 13:41:15
To efficiently compute x = x*10 + 1, it's probably optimal to use:

lea eax, [rax + rax*4]    ; x *= 5
lea eax, [1 + rax*2]      ; x = x*2 + 1

A 3-component LEA has higher latency on modern Intel CPUs, e.g. 3 cycles vs. 1 on Sandybridge-family, so disp32 + index*2 is faster than disp8 + base + index*1 on SnB-family, i.e. most of the mainstream x86 CPUs we care about optimizing for. (This mostly only applies to LEA, not to loads/stores, because LEA runs on ALU execution units, not the AGUs, in most modern x86 CPUs.) AMD CPUs have a slower LEA with 3 components or scale > 1 (http://agner.org/optimize/). But NASM and …

x > -1 vs x >= 0, is there a performance difference

Submitted by 拜拜、爱过 on 2019-11-29 10:42:26
Question: I heard a teacher drop this once, and it has been bugging me ever since. Let's say we want to check whether the integer x is greater than or equal to 0. There are two ways to check this: if (x > -1){ //do stuff } and if (x >= 0){ //do stuff } According to this teacher, > would be slightly faster than >=. In this case it was Java, but according to him this also applied to C, C++ and other languages. Is there any truth to this statement?

Answer 1: There's no difference in any real-world sense. Let's …