micro-optimization

Difference between “or eax,eax” and “test eax,eax” [duplicate]

南笙酒味 Submitted on 2019-11-29 10:06:22
This question already has an answer here: Test whether a register is zero with CMP reg,0 vs OR reg,reg? What's the difference between or eax,eax and test eax,eax? I've seen different compilers produce both for the same comparison, and as far as the documentation goes they do exactly the same thing, so I'm wondering why they don't all use test eax,eax. Thinking about it, and eax,eax would set the flags identically to either of them, but I haven't seen it emitted by FreePascal, Delphi, or MSVC++. I did compile some asm blocks in Delphi and checked the assembler source, and all three forms …

Fastest implementation of simple, virtual, observer-sort of, pattern in c++?

馋奶兔 Submitted on 2019-11-29 08:08:11
I'm working my arse off trying to implement an alternative for vtables using enums and a ton of macro magic that's really starting to mess with my brain. I'm starting to think I'm not walking the right path, since the code is getting uglier and uglier and will not be fit for production by any means. How can the pattern of the following code be implemented with the least amount of redirection/operations? It has to be done in standard C++, up to C++17.

    class A {
        virtual void Update() = 0; // A is so pure *¬*
    };
    class B : public A {
        void Update() override final {
            // DO B STUFF
        }
    };
    class C : public A { …

What is faster: many ifs, or else if?

和自甴很熟 Submitted on 2019-11-29 05:43:46
Question: I'm iterating through an array and sorting its values into days of the week. To do this I'm using many if statements. Does it make any difference to processing speed if I use many ifs versus a set of else if statements?

Answer 1: Yes, use an else if. Consider the following code:

    if (predicateA) {
        // do stuff
    }
    if (predicateB) {
        // do more stuff
    }

or

    if (predicateA) {
        // ...
    } else if (predicateB) {
        // ...
    }

In the second case, if predicateA is true, predicateB (and any further predicates) will not …

How can the rep stosb instruction execute faster than the equivalent loop?

荒凉一梦 Submitted on 2019-11-29 03:45:12
How can the instruction rep stosb execute faster than this code?

    Clear:
        mov byte [edi], AL  ; Write the value in AL to memory
        inc edi             ; Bump EDI to next byte in the buffer
        dec ecx             ; Decrement ECX by one position
        jnz Clear           ; And loop again until ECX is 0

Is that guaranteed to be true on all modern CPUs? Should I always prefer rep stosb to writing the loop manually? On modern CPUs, the microcoded implementation of rep stosb and rep movsb actually uses stores that are wider than 1 byte, so it can go much faster than one byte per clock. (Note this only applies to stos and movs, not repe …

Using SIMD/AVX/SSE for tree traversal

不羁岁月 Submitted on 2019-11-28 23:26:42
Question: I am currently researching whether it would be possible to speed up a van Emde Boas (or any other) tree traversal. Given a single search query as input, and with multiple tree nodes already in the cache line (van Emde Boas layout), tree traversal seems to be instruction-bottlenecked. Being kind of new to SIMD/AVX/SSE instructions, I would like to know from experts in that topic whether it would be possible to compare multiple nodes at once against a value and then find out which tree path to follow …

On the use and abuse of alloca

风流意气都作罢 Submitted on 2019-11-28 23:11:03
I am working on a soft-realtime event processing system, and I would like to minimise the number of calls in my code that have non-deterministic timing. I need to construct a message that consists of strings, numbers, timestamps and GUIDs: probably a std::vector of boost::variants. I have always wanted to use alloca in past code of a similar nature. However, when one looks into the systems-programming literature, there are always massive cautions against this function call. Personally, I can't think of a server-class machine in the last 15 years that doesn't have virtual memory, and I know for a fact that …

Is it more efficient to perform a range check by casting to uint instead of checking for negative values?

我与影子孤独终老i Submitted on 2019-11-28 19:03:00
I stumbled upon this piece of code in .NET's List source code:

    // Following trick can reduce the range check by one
    if ((uint)index >= (uint)_size) {
        ThrowHelper.ThrowArgumentOutOfRangeException();
    }

Apparently this is more efficient (?) than

    if (index < 0 || index >= _size)

I am curious about the rationale behind the trick. Is a single branch instruction really more expensive than two conversions to uint? Or is there some other optimization going on that will make this code faster than an additional numeric comparison? To address the elephant in the room: yes, this is micro-optimization, …

Cycles/cost for L1 Cache hit vs. Register on x86?

廉价感情. Submitted on 2019-11-28 16:24:20
I remember assuming in my architecture class that an L1 cache hit takes 1 cycle (i.e. identical to register access time), but is that actually true on modern x86 processors? How many cycles does an L1 cache hit take? How does it compare to register access?

paulsm4: Here's a great article on the subject: http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1 To answer your question: yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite costly ;) PS: The specifics will vary, but this link has some good ballpark figures: Approximate cost to …

Why does my application spend 24% of its life doing a null check?

南楼画角 Submitted on 2019-11-28 15:18:56
I've got a performance-critical binary decision tree, and I'd like to focus this question on a single line of code. The code for the binary tree iterator is below, with the results from running performance analysis against it.

            public ScTreeNode GetNodeForState(int rootIndex, float[] inputs)
            {
     0.2%        ScTreeNode node = RootNodes[rootIndex].TreeNode;
    24.6%        while (node.BranchData != null)
                 {
     0.2%            BranchNodeData b = node.BranchData;
     0.5%            node = b.Child2;
    12.8%            if (inputs[b.SplitInputIndex] <= b.SplitValue)
     0.8%                node = b.Child1;
                 }
     0.4%        return node;
            }

BranchData is a field, not a property. I did this to …

Packing two DWORDs into a QWORD to save store bandwidth

早过忘川 Submitted on 2019-11-28 13:58:18
Imagine a load-store loop like the following, which loads DWORDs from non-contiguous locations and stores them contiguously:

    top:
        mov eax, DWORD [rsi]
        mov DWORD [rdi], eax
        mov eax, DWORD [rdx]
        mov DWORD [rdi + 4], eax
        ; unroll the above a few times
        ; increment rdi and rsi somehow
        cmp ...
        jne top

On modern Intel and AMD hardware, when running in-cache, such a loop will usually bottleneck on stores at one store per cycle. That's kind of wasteful, since that's only an IPC of 2 (one store, one load). One idea that naturally arises is to combine two DWORD loads into a single QWORD store, which is …