micro-optimization

What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?

╄→尐↘猪︶ㄣ submitted on 2019-11-27 16:32:36
I believe push/pop instructions will result in more compact code, and may even run slightly faster. This requires disabling stack frames as well, though. To check this, I would need to either rewrite a large enough program in assembly by hand (to compare them), or to install and study a few other compilers (to see if they have an option for this, and to compare the results). Here is the forum topic about this and similar problems. In short, I want to understand which code is better. Code like this:

```asm
sub esp, c
mov [esp+8], eax
mov [esp+4], ecx
mov [esp], edx
...
add esp, c
```

or code like this:

Is vxorps-zeroing on AMD Jaguar/Bulldozer/Zen faster with xmm registers than ymm?

岁酱吖の submitted on 2019-11-27 15:06:17
AMD CPUs handle 256b AVX instructions by decoding them into two 128b operations. e.g. vaddps ymm0, ymm1, ymm1 on AMD Steamroller decodes to 2 macro-ops, with half the throughput of vaddps xmm0, xmm1, xmm1. XOR-zeroing is a special case (no input dependency, and on Jaguar at least it avoids consuming a physical register file entry, and enables movdqa from that register to be eliminated at issue/rename, like Bulldozer does all the time even for non-zeroed regs). But is it detected early enough that vxorps ymm0,ymm0,ymm0 still only decodes to 1 macro-op with equal performance to vxorps xmm0,xmm0,xmm0?

Speed of CSS

拥有回忆 submitted on 2019-11-27 14:53:42
Question: This is just a question to help me understand CSS rendering better. Let's say we have a million lines of this:

```html
<div class="first">
  <div class="second">
    <span class="third">Hello World</span>
  </div>
</div>
```

Which would be the fastest way to change the color of "Hello World" to red?

```css
.third { color: red; }
div.third { color: red; }
div.second div.third { color: red; }
div.first div.second div.third { color: red; }
```

Also, what if there was a tag in the middle that had a unique id of "foo"? Which one

array_push() vs. $array[] = … Which is fastest? [duplicate]

时光怂恿深爱的人放手 submitted on 2019-11-27 13:55:19
This question already has an answer here: What's better to use in PHP, $array[] = $value or array_push($array, $value)? (10 answers) I need to add values received from MySQL into an array [PHP]; here is what I've got:

```php
$players = array();
while ($homePlayerRow = mysql_fetch_array($homePlayerResult)) {
    $players[] = $homePlayerRow['player_id'];
}
```

Is this the only way of doing it? Also, is the following faster/better?

```php
$players = array();
while ($homePlayerRow = mysql_fetch_array($homePlayerResult)) {
    array_push($players, $homePlayerRow['player_id']);
}
```

Thanks in advance. You can run and see that array

Does using xor reg, reg give advantage over mov reg, 0? [duplicate]

匆匆过客 submitted on 2019-11-27 12:33:47
This question already has an answer here: What is the best way to set a register to zero in x86 assembly: xor, mov or and? (1 answer) There are two well-known ways to set an integer register to zero on x86. Either

```asm
mov reg, 0
```

or

```asm
xor reg, reg
```

There's an opinion that the second variant is better, since the value 0 is not stored in the code, which saves several bytes of produced machine code. This is definitely good - less instruction cache is used, and this can sometimes allow for faster code execution. Many compilers produce such code. However, there's formally an inter-instruction dependency

Is it more efficient to perform a range check by casting to uint instead of checking for negative values?

老子叫甜甜 submitted on 2019-11-27 11:50:27
Question: I stumbled upon this piece of code in .NET's List source code:

```csharp
// Following trick can reduce the range check by one
if ((uint) index >= (uint)_size) {
    ThrowHelper.ThrowArgumentOutOfRangeException();
}
```

Apparently this is more efficient (?) than

```csharp
if (index < 0 || index >= _size)
```

I am curious about the rationale behind the trick. Is a single branch instruction really more expensive than two conversions to uint? Or is there some other optimization going on that will make this code faster than an

Cycles/cost for L1 Cache hit vs. Register on x86?

隐身守侯 submitted on 2019-11-27 09:53:24
Question: I remember assuming that an L1 cache hit costs 1 cycle (i.e. identical to register access time) in my architecture class, but is that actually true on modern x86 processors? How many cycles does an L1 cache hit take? How does it compare to register access?

Answer 1: Here's a great article on the subject: http://arstechnica.com/gadgets/reviews/2002/07/caching.ars/1 To answer your question - yes, a cache hit has approximately the same cost as a register access. And of course a cache miss is quite

Why NASM on Linux changes registers in x86_64 assembly

不打扰是莪最后的温柔 submitted on 2019-11-27 09:21:11
I am new to x86_64 assembly programming. I was writing a simple "Hello World" program in x86_64 assembly. Below is my code, which runs perfectly fine.

```asm
global _start

section .data
    msg:  db "Hello to the world of SLAE64", 0x0a
    mlen  equ $-msg

section .text
_start:
    mov rax, 1
    mov rdi, 1
    mov rsi, msg
    mov rdx, mlen
    syscall

    mov rax, 60
    mov rdi, 4
    syscall
```

Now when I disassemble it in gdb, it gives the output below:

```
(gdb) disas
Dump of assembler code for function _start:
=> 0x00000000004000b0 <+0>:  mov    eax,0x1
   0x00000000004000b5 <+5>:  mov    edi,0x1
   0x00000000004000ba <+10>: movabs rsi,0x6000d8
   0x00000000004000c4 <
```

Which is faster, imm64 or m64 for x86-64?

陌路散爱 submitted on 2019-11-27 08:29:09
Question: After testing about 10 billion times, imm64 comes out about 0.1 nanoseconds faster than m64 on AMD64, yet the m64 form seems like it should be faster, and I don't really understand why. Isn't the address of val_ptr in the following code an immediate value itself?

```asm
# Text section
.section __TEXT,__text,regular,pure_instructions
# 64-bit code
.code64
# Intel syntax
.intel_syntax noprefix
# Target macOS High Sierra
.macosx_version_min 10,13,0
# Make those two test functions global for the C measurer
.globl _test1
.globl _test2
#
```

Loading an xmm from GP regs

久未见 submitted on 2019-11-27 08:21:20
Question: Let's say you have values in rax and rdx that you want to load into an xmm register. One way would be:

```asm
movq   xmm0, rax
pinsrq xmm0, rdx, 1
```

It's pretty slow, though! Is there a better way?

Answer 1: You're not going to do better for latency or uop count on recent Intel or AMD (I mostly looked at Agner Fog's tables for Ryzen / Skylake). movq+movq+punpcklqdq is also 3 uops, for the same port(s). On Intel / AMD, storing the GP registers to a temporary location and reloading them with a 16-byte read may be