micro-optimization | 易学教程

How to force GCC to assume that a floating-point expression is non-negative?

Question: There are cases where you know that a certain floating-point expression will always be non-negative. For example, when computing the length of a vector, one does sqrt(a[0]*a[0] + ... + a[N-1]*a[N-1]) (NB: I am aware of std::hypot; this is not relevant to the question), and the expression under the square root is clearly non-negative. However, GCC outputs the following assembly for sqrt(x*x):

    mulss   xmm0, xmm0
    pxor    xmm1, xmm1
    ucomiss xmm1, xmm0
    ja      .L10
    sqrtss  xmm0, xmm0
    ret
.L10:
    jmp     sqrtf

Are there any performance test results for usage of likely/unlikely hints?

Question: GCC features likely/unlikely hints that help the compiler generate machine code with better branch prediction. Is there any data on how proper usage of those hints, or failure to use them, affects the performance of real code on real systems?

Answer 1: The question differs, but Peter Cordes's answer on this question gives a clear hint ;). Modern CPUs ignore static hints and use dynamic branch prediction.

Answer 2: I don't know of any thorough analysis of such particular hints. In any case, it would be
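For readers unfamiliar with the hints under discussion, here is a minimal sketch. GCC and Clang expose the hint as `__builtin_expect`; `likely` and `unlikely` are conventional wrapper macros, not built-in keywords. The function `count_valid` is a hypothetical example, and the sketch makes no claim about measurable speedups.

```c
#include <stddef.h>

/* Conventional wrappers around GCC/Clang's __builtin_expect.
 * The !! normalizes any truthy value to exactly 1. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: count the non-negative entries, treating
 * negative values as the rare case. The hint can only influence how
 * the compiler lays out the code; the result is the same with or
 * without it. */
size_t count_valid(const int *v, size_t n)
{
    size_t valid = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0))
            continue;           /* path marked as rare */
        valid++;                /* path marked as common */
    }
    return valid;
}
```

Comparing the generated assembly with and without the macro is one way to see whether the hint changed anything for a given compiler version.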

Does each Floating point operation take the same time?

Question: I believe integer addition or subtraction always takes the same time no matter how big the operands are. The time needed for the ALU output to stabilize may vary with the input operands, but the CPU component that consumes the ALU output will wait long enough that any integer operation is processed in the same number of cycles. (The cycles needed for ADD, SUB, MUL, and DIV will differ, but ADD will take the same number of cycles regardless of its input operands, I think.) Is this true for floating point operation,

How eliminate duplicate cases from a switch statement in PHP

Question: I'm making a function to return whether or not the given user_id is a staff member of the site. This is what I have and it works; however, I feel like it can be greatly improved.

    public function isUserStaff($uid) {
        $stmt = $this->conn->prepare("SELECT user_role FROM users WHERE user_id=:user_id");
        $stmt->execute(array(':user_id'=>$uid));
        $userRow = $stmt->fetch(PDO::FETCH_ASSOC);
        $role = $userRow['user_role'];
        switch($role) {
            case 3: return true; break;
            case 4: return true; break;
            case 5:

What is the minimal number of dependency chains to maximize the execution throughput?

Question: Given a chain of instructions linked by true dependencies and repeated periodically (i.e. a loop), for example (a->b->c)->(a->b->c)->..., assume that it can be split into several shorter, independent sub-dependency chains to benefit from out-of-order execution:

    (a0->b0->c0)->(a0->b0->c0)->...
    (a1->b1->c1)->(a1->b1->c1)->...

The out-of-order engine schedules each instruction to the corresponding CPU unit, which has a latency and a reciprocal throughput. What is the optimal number of sub
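As an illustration of splitting one dependency chain into several, here is a minimal sketch of a reduction with multiple accumulators. The function name `sum4` and the choice of four chains are assumptions made for the example. A common rule of thumb is that the number of independent chains needed to saturate an execution unit is roughly its latency divided by its reciprocal throughput, so the right count depends on the target CPU.

```c
#include <stddef.h>

/* Sum an array using four independent accumulators, so that four
 * addition chains can be in flight at once instead of one.
 * The count of 4 is illustrative, not tuned for any specific CPU. */
double sum4(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main loop: each accumulator depends only on its own previous value. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Tail: handle the remaining 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i];

    /* Combine the partial sums once, outside the hot loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum. That is why compilers do not apply this transformation to floating-point code on their own unless given relaxed floating-point options.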

Efficient Assembly multiplication

Question: I started practicing assembly not too long ago. I want to implement efficient multiplication using the assembly instructions lea and shift. I want to write a C program that calls an assembly procedure chosen to fit a constant argument received from the user, and that multiplies another argument received from the user by that constant. How can I make this code efficient? Which numbers can I group (if any) to fit the same procedure? For example, I think that I can group 2,4,8,... to the same procedure as
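The grouping idea can be sketched in C before moving to assembly. The function names below are hypothetical. Powers of two share one shift routine, and the constants 3, 5, and 9 all have the form x + x*{2,4,8}, which matches what a single x86 `lea` with a scaled index can compute.

```c
#include <stdint.h>

/* x * 2^k: one shift covers the whole 2, 4, 8, ... group.
 * k must be less than 32 for a 32-bit operand. */
uint32_t mul_pow2(uint32_t x, unsigned k) { return x << k; }

/* x * 3, x * 5, x * 9: each is x + x * {2, 4, 8}, the shape of a
 * single lea, e.g. lea eax, [rdi + rdi*4] for x * 5. */
uint32_t mul3(uint32_t x) { return x + (x << 1); }
uint32_t mul5(uint32_t x) { return x + (x << 2); }
uint32_t mul9(uint32_t x) { return x + (x << 3); }

/* Other constants can be composed from these, e.g. x * 10 = (x * 5) * 2. */
uint32_t mul10(uint32_t x) { return mul5(x) << 1; }
```

Compiling a plain `x * 5` with optimization enabled and reading the output is a quick way to check which decomposition the compiler itself picks.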

In x86 assembly, is it better to use two separate registers for imul?

Question: I am wondering, mostly out of curiosity, whether using the same register for an operation is better than using two. Which would be better, considering performance and/or other concerns?

    mov  %rbx, %rcx
    imul %rcx, %rcx

or

    mov  %rbx, %rcx
    imul %rbx, %rcx

Any tips on how to benchmark this, or resources where I could read about this type of thing, would be appreciated, as I am new to assembly.

Answer 1: "resources where I could read about this type of thing": See Agner Fog's microarch pdf, and his optimizing

Assembly - How to score a CPU instruction by latency and throughput

Question: I'm looking for a formula or method to measure how fast an instruction is or, more specifically, to give each instruction a "score" in CPU cycles. Let's take the following assembly program as an example,

    nop
    mov eax,dword ptr [rbp+34h]
    inc eax
    mov dword ptr [rbp+34h],eax

and the following Intel Skylake information:

    mov r,m : Throughput=0.5  Latency=2
    mov m,r : Throughput=1    Latency=2
    nop     : Throughput=0.25 Latency=none
    inc     : Throughput=0.25 Latency=1

I know that the order of the

How to get gcc to generate decent code that checks if a buffer is full of NUL bytes?

Question: I'm implementing a program that parses tape archives. Part of the parser logic is checking for an end-of-archive marker, which is a 512-byte block full of NUL bytes. I wrote the following code for this purpose, expecting gcc to optimize this well:

    int is_eof_block(const char usth[static 512])
    {
        size_t i;
        for (i = 0; i < 512; i++)
            if (usth[i] != '\0')
                return 0;
        return 1;
    }

But to my surprise, gcc still generates terrible code for that, even though I explicitly allow it to access the whole 512
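One possible hand-written alternative to the byte loop in the question is sketched below. The function name `is_eof_block_words` is hypothetical, and the sketch does not claim to be what the answers to the question recommend. It ORs the block together eight bytes at a time and tests the result once, trading the early exit for a branch-free loop body.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if all 512 bytes are zero, 0 otherwise.
 * Accumulates the block with OR, eight bytes per step, instead of
 * testing and branching on every byte. memcpy is used for the loads
 * so the code does not depend on the buffer's alignment. */
int is_eof_block_words(const char *usth)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < 512; i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, usth + i, sizeof w);
        acc |= w;               /* any set bit anywhere survives in acc */
    }
    return acc == 0;
}
```

Whether this beats the original loop depends on the compiler version, the flags, and how often blocks contain an early non-zero byte, so it is worth measuring on real archives.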