micro-optimization

How to force GCC to assume that a floating-point expression is non-negative?

南楼画角 posted on 2019-12-31 08:33:23
Question: There are cases where you know that a certain floating-point expression will always be non-negative. For example, when computing the length of a vector, one does sqrt(a[0]*a[0] + ... + a[N-1]*a[N-1]) (NB: I am aware of std::hypot; this is not relevant to the question), and the expression under the square root is clearly non-negative. However, GCC outputs the following assembly for sqrt(x*x):

    mulss xmm0, xmm0
    pxor xmm1, xmm1
    ucomiss xmm1, xmm0
    ja .L10
    sqrtss xmm0, xmm0
    ret
    .L10:
    jmp sqrtf

Are there any performance test results for usage of likely/unlikely hints?

我的梦境 posted on 2019-12-30 06:11:11
Question: GCC features likely/unlikely hints that help the compiler generate machine code with better branch prediction. Is there any data on how proper usage, or failure to use those hints, affects the performance of real code on real systems?
Answer 1: The question differs, but Peter Cordes's answer on this question gives a clear hint ;). Modern CPUs ignore static hints and use dynamic branch prediction.
Answer 2: I don't know of any thorough analysis of such particular hints. In any case, it would be
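For readers unfamiliar with the hints being discussed, here is a minimal sketch of the conventional `likely`/`unlikely` wrappers around GCC's `__builtin_expect`. The function `sum_or_fail` is a hypothetical example. The hint mainly influences how the compiler lays out the code, and any benefit should be measured rather than assumed.

```c
#include <stddef.h>

/* Conventional wrappers around GCC/Clang's __builtin_expect. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sums the values in v; returns -1 as soon as a negative value is
 * seen. The error path is marked as the rare one (hypothetical example). */
long sum_or_fail(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0))
            return -1;
        total += v[i];
    }
    return total;
}
```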

Does each Floating point operation take the same time?

 ̄綄美尐妖づ posted on 2019-12-29 09:15:11
Question: I believe integer addition or subtraction always takes the same time no matter how big the operands are. The time needed for the ALU output to stabilize may vary with the input operands, but the CPU component that consumes the ALU output will wait a sufficiently long time so that any integer operation is processed in the SAME number of cycles. (The cycles needed for ADD, SUB, MUL, and DIV will be different, but ADD will take the same number of cycles regardless of input operands, I think.) Is this true for floating point operation,

How eliminate duplicate cases from a switch statement in PHP

99封情书 posted on 2019-12-27 03:36:12
Question: I'm making a function to return whether or not the given user_id is a staff member of the site. This is what I have and it works; however, I feel like it can be greatly improved.

    public function isUserStaff($uid) {
        $stmt = $this->conn->prepare("SELECT user_role FROM users WHERE user_id=:user_id");
        $stmt->execute(array(':user_id'=>$uid));
        $userRow = $stmt->fetch(PDO::FETCH_ASSOC);
        $role = $userRow['user_role'];
        switch($role) {
            case 3: return true; break;
            case 4: return true; break;
            case 5:
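The usual way to collapse duplicate cases is to stack the labels so that they fall through to one shared body. PHP's `switch` supports the same fall-through; the sketch below uses C only to keep this page's examples in a single language. The function name `is_staff_role` is hypothetical, and the role values 3, 4 and 5 are just the ones visible in the excerpt above.

```c
#include <stdbool.h>

/* Returns true for the staff roles visible in the excerpt (3, 4, 5).
 * Adjacent case labels share one body, so "return true" appears once. */
bool is_staff_role(int role)
{
    switch (role) {
    case 3:
    case 4:
    case 5:
        return true;
    default:
        return false;
    }
}
```

A `break` after `return` is unreachable, so it can be dropped as well.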

How eliminate duplicate cases from a switch statement in PHP

天涯浪子 posted on 2019-12-27 03:36:03
Question: I'm making a function to return whether or not the given user_id is a staff member of the site. This is what I have and it works; however, I feel like it can be greatly improved.

    public function isUserStaff($uid) {
        $stmt = $this->conn->prepare("SELECT user_role FROM users WHERE user_id=:user_id");
        $stmt->execute(array(':user_id'=>$uid));
        $userRow = $stmt->fetch(PDO::FETCH_ASSOC);
        $role = $userRow['user_role'];
        switch($role) {
            case 3: return true; break;
            case 4: return true; break;
            case 5:

What is the minimal number of dependency chains to maximize the execution throughput?

怎甘沉沦 posted on 2019-12-25 09:48:09
Question: Given a chain of instructions linked by true dependencies and repeated periodically (i.e. a loop), for example (a->b->c)->(a->b->c)->..., assume that it can be split into several shorter, independent sub-dependency chains to benefit from out-of-order execution:

    (a0->b0->c0)->(a0->b0->c0)->...
    (a1->b1->c1)->(a1->b1->c1)->...

The out-of-order engine schedules each instruction to the corresponding CPU unit, which has a latency and a reciprocal throughput. What is the optimal number of sub
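A common rule of thumb is that the number of independent chains needed to keep a unit busy is roughly its latency divided by its reciprocal throughput. With a hypothetical latency of 4 cycles and a reciprocal throughput of 0.5 cycles, that gives 4 / 0.5 = 8 chains. The sketch below shows the splitting technique itself on a floating-point sum, using four accumulators. The function name `sum4` and the choice of four chains are assumptions for illustration, not tuned values.

```c
#include <stddef.h>

/* Sums n floats using four independent accumulators, so that four
 * addition chains can be in flight at once instead of one.
 * Note: this changes the order of the additions, so the result may
 * differ in rounding from a simple left-to-right sum. */
float sum4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: each accumulator depends only on itself. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Tail: leftover elements go into the first accumulator. */
    for (; i < n; i++)
        s0 += a[i];

    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, a compiler will generally not perform this transformation on its own unless it is given options that permit reassociation.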

Efficient Assembly multiplication

╄→гoц情女王★ posted on 2019-12-24 05:20:32
Question: I started practicing assembly not too long ago. I want to implement efficient multiplication using the assembly instructions lea and shift. I want to write a C program that calls an assembly procedure matching a constant argument received from the user, and that multiplies another argument received from the user by that constant. How can I make this code efficient? Which numbers (if any) can I group into the same procedure? For example, I think that I can group 2,4,8,... to the same procedure as
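As a rough sketch of the grouping idea: powers of two need only a shift, and because x86 `lea` can compute base + index*scale with a scale of 1, 2, 4 or 8, the constants 3, 5 and 9 each map to a single `lea`. The C functions below express the same arithmetic. Their names are hypothetical, and an optimizing compiler is free to choose different instructions.

```c
#include <stdint.h>

/* Powers of two: one shift, with the shift count as a parameter.
 * The caller must keep k below 64. */
uint64_t mul_pow2(uint64_t x, unsigned k) { return x << k; }

/* x + x*scale with scale in {2, 4, 8}: the shape of a single lea. */
uint64_t mul3(uint64_t x) { return x + (x << 1); }  /* x + 2x */
uint64_t mul5(uint64_t x) { return x + (x << 2); }  /* x + 4x */
uint64_t mul9(uint64_t x) { return x + (x << 3); }  /* x + 8x */

/* Other constants can be composed, e.g. 10 = 5 * 2. */
uint64_t mul10(uint64_t x) { return mul5(x) << 1; }
```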

In x86 assembly, is it better to use two separate registers for imul?

廉价感情. posted on 2019-12-24 01:39:42
Question: I am wondering, mostly out of curiosity, whether using the same register for an operation is better than using two. Which would be better, considering performance and/or other concerns?

    mov %rbx, %rcx
    imul %rcx, %rcx

or

    mov %rbx, %rcx
    imul %rbx, %rcx

Any tips on how to benchmark this, or resources where I could read about this type of thing, would be appreciated, as I am new to assembly.
Answer 1: "resources where I could read about this type of thing": See Agner Fog's microarch pdf, and his optimizing

Assembly - How to score a CPU instruction by latency and throughput

瘦欲@ posted on 2019-12-24 00:24:11
Question: I'm looking for a formula or method to measure how fast an instruction is, or more specifically, to give each instruction a "score" in CPU cycles. Let's take the following assembly program as an example:

    nop
    mov eax,dword ptr [rbp+34h]
    inc eax
    mov dword ptr [rbp+34h],eax

and the following Intel Skylake information:

    mov r,m : Throughput=0.5 Latency=2
    mov m,r : Throughput=1 Latency=2
    nop : Throughput=0.25 Latency=none
    inc : Throughput=0.25 Latency=1

I know that the order of the

How to get gcc to generate decent code that checks if a buffer is full of NUL bytes?

让人想犯罪 __ posted on 2019-12-23 20:02:12
Question: I'm implementing a program that parses tape archives. Part of the parser logic is checking for an end-of-archive marker, which is a 512-byte block full of NUL bytes. I wrote the following code for this purpose, expecting gcc to optimize this well:

    int is_eof_block(const char usth[static 512])
    {
        size_t i;
        for (i = 0; i < 512; i++)
            if (usth[i] != '\0')
                return 0;
        return 1;
    }

But to my surprise, gcc still generates terrible code for that, even though I explicitly allow it to access the whole 512
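One possible alternative is to drop the early exit and OR the block together one 64-bit word at a time, which gives the compiler a branch-free loop that is typically easier to vectorize. This is a sketch under that assumption, not the asker's code. The name `is_eof_block_words` is hypothetical, and the generated assembly should be checked for the compiler and flags actually in use.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if all 512 bytes are NUL, 0 otherwise.
 * Always reads the whole block: every 8-byte word is ORed into acc,
 * and the single test at the end replaces the per-byte branch. */
int is_eof_block_words(const char usth[static 512])
{
    uint64_t acc = 0;
    for (size_t i = 0; i < 512; i += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, usth + i, sizeof w);  /* alignment-safe load */
        acc |= w;
    }
    return acc == 0;
}
```

The trade-off is that a block with a non-NUL byte near the start is no longer rejected early, which matters little when most blocks checked are expected to be short.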