micro-optimization

Why does instruction cache alignment improve performance in set associative cache implementations?

回眸只為那壹抹淺笑 submitted on 2021-02-19 03:16:55
Question: I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything. I understand the concept of cache hits and their importance in computing speed. But it seems that in set-associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a block boundary, the CPU should still get a cache hit since

Why is using structure Vector3I instead of three ints much slower in C#?

主宰稳场 submitted on 2021-02-18 21:43:34
Question: I'm processing lots of data in a 3D grid, so I wanted to implement a simple iterator instead of three nested loops. However, I ran into a performance problem: first, I implemented a simple loop using only int x, y and z variables. Then I implemented a Vector3I structure and used that, and the calculation time doubled. Now I'm struggling with the question: why is that? What did I do wrong? Example for reproduction: using BenchmarkDotNet.Attributes; using BenchmarkDotNet.Running; using

Understanding `_mm_prefetch`

雨燕双飞 submitted on 2021-02-10 00:24:39
Question: The answer to What are _mm_prefetch() locality hints? goes into detail on what each hint means. My question is: which one do I WANT? I work on a function that is called repeatedly, billions of times, with some int parameter among others. The first thing I do is look up a cached value using that parameter (its low 32 bits) as a key into a 4 GB cache. Based on the algorithm from which this function is called, I know that most often the key will be doubled (shifted left by 1 bit) from one call to

About negating a signed integer in MIPS?

瘦欲@ submitted on 2021-02-08 06:57:22
Question: I'm thinking about how to negate a signed integer in mips32. My intuition is to use the definition of 2's complement (suppose $s0 is the number to be negated):

    nor   $t0, $s0, $s0   ; 1's complement
    addiu $t0, $t0, 1     ; 2's = 1's + 1

Then I realized it can also be done like:

    sub $t0, $zero, $s0

So... what's the difference? Which is faster? IIRC sub will try to detect overflow, but would this make it slower? Finally, is there any other way to do so? Answer 1: subu $t0, $zero, $s0 is the best way, and is