micro-optimization | 易学教程

Fast search of some nibbles in two ints at same offset (C, microoptimisation)

Question: My task is to check (trillions of checks) whether two ints contain any of the predefined pairs of nibbles (first pair: 0x2, 0x7; second pair: 0xd, 0x8). For example:

    bit offset:   12345678
    first int:  0x3d542783     first pair of  0x2    second:  0xd
    second int: 0x486378d9     nibbles:       0x7    pair:    0x8
                   ^  ^

So, for this example I mark the two offsets with the needed pairs (offsets 2 and 5, but not 7). The actual offsets and the number of pairs found are not needed in my task. So, for two given ints, the question is: do they contain the
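One possible way to express the check described in this excerpt is a branch-free bit trick. This is a minimal sketch, not the asker's or any answer's method: the function names `has_zero_nibble` and `has_pair` are hypothetical, and the sketch assumes 32-bit unsigned inputs. XOR-ing each int with the wanted nibble repeated eight times turns matching nibbles into zero; OR-ing the two results leaves a zero nibble only where both ints match at the same offset; a zero-nibble test then answers the yes/no question.

```c
#include <stdint.h>

/* Nonzero iff at least one 4-bit nibble of v is zero
   (nibble-sized variant of the well-known "has zero byte" trick). */
static inline uint32_t has_zero_nibble(uint32_t v)
{
    return (v - 0x11111111u) & ~v & 0x88888888u;
}

/* Returns 1 if, at some nibble offset, a holds 0x2 and b holds 0x7,
   or a holds 0xd and b holds 0x8; returns 0 otherwise. */
static inline int has_pair(uint32_t a, uint32_t b)
{
    /* A nibble of m1 is zero only where a has 0x2 AND b has 0x7. */
    uint32_t m1 = (a ^ 0x22222222u) | (b ^ 0x77777777u);
    /* A nibble of m2 is zero only where a has 0xd AND b has 0x8. */
    uint32_t m2 = (a ^ 0xddddddddu) | (b ^ 0x88888888u);
    return (has_zero_nibble(m1) | has_zero_nibble(m2)) != 0;
}
```

With the values from the example, `has_pair(0x3d542783u, 0x486378d9u)` reports a match, while a reversed pair such as 0x7 in the first int and 0x2 in the second does not count. Whether this beats a table lookup or a SIMD approach would have to be measured on the target machine.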

How efficient is an if statement compared to a test that doesn't use an if? (C++)

Question: I need a program to get the smaller of two numbers, and I'm wondering whether using a standard "if x is less than y"

    int a, b, low;
    if (a < b) low = a;
    else low = b;

is more or less efficient than this:

    int a, b, low;
    low = b + ((a - b) & ((a - b) >> 31));

(or the variation of putting int delta = a - b at the top and replacing instances of a - b with that). I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of
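The two variants from this excerpt can be put side by side as small functions. This is only a sketch for comparison: the names `min_branch` and `min_bits` are hypothetical, and the bit-trick version carries assumptions the plain comparison does not, namely that `a - b` does not overflow and that `>>` on a negative signed value is an arithmetic shift (implementation-defined in C, though common compilers behave this way).

```c
#include <stdint.h>

/* Plain comparison. Compilers often turn this into a conditional
   move rather than an actual branch, but that is not guaranteed. */
static inline int32_t min_branch(int32_t a, int32_t b)
{
    return (a < b) ? a : b;
}

/* Bit-trick version from the question.
   Assumes: a - b does not overflow, and >> on a negative int32_t
   is an arithmetic shift (so delta >> 31 is 0 or -1). */
static inline int32_t min_bits(int32_t a, int32_t b)
{
    int32_t delta = a - b;
    /* If delta < 0 the mask is all ones and we add delta (giving a);
       otherwise the mask is zero and the result is b. */
    return b + (delta & (delta >> 31));
}
```

Both functions agree for inputs such as (3, 7) or (-5, 2); they can differ when `a - b` overflows, which is one reason to inspect the generated assembly for the simple version before reaching for the trick.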

Preserving the Execution pipeline

Question: Return values are frequently checked for errors, but the code that continues to execute may be structured in different ways:

    if(!ret) { doNoErrorCode(); } exit(1);

or

    if(ret) { exit(1); } doNoErrorCode();

Heavyweight CPUs can speculate about branches taken in close proximity using simple statistics. I studied a 4-bit mechanism for branch speculation (-2,-1,0,+1,+2) where zero is unknown and 2 will be considered a true branch. Considering the simple technique above, my questions are about how to structure code. I assume that there must be a convention among major
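The counter scheme mentioned in this excerpt can be illustrated with a small saturating counter. This is a rough sketch of one reading of the (-2,-1,0,+1,+2) description, not a model of any real CPU: the names `counter_update` and `counter_predict`, and the rule "predict taken only when the counter is positive", are assumptions made for illustration.

```c
/* Hypothetical 5-state saturating counter: values -2..+2,
   where 0 means "unknown" (an assumed reading of the excerpt). */
enum { CNT_MIN = -2, CNT_MAX = 2 };

/* Move the counter toward +2 when the branch was taken and toward -2
   when it was not, saturating at both ends. */
static int counter_update(int state, int taken)
{
    if (taken)
        return state < CNT_MAX ? state + 1 : CNT_MAX;
    return state > CNT_MIN ? state - 1 : CNT_MIN;
}

/* Predict "taken" only when the counter is positive (assumed rule). */
static int counter_predict(int state)
{
    return state > 0;
}
```

Starting from 0, three taken branches leave the counter saturated at +2, and a single not-taken outcome only lowers it to +1, so the prediction stays "taken". Under this kind of scheme, a rarely executed error path disturbs the prediction for the common path very little, whichever of the two code layouts is used.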

Optimize CSS: Narrow Definition (#mytable tbody span.myclass) better?

Question: I wondered whether a 'narrow' definition such as

    #mytable tbody span.myclass { color: #ffffff; }

is better/faster to parse than just

    .myclass { color: #ffffff; }

I read somewhere that narrow definitions supposedly have some kind of adverse effect on CSS speed, but I can't remember where, and it's been a while, so I just wanted to clarify whether it matters and, if it does, which solution is better/faster. Thank you!

Answer 1: Google's Page Speed has some information regarding using efficient CSS selectors. I suggest starting there. So (very) basically, they recommend to: Avoid

Only pass if-statement once

Question: I am currently building a kernel and have an if-statement that could (at worst) run a few million times. Yet the result is clear after the first run. Knowing that the result of cmp is stored in a register, is there a way of remembering the result of the above-mentioned statement so that it does not have to be evaluated again? acpi_version is GUARANTEED to never change.

    SDT::generic_sdt* sdt_wrapper::get_table (size_t index) { //function is run many times with varying index
        if (index >= number_tables) { //no

How do I reduce execution time and number of cycles for a factorial loop? And/or code-size?

Question: Basically, I'm having a hard time getting the execution time any lower than it is, as well as reducing the number of clock cycles and the memory size. Does anyone have any idea how I can do this? The code works fine; I just want to change it a bit. I wrote working code and don't want to break it, but I also don't know what changes to make.

    ; Calculation of a factorial value using a simple loop
    ; set up the exception addresses
        THUMB
        AREA RESET, CODE, READONLY
        EXPORT __Vectors
        EXPORT Reset

According to Intel my cache should be 24-way associative though it's 12-way, how is that?

Question: According to the "Intel 64 and IA-32 Architectures Optimization Reference Manual," April 2012, page 2-23:

    The physical addresses of data kept in the LLC data arrays are distributed among the cache slices by a hash function, such that addresses are uniformly distributed. The data array in a cache block may have 4/8/12/16 ways corresponding to 0.5M/1M/1.5M/2M block size. However, due to the address distribution among the cache blocks from the software point of view, this does not appear as a normal N

Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

Question: I understand it's important to use VZEROUPPER when mixing SSE and AVX code, but what if I only use AVX (and ordinary x86-64 code) without using any legacy SSE instructions? If I never use a single SSE instruction in my code, is there any performance reason why I would ever need to use VZEROUPPER? This is assuming I'm not calling into any external libraries (that might be using SSE).

Answer 1: You're correct that if your whole program doesn't use any non-VEX instructions that write xmm registers,