I was reading a blog post by a game coder for Introversion and he is busily trying to squeeze every CPU tick he can out of the code. One trick he mentions off-hand is to
We have slightly different guidelines for members here (ARM architecture target, mostly THUMB 16-bit codegen for various reasons):
"group by alignment" is somewhat obvious, and outside the scope of this question; it avoids padding, uses less memory, etc.
The second bullet, though, derives from the small 5-bit "immediate" field size on the THUMB LDRB (Load Register Byte), LDRH (Load Register Halfword), and LDR (Load Register) instructions.
5 bits means offsets of 0-31 can be encoded. Effectively, assuming "this" is handy in a register (which it usually is):
If they're outside this range, multiple instructions have to be generated: either a sequence of ADDs with immediates to accumulate the appropriate address in a register, or worse yet, a load from the literal pool at the end of the function.
If we do hit the literal pool, it hurts: the literal pool goes through the d-cache, not the i-cache; this means at least a cacheline worth of loads from main memory for the first literal pool access, and then a host of potential eviction and invalidation issues between the d-cache and i-cache if the literal pool doesn't start on its own cache line (i.e. if the actual code doesn't end at the end of a cache line).
(If I had a few wishes for the compiler we're working with, a way to force literal pools to start on cacheline boundaries would be one of them.)
(Unrelatedly, one of the things we do to avoid literal pool usage is keep all of our "globals" in a single table. This means one literal pool lookup for the "GlobalTable", rather than multiple lookups for each global. If you're really clever you might be able to keep your GlobalTable in some sort of memory that can be accessed without loading a literal pool entry -- was it .sbss?)