- Avoid memory fragmentation.
- Aligned memory.
- SIMD instructions.
- Lockless multithreading.
- Use proper acceleration trees, such as kd-tree, cover tree, octree, quadtree, etc.
  - Define these in ways that allow for the first three (i.e. keep all the nodes in one contiguous block).
- Inlining. The lowest-hanging but quite delicious fruit.
The performance boosts you can get this way are astonishing. For me it was a factor of 1500 for a computation-heavy app. Not over brute force, but over similar data structures written in a major software package.
I'd not bother with stuff like preincrement over postincrement. That only gives savings in certain (unimportant) cases, and most of what's mentioned is similar stuff that might scrape out an extra 1% here and there once in a while but usually isn't worth the bother.
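To make the "nodes all in one block" point concrete, here is a minimal sketch of how a tree's nodes might be stored. It is not the code from the app mentioned above; the names `KdNode` and `NodePool`, the field layout, and the 16-byte alignment are all assumptions made for illustration. The idea is that children are referenced by index into a single contiguous array rather than by pointers to separately allocated nodes, which avoids fragmentation and keeps each node aligned for 128-bit SIMD loads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node layout: exactly 16 bytes and 16-byte aligned.
// Children are indices into the pool, not pointers, so the whole
// tree lives in one allocation.
struct alignas(16) KdNode {
    float split;         // position of the splitting plane
    std::uint32_t axis;  // 0, 1, or 2
    std::int32_t left;   // index of the left child, -1 if none
    std::int32_t right;  // index of the right child, -1 if none
};

// Hypothetical pool: one contiguous block holding every node.
class NodePool {
public:
    explicit NodePool(std::size_t capacity) { nodes_.reserve(capacity); }

    // Appends a leaf node and returns its index.
    std::int32_t make(float split, std::uint32_t axis) {
        assert(axis < 3);
        nodes_.push_back(KdNode{split, axis, -1, -1});
        return static_cast<std::int32_t>(nodes_.size() - 1);
    }

    KdNode& operator[](std::int32_t i) {
        return nodes_[static_cast<std::size_t>(i)];
    }

    const KdNode* data() const { return nodes_.data(); }
    std::size_t size() const { return nodes_.size(); }

private:
    std::vector<KdNode> nodes_;
};
```

Using indices instead of pointers also means the pool can grow or be copied without fixing up any links, and on 64-bit targets an index is half the size of a pointer.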