5. Loose/Tight Double-Grid With 500k Agents
Above is a little GIF showing 500,000 agents bouncing off of each other every time step using a new "loose/tight grid" data structure I was inspired to create after writing the answer about loose quadtrees. I'll try to go over how it works.
It is the best-performing data structure so far among all the ones I've shown that I've implemented (though it could just be me), handling half a million agents better than the initial quadtree handled 100k, and better than the loose quadtree handled 250k. It also requires the least amount of memory and has the most stable memory use among these three. This is all still working in just one thread, no SIMD code, no fancy micro-optimizations as I typically apply for production code -- just a straightforward implementation from a couple hours of work.
I also improved the drawing bottlenecks without improving my rasterization code whatsoever. It's because the grid lets me easily traverse it in a way that's cache-friendly for image processing (drawing the elements in the cells of the grid one-by-one coincidentally happens to lead to very cache-friendly image processing patterns when rasterizing).
Funnily enough it also took the shortest time for me to implement (only 2 hours while I spent 5 or 6 hours on the loose quadtree), and it also requires the least amount of code (and arguably has the simplest code). Though that might just be because I've accumulated so much experience implementing grids.
Loose/Tight Double Grid
So I covered how grids worked in the fundamentals section (see part 2), but this is a "loose grid". Each grid cell stores its own bounding box that is allowed to shrink as elements are removed and grow as elements are added. As a result each element only needs to be inserted once into the grid based on which cell its center position lands inside, like so:
// Ideally use multiplication here with inv_cell_w or inv_cell_h.
int cell_x = clamp(floor(elt_x / cell_w), 0, num_cols-1);
int cell_y = clamp(floor(ely_y / cell_h), 0, num_rows-1);
int cell_idx = cell_y*num_rows + cell_x;
// Insert element to cell at 'cell_idx' and expand the loose cell's AABB.
Cells store elements and AABBs like this:
struct LGridLooseCell
{
// Stores the index to the first element using an indexed SLL.
int head;
// Stores the extents of the grid cell relative to the upper-left corner
// of the grid which expands and shrinks with the elements inserted and
// removed.
float l, t, r, b;
};
However, loose cells pose a conceptual problem. Given that they have these variable-sized bounding boxes which can grow huge if we insert a huge element, how do we avoid checking every single freaking cell of the grid when we want to find out which loose cells and corresponding elements intersect a search rectangle? There could be a case where we're searching the upper-right corner of the loose grid but there's a cell in the lower-left on the opposite side which has grown large enough to intersect that area as well. Without a solution to this problem, we're reduced to having to check all the loose cells in linear time for matches.
... and the solution is that actually this is a "double grid". The loose grid cells themselves are partitioned into a tight grid. When we do the analogical search above, we first look through the tight grid like so:
tx1 = clamp(floor(search_x1 / cell_w), 0, num_cols-1);
tx2 = clamp(floor(search_x2 / cell_w), 0, num_cols-1);
ty1 = clamp(floor(search_y1 / cell_h), 0, num_rows-1);
ty2 = clamp(floor(search_y2 / cell_h), 0, num_rows-1);
for ty = ty1, ty2:
{
trow = ty * num_cols
for tx = tx1, tx2:
{
tight_cell = tight_cells[trow + tx];
for each loose_cell in tight_cell:
{
if loose_cell intersects search area:
{
for each element in loose_cell:
{
if element intersects search area:
add element to query results
}
}
}
}
}
Tight cells store a singly-linked index list of loose cells, like so:
struct LGridLooseCellNode
{
// Points to the next loose cell node in the tight cell.
int next;
// Stores an index to the loose cell.
int cell_idx;
};
struct LGridTightCell
{
// Stores the index to the first loose cell node in the tight cell using
// an indexed SLL.
int head;
};
And voila, that's the basic idea of the "loose double-grid". When we insert an element, we expand the loose cell's AABB just as we do for a loose quadtree, only in constant-time instead of logarithmic time. However, we also insert the loose cell to the tight grid in constant time based on its rectangle (and it could be inserted into multiple cells).
The combo of these two (tight grid to rapidly find loose cells, and loose cells to rapidly find elements) gives a very lovely data structure where each element is inserted into a single cell with constant time searches, insertion, and removals.
The only big downside I see is that we do have to store all these cells and potentially still have to search more cells than we need, but they're still reasonably cheap (20 bytes per cell in my case) and it's easy to traverse the cells on searches in a very cache-friendly access pattern.
I recommend giving this idea of "loose grids" a try. It is arguably a lot easier to implement than quadtrees and loose quadtrees and, more importantly, optimize, since it immediately lends itself to a cache-friendly memory layout. As a super cool bonus, if you can anticipate the number of agents in your world in advance, it is almost 100% perfectly stable and immediately in terms of memory use, since an element always occupies exactly one cell, and the number of cells total is fixed (since they don't subdivide/split). The only minor instability in memory use is that those loose cells might expand and get inserted into additional tight cells in the coarser grid every now and then, yet this should be quite infrequent. As a result, the memory use is very stable/consistent/predictable and also often the corresponding frame rates. That could be a huge bonus for certain hardware and software where you want to pre-allocate all memory in advance.
It's also really easy to make it work with SIMD to do multiple coherent queries simultaneously with vectorized code (in addition to multithreading), since the traversal, if we can even call it that, is flat (it's just a constant-time lookup into a cell index which involves some arithmetic). As a result it's fairly easy to apply optimization strategies similar to ray packets that Intel applies to their raytracing kernel/BVH (Embree) to test for multiple coherent rays simultaneously (in our case they would be "agent packets" for collision), except without such fancy/complex code since grid "traversal" is so much simpler.
On Memory Use and Efficiency
I covered this a bit in part 1 on efficient quadtrees, but reducing memory use is often the key to speed these days since our processors are so fast once you get the data into, say, L1 or a register, but DRAM access is relatively so, so slow. We have so precious little fast memory still even if we have an insane amount of slow memory.
I think I'm kind of lucky starting off at a time when we had to be very frugal with memory use (though not as much as the people before me), where even a megabyte of DRAM was considered amazing. Some of the things I learned back then and habits I picked up (even though I'm far from an expert) coincidentally aligns with performance. Some of it I've had to discard as bad habits which are counter-productive these days and I've learned to embrace waste in areas where it doesn't matter. A combination of the profiler and tight deadlines helps keep me productive and not end up with priorities that are too out of whack.
So one general piece of advice I'd offer on efficiency in general, not just spatial indexes used for collision detection, is watch that memory use. If it's explosive, chances are that the solution won't be very efficient. Of course there's a grey zone where using a bit more memory for a data structure can substantially reduce processing to the point where it's beneficial only considering speed, but a lot of times reducing the amount of memory required for data structures, especially the "hot memory" that is accessed repeatedly, can translate quite proportionally to a speed improvement. All the least efficient spatial indexes I encountered in my career were the most explosive in memory use.
It's helpful to look at the amount of data you need to store and calculate, at least roughly, how much memory it should ideally require. Then compare it to how much you actually require. If the two are worlds apart, then you'll likely get a decent boost trimming down the memory use because that'll often translate to less time loading chunks of memory from the slower forms of memory in the memory hierarchy.