Nearest neighbor search in 2D using a grid partitioning

I have a fairly large set of 2D points (~20000) in a set, and for each point in the x-y plane want to determine which point from the set is closest. (Actually, the points are of different types, and I just want to know which type is closest. And the x-y plane is a bitmap, say 640x480.)

From this answer to the question "All k nearest neighbors in 2D, C++" I got the idea to make a grid. I created n*m C++ vectors and put the points in the vector, depending on which bin it falls into. The idea is that you only have to check the distance of the points in the bin, instead of all points. If there is no point in the bin, you continue with the adjacent bins in a spiralling manner.

Unfortunately, I only read Oli Charlesworth's comment afterwards:

Not just adjacent, unfortunately (consider that points in the cell two to the east may be closer than points in the cell directly north-east, for instance; this problem gets much worse in higher dimensions). Also, what if the neighbouring cells happen to have less than 10 points in them? In practice, you will need to "spiral out".

Fortunately, I already had the spiraling code figured out (a nice C++ version here, and there are other versions in the same question). But I'm still left with the problem:

If I find a hit in a cell, there could be a closer hit in an adjacent cell (yellow is my probe, red is the wrong choice, green the actual closest point):
If I find a hit in an adjacent cell, there could be a hit in a cell 2 steps away, as Oli Charlesworth remarked:
But even worse, if I find a hit in a cell two steps away, there could still be a closer hit in a hit three steps away! That means I'd have to consider all cells with dx,dy= -3...3, or 49 cells!

Now, in practice this won't happen often, because I can choose my bin size so the cells are filled enough. Still, I'd like to have a correct result, without iterating over all points.

So how do I find out when to stop "spiralling" or searching? I heard there is an approach with multiple overlapping grids, but I didn't quite understand it. Is it possible to salvage this grid technique?

Since the dimensions of your bitmap are not large and you want to calculate the closest point for every (x,y), you can use dynamic programming.

Let V[i][j] be the distance from (i,j) to the closest point in the set, but considering only the points in the set that are in the "rectangle" [(1, 1), (i, j)].

Then V[i][j] = 0 if there is a point in (i, j), or V[i][j] = min(V[i'][j'] + dist((i, j), (i', j'))) where (i', j') is one of the three neighbours of (i,j):

i.e.

(i - 1, j)
(i, j - 1)
(i - 1, j - 1)

This gives you the minimum distance, but only for the "upper left" rectangle. We do the same for the "upper right", "lower left", and "lower right" orientations, and then take the minimum.

The complexity is O(size of the plane), which is optimal.

For you task usually a Point Quadtree is used, especially when the points are not evenly distributed.

To save main memory you als can use a PM or PMR-Quadtree which uses buckets.

You search in your cell and in worst case all quad cells surounding the cell.

You can also use a k-d tree.

One solution would be to construct multiple partitionings with different grid sizes.

Assume you create partitions at levels 1,2,4,8,..

Now, search for a point in grid size 1 (you are basically searching in 9 squares). If there is a point in the search area and if distance to that point is less than 1, stop. Otherwise move on to the next grid size.

The number of grids you need to construct is about twice as compared to creating just one level of partitioning.

A solution im trying

First make a grid such that you have an average of say 1 (more if you want larger scan) points per box.
Select the center box. Continue selecting neighbor boxes in a circular manner until you find at least one neighbor. At this point you can have 1 or 9 or so on boxes selected
Select one more layer of adjacent boxes
Now you have a fairly small list of points, usually not more than 10 which you can punch into the distance formula to find the nearest neighbor.

Since you have on average 1 points per box, you will mostly be selecting 9 boxes and comparing 9 distances. Can adjust grid size according to your dataset properties to achieve better results.

Also, if your data has a lot of variance, you can try 2 levels of grid (or even more) so if selection works and returns more than 50 points in a single query, start a next grid search with a grid 1/10th the size ...

来源：https://stackoverflow.com/questions/15820226/nearest-neighbor-search-in-2d-using-a-grid-partitioning

标签

algorithm

geometry

nearest-neighbor