Count points in a rectangle

前端 未结 9 1471
失恋的感觉
失恋的感觉 2020-12-29 23:53

I have lots (billions) of points in 2D which I can preprocess and I would like to answer queries which are of the following form:

Given all four corners of a rectang

相关标签:
9条回答
  • 2020-12-30 00:14

    I'd suggest finding a rotation+shift transformation that you can apply to your space, so that one corner of the rectangle is in (0,0) and two edges go along the x and y axes.

    Now you go through the points, apply the same transformation and just check for 0 < x < rectangle_max_x and 0 < y < rectangle_max_y.

    0 讨论(0)
  • 2020-12-30 00:15

    Old answer (if you cannot preprocess the points in advance):

    • Inscribe you rectangle in a containing rectangle with sides/edges oriented as the xy axis
    • Quickly exclude all points outside of it
    • Use the principle explained here: How to determine if a point is in a 2D triangle? with the four sides/edges of the rectangle (Note: since you are always using the same rectanglee to check all the points, you can pre-compute some of the values)

    You can maybe gain something (not much, it depends on the orientation of the rectangle) in performance by quickly including points that stays in the inscribed rectangle with sides/edges oriented as the xy axis. This requires some pre-computation, but is negligible given the fact that you have a lot of points.

    New answer:

    • rect is our input rectangle
    • Assume you have f1(point,rect) which checks if a point is inside a rectangle. You can use the one I mentioned above.
    • Assume you have f2(shape,rect), which is able to say if shape is completely contained in rect, or rect is completely contained in shape, or that shape and rect intersect or do not intersect at all
    • shape will be a polygon with a certain number of sides (not high or proportional to n, so that we can assume that f2 is O(1)), or a region in the 2D plane delimited by 2 sides and extending to infinite (e.g. the region delimited by the positive section of the xy axis)
    • Assume you have much time to preprocess the points, but not infinite. Let's say we can afford a O(n*log(n) ) algorithm

    What we want to obtain is an algorithm that at runtime calls f1 and f2 a minimum number of time. For example, something proportional to (of the same order of) log(n)

    So we want to divide our 2D plane in m shapes, each one containing p points. At runtime, we check each of the shapes with f2, and we can have 4 cases:

    1. The rectangle is completely contained in the shape: using f1 I count all the points contained in this shape that lay in the rectangle (O(p) ), and I end.
    2. The shape is completely contained in the rectangle: I add to my accumulator the whole number of points contained in the shape. (O(1) )
    3. The rectangle and the shape do not intersect: I skip this shape.
    4. The rectangle and the shape intersect: using f1 I count all the points contained in this shape that lay in the rectangle (O(p) ), and I continue.

    We can be lucky and drop quickly in case 1, but normally we will have to check all shapes, and for at least one of them we will have to check all the points contained. So this algorithm would be O(p) + O(m). Considering that p * m = n, if we chose p = m = sqrt(n) we obtain O(sqrt(n) ) which is the best we can obtain with this method. (Note: how many times do we execute f1? This number in fact depends on the shape of rectangle, so if for example the rectangle is very much elongated it will intersect with many regions, causing many calls to f1. However, I think we can assume that the measures of the rectangle are not in the same order of n, or sqrt(n) or even log(n): n is huge.)

    We could enhance from here; we could for example say that we have adjacencies among the shapes, and the first time I find an overlapping between a shape and the rectangle, I only check contiguous shapes. However, the average number of shapes we will have to check will be anyway around p/2, and O(p/2) = (O(p) ). So no effective gain.

    The real gain is if we put some hierarchy in the shapes.

    First of all, I check all my points, and find my bound values max_x, max_y, min_x, min_y. (Let's assume these boundaries are > > n. If we could have priors about the point distribution, the optimizations we could target would be completely different) We divide our space in shapes each one containing (around) log(n) points. We start by dividing the 2D plane in 4 shapes, using the xy axis (I could also center according to my bound values). This will be our first level of our upside-down pyramid. Cyclically: For each of the region that contains more than log(n) points, we split the region in half using a vertical or horizontal line (we alternate). If one boundary was set to infinite, to split in half I use the corresponding bound value. Each of the regions that was split contains a pointer to the regions in which it is split. The new regions are the second level of the pyramid. I keep dividing until all of my regions contains (about) log(n) points. When a region is split, it contains pointer to the "children" regions. I have built my pyramid. The top level of the upside-down pyramid contains n/log(n) shapes, which is pretty big, but it doesn't matter: what counts is that we have log(n) pyramid levels. Note: For each shape at each level we know how many points it contains. Note2: this pre-elaboration analyze in average each point one time per pyramid level, so its complexity is O(n*(log(n) ).

    When I get my rectangle in input, I use f2 to check the shapes at my first level.

    1. The rectangle is completely contained in the shape: I enter the children shapes of this region, if any, otherwise I use f1 to count the points inside the rectangle (O(log(n))) I discard any other shape.
    2. The shape is completely contained in the rectangle: I add to my accumulator the whole number of points contained in the shape. Takes O(1)
    3. The rectangle and the shape do not intersect: I skip this shape.
    4. The rectangle and the shape intersect: I enter the children shapes of this region, if any, otherwise I use f1 to count the points inside the rectangle (O(log(n) ).

    Now the difficult part: how many shapes do we visit? Again, this is dependent on the shape of the rectangle, how many shapes it touches. However, for each level we will visit a number of shapes not depending on n, so the number of shapes visited is proportional to O(log(n) ).

    Since n is very big we can assume that the number of shapes intersected by the rectangle sides (which cause an expensive call to f1) are far less than O(log(n) ). The whole algorithm should be O(log(n) ).

    There are further way to optimize, but anything will stay in average O(log(n) )

    Final note: The way we divide the space has to be so that the number of sides the polygon have is controlled, because if shapes could have a big number of sides, somehow dependent on the number of points (according to a function that we call g), f2 would be O(g(n) ), and its complexity would have to be multiplied again by something depending on n, the number of shapes we have to check, so probably not good.

    0 讨论(0)
  • 2020-12-30 00:25

    What you need is some sort of binary space partitioning data structure. That'll get you a list of candidates for which you could do the real "point in polygon" test.

    I'd advise you to make sure that this is something you really ought to be coding on your own. For example, many DBs have this sort of functionality built in. Does your data actually reside in a DB? Could it? (no sense in reinventing wheels...)

    You can see a great answer for the Point in Polygon problem here: How can I determine whether a 2D Point is within a Polygon?

    0 讨论(0)
  • 2020-12-30 00:28

    Make triangle. Suppose, abcd is the rectangle and x is the point then if area(abx)+area(bcx)+area(cdx)+area(dax) equals area(abcd) then the point is inside it.

    0 讨论(0)
  • 2020-12-30 00:29

    I would start with presorting the points array along any axis (let it be x). Then binary searching it for the points with x-projection in rectangles' bounding box x-projection. That along would of reduce the numbers of points to check drastically.

    Then we can filter the points further just by checking if they are in rectangles bounding box. But yes, that would be linear.

    Then we may take a transformation matrix for the rectangle (I assume we already have it). Rectangle is an affine transformation of singular 2-cube, therefore we can find reverse transformation without calculating an actual inverse matrix.

    For direct transform matrix of

    A D a
    B E b
    C F c
    

    the solution would be:

    d = 1/(AE − BD)
    
    A' = Ed
    B' = −Bd
    C' = (BF − EC)d
    
    D' = −Dd
    E' = Ad
    F' = (DC − AF)d
    
    a' = b' = 0
    c' = 1
    

    Then, by applying inverse transform to every point, we will either translate it into a singular cube, which is (0, 1)x(0, 1), if it originally lies in a rectangle, or not if it doesn't.

    UPD: Or, instead of all the transformation stuff, you can do the following:

    Let points of rectangle be P1..P4 and the point to check A.

    For i = 1..4 calculate PAi as Pi - A

    Now the cross product of (Pi.x, Pi.y, 0)x(Pj.x, Pj.y, 0) would measure the triangle made by an A and corresponding rectangles' edge. And, as original point are all on xy plane, the result would be like (0, 0, Sij), where Sij is the signed square of a triangle. Just calculate the sum:

    |(P1.x, P1.y, 0)x(P2.x, P2.y, 0)[3]|+
    |(P2.x, P2.y, 0)x(P3.x, P3.y, 0)[3]|+
    |(P3.x, P3.y, 0)x(P4.x, P4.y, 0)[3]|+
    |(P4.x, P4.y, 0)x(P1.x, P1.y, 0)[3]|
    

    And compare it with the rectangles square. If it is more or less equal, then the point is in the rectangle. There would be some small computational error, so the exact equality is out of the question.

    0 讨论(0)
  • 2020-12-30 00:32

    Represent the points as a k-d tree.

    That is, a binary tree in which every node represents one of the points and every non-leaf node can be thought of as splitting the current area either vertically or horizontally (alternatingly on each level) on that node's x or y value.

    Then, to do a query:

    1. Current node = the root

    2. Current area = the area of the current node (can be kept track of / calculated as you recurse down the tree)

    3. If current area is completely contained inside the rectangle, add the number of points this node has as children (+1 for the current node).

    4. If current area is completely outside inside the rectangle, do nothing.

    5. If current area is partially contained inside rectangle:

      • Add the current node's point if it is contained in the rectangle.
      • Repeat from 2. for both children.

    Calculating whether an area or point is contained in a rectangle should be simple enough.

    Each query should take O(log n) time on average on random data.

    Although there would exist pathological cases where it would take O(n) time (i.e. where you'll need to explore the entire / most of the tree). One possible example of such a case is when most of the points are around the edges of a (non-axis-aligned) rectangle (either inside or outside), meaning the "do nothing" part mentioned above will rarely apply.

    0 讨论(0)
提交回复
热议问题