Another Why Is This Nearest Neighbor Spatial Query So Slow?

Question

Following this recommendation for an optimized nearest neighbor update, I'm using the T-SQL below to update a GPS table of 11,000 points with the nearest point of interest to each point.

WHILE (2 > 1)
  BEGIN
    BEGIN TRANSACTION
    UPDATE TOP ( 100 ) s
    SET
      [NEAR_SHELTER] = fname,
      [DIST_SHELTER] = Shape.STDistance(fShape)
    FROM (
      SELECT
        [dbo].[GRSM_GPS_COLLAR].*,
        fnc.NAME AS fname,
        fnc.Shape AS fShape
      FROM
        [dbo].[GRSM_GPS_COLLAR]
        CROSS APPLY (SELECT TOP 1 NAME, Shape
                     FROM [dbo].[BACK_COUNTRY_SHELTERS] WITH (INDEX ([S50_idx]))
                     WHERE [BACK_COUNTRY_SHELTERS].Shape.STDistance([dbo].[GRSM_GPS_COLLAR].Shape) IS NOT NULL
                     ORDER BY BACK_COUNTRY_SHELTERS.Shape.STDistance([dbo].[GRSM_GPS_COLLAR].Shape) ASC) fnc) s;

    IF @@ROWCOUNT = 0
      BEGIN
        COMMIT TRANSACTION
        BREAK
      END
    COMMIT TRANSACTION
    -- 1 second delay
    WAITFOR DELAY '00:00:01'
  END -- WHILE
GO

Note that I'm doing it in chunks of 100 to avoid locking; if I don't chunk it up, the update locks the table and runs for hours before I have to kill it. The obvious question is "Have you optimized your spatial indexes?" and the answer is yes: both tables have a spatial index (SQL Server 2012, Geography Auto Grid, 4092 cells per object), which I found to be the most efficient index after many days of testing every possible permutation of index parameters. I have tried this with and without the spatial index hint, and with multiple spatial indexes.
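For reference, an index with the settings described above would be declared roughly as follows. This is only a sketch: the name S50_idx is taken from the index hint in the query, and pairing that name with these exact settings is an assumption. The table also needs a clustered primary key before a spatial index can be created on it.

```sql
-- Sketch only: assumes [S50_idx] is the auto-grid index described above
-- and that [Shape] is a geography column.
CREATE SPATIAL INDEX [S50_idx]
ON [dbo].[BACK_COUNTRY_SHELTERS] ([Shape])
USING GEOGRAPHY_AUTO_GRID
WITH (CELLS_PER_OBJECT = 4092);
```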

In the execution plan, note the spatial index seek cost and the warning about no column statistics, which I understand is expected with spatial indexes. In each case I eventually have to terminate the T-SQL. It just runs forever (in one case overnight, with 2,300 rows updated).

I've tried Isaac's numbers table join solution, but that example doesn't appear to lend itself to looping through n distance searches, just a single user-supplied location (@x).

Update

@Brad D, based on your answer I tried the following, but I get syntax errors that I can't quite figure out. I'm not sure I'm converting your example to mine correctly. Any ideas what I'm doing wrong? Thanks!

;WITH Points as(
SELECT TOP 100 [NAME], [Shape] as GeoPoint
FROM [BACK_COUNTRY_SHELTERS]
WHERE 1=1 


SELECT P1.*, CP.[GPS_POS_NUMBER] as DestinationName, CP.Dist
INTO #tmp_Distance
FROM [GRSM_GPS_COLLAR] P1
CROSS APPLY (SELECT [NAME] ,    Shape.STDistance(P1.GeoPoint)/1609.344 as     Dist
FROM [BACK_COUNTRY_SHELTERS] as P2
WHERE 1=1 
AND P1.[NAME] <> P2.[NAME] --Don't compare it to itself

) as CP

CREATE CLUSTERED INDEX tmpIX ON #tmp_Distance (name, Dist)


SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Dist ASC) as Rnk FROM #tmp_Distance) as tbl1
WHERE rnk = 1
DROP TABLE #tmp_Distance

Answer 1:

You're essentially comparing 121 million data points (11K origins to 11K destinations), and that isn't going to scale well if you try to do it all in one fell swoop. I like your idea of breaking it into batches, but ordering a result set of 1.1 million records (100 origins × 11K destinations per batch) without an index could be painful.

I suggest breaking this out into a few more operations. I just tried the following, and it runs in under a minute per batch in my environment (5,500 location records).

This worked for me without a geospatial index, using instead a clustered index on the origin and the distance to the destination.

;WITH Points AS (
    -- One batch of 100 origin points
    SELECT TOP 100 Name, AddressLine1,
        AddressLatitude, AddressLongitude
        -- geography::Point takes (latitude, longitude, SRID);
        -- a WKT 'POINT(x y)' string would need longitude first
        , geography::Point(AddressLatitude, AddressLongitude, 4326) AS GeoPoint
    FROM ServiceFacility
    WHERE 1=1
    AND AddressLatitude BETWEEN -90 AND 90
    AND AddressLongitude BETWEEN -180 AND 180)

-- Distance from every origin in the batch to every other facility
SELECT P1.*, CP.Name AS DestinationName, CP.Dist
INTO #tmp_Distance
FROM Points P1
CROSS APPLY (SELECT Name, AlternateName,
    -- STDistance returns metres for SRID 4326; dividing by 1609.344 gives miles
    geography::Point(P2.AddressLatitude, P2.AddressLongitude, 4326).STDistance(P1.GeoPoint)/1609.344 AS Dist
FROM ServiceFacility AS P2
WHERE 1=1
AND P1.Name <> P2.Name --Don't compare it to itself
AND P2.AddressLatitude BETWEEN -90 AND 90
AND P2.AddressLongitude BETWEEN -180 AND 180
) AS CP

-- Index on (origin, distance) so the ranking below doesn't need a separate sort
CREATE CLUSTERED INDEX tmpIX ON #tmp_Distance (Name, Dist)

-- Keep only the nearest destination for each origin
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Dist ASC) AS Rnk FROM #tmp_Distance) AS tbl1
WHERE Rnk = 1

DROP TABLE #tmp_Distance

The actual update on 100 records, or even 11,000 records, shouldn't take too long. Spatial indexes are cool, but unless I'm missing something, I don't see a hard requirement for one in this particular exercise.
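Adapted to the tables in the question, the approach above might look roughly like the sketch below. It is not tested against the asker's schema. It assumes that GPS_POS_NUMBER uniquely identifies a row in GRSM_GPS_COLLAR, that NEAR_SHELTER is NULL for rows not yet processed, and that both Shape columns are geography values with the same SRID. The temp table name #tmp_Points is made up for illustration.

```sql
-- Sketch only: GPS_POS_NUMBER as a unique key and NEAR_SHELTER IS NULL as the
-- "not yet processed" marker are assumptions, not facts from the question.

-- 1. Take one batch of unprocessed GPS points.
SELECT TOP (100) g.GPS_POS_NUMBER, g.Shape AS GeoPoint
INTO #tmp_Points
FROM dbo.GRSM_GPS_COLLAR AS g
WHERE g.NEAR_SHELTER IS NULL
  AND g.Shape IS NOT NULL;

-- 2. Distance from every point in the batch to every shelter.
--    No "don't compare to itself" filter is needed: the two tables differ.
SELECT p.GPS_POS_NUMBER, cp.NAME AS DestinationName, cp.Dist
INTO #tmp_Distance
FROM #tmp_Points AS p
CROSS APPLY (
    SELECT s.NAME, s.Shape.STDistance(p.GeoPoint) AS Dist
    FROM dbo.BACK_COUNTRY_SHELTERS AS s
    -- STDistance returns NULL for mismatched SRIDs; NULLs would sort first
    WHERE s.Shape.STDistance(p.GeoPoint) IS NOT NULL
) AS cp;

CREATE CLUSTERED INDEX tmpIX ON #tmp_Distance (GPS_POS_NUMBER, Dist);

-- 3. Keep the nearest shelter per point and write it back.
WITH Ranked AS (
    SELECT GPS_POS_NUMBER, DestinationName, Dist,
           ROW_NUMBER() OVER (PARTITION BY GPS_POS_NUMBER ORDER BY Dist ASC) AS Rnk
    FROM #tmp_Distance
)
UPDATE g
SET g.NEAR_SHELTER = r.DestinationName,
    g.DIST_SHELTER = r.Dist
FROM dbo.GRSM_GPS_COLLAR AS g
JOIN Ranked AS r ON r.GPS_POS_NUMBER = g.GPS_POS_NUMBER
WHERE r.Rnk = 1;

DROP TABLE #tmp_Points;
DROP TABLE #tmp_Distance;
```

The NEAR_SHELTER IS NULL filter in step 1 is what would let a surrounding batch loop terminate. As far as I can tell, the loop in the question has no such filter, so each pass can pick up the same 100 rows again and @@ROWCOUNT never reaches 0.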

Answer 2:

You should redesign the process, not just tune indexes. Create a copy of the table with only the columns you need. (With much larger tables, you can work in batches of several thousand.) Then, on the side table, set the "closest" point for each row. Then run a loop that updates the main table in batches of under 5,000 rows (so as not to trigger lock escalation to a table lock), using the clustered index for the join. This is usually much faster and safer than running large-scale updates on active tables. On the side table, add a "handled" column that the loop sets as it updates the main table, plus an index on the handled column and the main table's clustered index columns, to prevent unneeded sorting when joining to the main table.
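A minimal sketch of that side-table process, using the tables from the question, could look like this. The side table name dbo.GPS_COLLAR_WORK, the index name, the batch size of 4,000, and the use of GPS_POS_NUMBER (typed int here) as the main table's clustered key are all assumptions made for illustration.

```sql
-- Sketch only: dbo.GPS_COLLAR_WORK, the handled column, and GPS_POS_NUMBER
-- as an int clustered key are assumptions.

-- 1. Side table with just the columns needed, plus a "handled" flag.
SELECT g.GPS_POS_NUMBER, g.Shape, g.NEAR_SHELTER, g.DIST_SHELTER,
       CAST(0 AS bit) AS handled
INTO dbo.GPS_COLLAR_WORK
FROM dbo.GRSM_GPS_COLLAR AS g;

CREATE INDEX IX_GPS_COLLAR_WORK_handled
    ON dbo.GPS_COLLAR_WORK (handled, GPS_POS_NUMBER);

-- 2. Compute NEAR_SHELTER / DIST_SHELTER on dbo.GPS_COLLAR_WORK here
--    (not shown; any nearest-neighbor query can be used).

-- 3. Copy the results to the main table in batches of under 5,000 rows.
DECLARE @batch TABLE (GPS_POS_NUMBER int PRIMARY KEY);

WHILE 1 = 1
BEGIN
    DELETE FROM @batch;

    -- Pick the next unhandled keys in clustered-key order.
    INSERT INTO @batch (GPS_POS_NUMBER)
    SELECT TOP (4000) w.GPS_POS_NUMBER
    FROM dbo.GPS_COLLAR_WORK AS w
    WHERE w.handled = 0
    ORDER BY w.GPS_POS_NUMBER;

    IF @@ROWCOUNT = 0 BREAK;  -- nothing left to copy

    BEGIN TRANSACTION;

    UPDATE g
    SET g.NEAR_SHELTER = w.NEAR_SHELTER,
        g.DIST_SHELTER = w.DIST_SHELTER
    FROM dbo.GRSM_GPS_COLLAR AS g
    JOIN dbo.GPS_COLLAR_WORK AS w ON w.GPS_POS_NUMBER = g.GPS_POS_NUMBER
    JOIN @batch AS b ON b.GPS_POS_NUMBER = w.GPS_POS_NUMBER;

    -- Mark the batch as handled so the next pass skips it.
    UPDATE w
    SET w.handled = 1
    FROM dbo.GPS_COLLAR_WORK AS w
    JOIN @batch AS b ON b.GPS_POS_NUMBER = w.GPS_POS_NUMBER;

    COMMIT TRANSACTION;
END
```

Because each pass only selects rows with handled = 0 and then flags them, the loop makes progress on every iteration and exits once the side table is fully copied.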

Source: https://stackoverflow.com/questions/32157502/another-why-is-this-nearest-neighbor-spatial-query-so-slow

Tags

sql

sql-server

tsql

nearest-neighbor

spatial-index