Query by coordinates takes too long - options to optimize?

Submitted by 时间秒杀一切 on 2020-01-11 07:05:15

Question


I have a table where I store events (about 5M at the moment, but there will be more). Each event has two attributes that I care about for this query - location (latitude and longitude pair) and relevancy.

My goal is: For a given location bounds (SW / NE latitude/longitude pairs, so 4 floating point numbers) return the top 100 events by relevancy which fall within those bounds.

I'm currently using the following query:

select * 
from event 
where latitude >= :swLatitude 
and latitude <= :neLatitude 
and longitude >= :swLongitude 
and longitude <= :neLongitude 
order by relevancy desc 
limit 100

Let's put aside for the moment the issue of date-line wrap-around which this query doesn't handle.

This works fine for smaller location bounds, but lags rather badly whenever I try to use larger location bounds.

I've defined the following index:

CREATE INDEX latitude_longitude_relevancy_index
  ON event
  USING btree
  (latitude, longitude, relevancy);

The table itself is quite straightforward:

CREATE TABLE event
(
  id uuid NOT NULL,
  relevancy double precision NOT NULL,
  data text,
  latitude double precision NOT NULL,
  longitude double precision NOT NULL,
  CONSTRAINT event_pkey PRIMARY KEY (id)
);

I tried explain analyze and got the following, which I think means the index isn't even used:

"Limit  (cost=1045499.02..1045499.27 rows=100 width=1249) (actual time=14842.560..14842.575 rows=100 loops=1)"
"  ->  Sort  (cost=1045499.02..1050710.90 rows=2084754 width=1249) (actual time=14842.557..14842.562 rows=100 loops=1)"
"        Sort Key: relevancy"
"        Sort Method: top-N heapsort  Memory: 351kB"
"        ->  Seq Scan on event  (cost=0.00..965821.22 rows=2084754 width=1249) (actual time=3090.660..12525.695 rows=1983213 loops=1)"
"              Filter: ((latitude >= 0::double precision) AND (latitude <= 180::double precision) AND (longitude >= 0::double precision) AND (longitude <= 180::double precision))"
"              Rows Removed by Filter: 3334584"
"Total runtime: 14866.532 ms"

I'm using PostgreSQL 9.3 on Win7 and it seems like overkill to move to anything else for this seemingly simple task.

Questions:

  • Any ways to use different indexes to help the current query be faster?
  • Any ways to rewrite the current query to be faster?
  • What's the easiest way to do this right? Install PostGIS and use the GEOGRAPHY data type? Will that actually provide a performance benefit over what I'm doing now? Which PostGIS function would work best for this query?

Edit #1: Results for vacuum full analyze:

INFO:  vacuuming "public.event"
INFO:  "event": found 0 removable, 5397347 nonremovable row versions in 872213 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 17.73s/11.84u sec elapsed 154.24 sec.
INFO:  analyzing "public.event"
INFO:  "event": scanned 30000 of 872213 pages, containing 185640 live rows and 0 dead     rows; 30000 rows in sample, 5397344 estimated total rows
Total query runtime: 360092 ms.

Results after the vacuum:

"Limit  (cost=1058294.92..1058295.17 rows=100 width=1216) (actual time=6784.111..6784.121 rows=100 loops=1)"
"  ->  Sort  (cost=1058294.92..1063405.89 rows=2044388 width=1216) (actual time=6784.109..6784.113 rows=100 loops=1)"
"        Sort Key: relevancy"
"        Sort Method: top-N heapsort  Memory: 203kB"
"        ->  Seq Scan on event  (cost=0.00..980159.88 rows=2044388 width=1216) (actual time=0.043..6412.570 rows=1983213 loops=1)"
"              Filter: ((latitude >= 0::double precision) AND (latitude <= 180::double precision) AND (longitude >= 0::double precision) AND (longitude <= 180::double precision))"
"              Rows Removed by Filter: 3414134"
"Total runtime: 6784.170 ms"

Answer 1:


You will be much better off using a spatial index, which uses an R-tree (essentially a two-dimensional index that operates by dividing the space into boxes) and will perform far better than greater-than/less-than comparisons on two separate lat/lon values for this kind of query. You will need to create a geometry column first, though, which you then index and use in your query in place of the separate lat/lon pairs you are currently using.

The following will create a geometry column, populate it, and add an index to it, ensuring that it is a point in lat/lon coordinates (EPSG:4326):

alter table event add column geom geometry(POINT, 4326);
update event set geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
create index ix_spatial_event_geom on event using gist(geom);

Then you can run the following query to get your events; the spatial intersection should utilize your spatial index:

select *
from event
where ST_Intersects(
        ST_SetSRID(ST_MakeBox2D(ST_MakePoint(:swLongitude, :swLatitude),
                                ST_MakePoint(:neLongitude, :neLatitude)), 4326),
        geom)
order by relevancy desc
limit 100;

You build the bounding box for the intersection by passing ST_MakeBox2D two points on diagonal corners of the box, so either the SW/NE or the NW/SE pair will work.

When you run explain on this, you should find that the spatial index is used. It will perform vastly better than two separate indexes on the lon and lat columns, as the query hits a single index optimized for spatial search rather than two B-trees. I realize that this represents another way of doing it and only answers your original question indirectly.

EDIT: Mike T has made the very good point that for bounding-box searches in 4326, it is more appropriate and quicker to use the geometry data type and the && operator, as the SRID will be ignored anyway, e.g.:

 where ST_MakeBox2D(ST_MakePoint(:swLongitude, :swLatitude), ST_MakePoint(:neLongitude, :neLatitude)) && geom
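
Combined into a full query (a sketch reusing the question's parameter names):

select *
from event
where ST_MakeBox2D(ST_MakePoint(:swLongitude, :swLatitude),
                   ST_MakePoint(:neLongitude, :neLatitude)) && geom
order by relevancy desc
limit 100;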


Source: https://stackoverflow.com/questions/24212355/query-by-coordinates-takes-too-long-options-to-optimize
