问题
I want to write a query using Postgres and PostGIS. I'm also using Rails with rgeo
, rgeo-activerecord
and activerecord-postgis-adapter
, but the Rails stuff is rather unimportant.
The table structure:
measurement
- int id
- int anchor_id
- Point groundtruth
- data (not important for the query)
Example data:
id | anchor_id | groundtruth | data
-----------------------------------
1 | 1 | POINT(1 4) | ...
2 | 3 | POINT(1 4) | ...
3 | 2 | POINT(1 4) | ...
4 | 3 | POINT(1 4) | ...
-----------------------------------
5 | 2 | POINT(3 2) | ...
6 | 4 | POINT(3 2) | ...
-----------------------------------
7 | 1 | POINT(4 3) | ...
8 | 1 | POINT(4 3) | ...
9 | 1 | POINT(4 3) | ...
10 | 5 | POINT(4 3) | ...
11 | 3 | POINT(4 3) | ...
This table is some kind of manually created view
for faster lookups (with millions of rows). Else we'd have to join 8 tables and it would get even slower. But that's not part of the problem.
Simple version:
Parameters:
- Point
p
- int
d
What the query should do:
1. The query looks for all groundtruth
Points which have a distance < d
from Point p
SQL for that is pretty easy: WHERE st_distance(groundtruth, p) < d
2. Now we have a list of groundtruth
points with their anchor_id
s. As you can see in the table above, it is possible to have multiple identical groundtruth-anchor_id tuples. For example: anchor_id=3
and groundtruth=POINT(1 4)
.
3. Next I'd like to eliminate the identical tuples, by choosing one of them randomly(!). Why not simply take the first? Because the data
column is different.
Choosing a random row in SQL: SELECT ... ORDER BY RANDOM() LIMIT 1
My problem with all of this is: I can imagine a solution using SQL LOOP
s and lot's of subqueries, but there's for sure a solution using GROUP BY
or some other methods which will make it faster.
Full version:
Basically the same as above with one difference: The input parameters change:
- lot's of Points
p1
...p312456345
- still one
d
If the simple query is working, this could be done using a LOOP
in SQL. But maybe there is a better (and faster) solution, because the database is really huge!
Solution
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT DISTINCT ON (anchor_id, groundtruth)
*
FROM measurement m, ps
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(m.groundtruth, ps.p) < d
)
ORDER BY anchor_id, groundtruth, random();
Thanks to Erwin Brandstetter!
回答1:
To eliminate duplicates, this might be the most efficient query in PostgreSQL:
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement
WHERE st_distance(p, groundtruth) < d
More about this query style:
- Select first row in each GROUP BY group?
As mentioned in the comments this gives you an arbitrary pick. If you need random, somewhat more expensive:
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement
WHERE st_distance(p, groundtruth) < d
ORDER BY anchor_id, groundtruth, random()
The second part is harder to optimize. EXISTS semi-join will probably be the fastest choice. For a given table ps (p point)
:
SELECT DISTINCT ON (anchor_id, groundtruth) *
FROM measurement m
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(ps.p, m.groundtruth) < d
)
ORDER BY anchor_id, groundtruth, random();
This can stop evaluating as soon as one p
is close enough and it keeps the rest of the query simple.
Be sure to back that up with a matching GiST index.
If you have an array as input, create a CTE with unnest() on the fly:
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT ...
Update according to comment
If you only need a single row as answer, you can simplify:
WITH ps AS (SELECT unnest(p_array) AS p)
SELECT *
FROM measurement m
WHERE EXISTS (
SELECT 1
FROM ps
WHERE st_distance(ps.p, m.groundtruth) < d
)
LIMIT 1;
Faster with ST_DWithin()
Probably more efficient with the function ST_DWithin() (and a matching GiST index!).
To get one row (using a sub-select instead of a CTE here):
SELECT *
FROM measurement m
JOIN (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
LIMIT 1;
To get one row for every point p
within distance d
:
SELECT DISTINCT ON (ps.p) *
FROM measurement m
JOIN (SELECT unnest(p_array) AS p) ps ON ST_DWithin(ps.p, m.groundtruth, d)
Adding ORDER BY random()
will make this query more expensive. Without random()
, Postgres can just pick the first matching row from the GiST index. Else all possible matches have to be retrieved and ordered randomly.
BTW, LIMIT 1
inside EXISTS
is pointless. Read the manual at the link I provided or this related question.
回答2:
I now cracked it, but the query is pretty slow...
WITH
ps AS (
SELECT unnest(p_array)
) AS p
),
gtps AS (
SELECT DISTINCT ON(ps.p)
ps.p, m.groundtruth
FROM measurement m, ps
WHERE st_distance(m.groundtruth, ps.p) < d
ORDER BY ps.p, RANDOM()
)
SELECT DISTINCT ON(gtps.p, gtps.groundtruth, m.anchor_id)
m.id, m.anchor_id, gtps.groundtruth, gtps.p
FROM measurement m, gtps
ORDER BY gtps.p, gtps.groundtruth, m.anchor_id, RANDOM()
My test database contains 22000 rows and I gave it two input values and it takes about 700ms. At the end there can be hundreds of input values :-/
The result now looks like this:
id | anchor_id | groundtruth | p
-----------------------------------------
20 | 1 | POINT(0 2) | POINT(1 0)
14 | 3 | POINT(0 2) | POINT(1 0)
5 | 8 | POINT(0 2) | POINT(1 0)
42 | 2 | POINT(4 1) | POINT(2 2)
11 | 3 | POINT(4 8) | POINT(4 8)
4 | 6 | POINT(4 8) | POINT(4 8)
1 | 1 | POINT(6 2) | POINT(7 3)
9 | 5 | POINT(6 2) | POINT(7 3)
25 | 3 | POINT(6 2) | POINT(9 1)
13 | 6 | POINT(6 2) | POINT(9 1)
18 | 7 | POINT(6 2) | POINT(9 1)
NEW:
SELECT
m.groundtruth, ps.p, ARRAY_AGG(m.anchor_id), ARRAY_AGG(m.id)
FROM
measurement m
JOIN
(SELECT unnest(point_array) AS p) AS ps
ON ST_DWithin(ps.p, m.groundtruth, 0.5)
GROUP BY groundtruth, ps.p
Actual result:
p | groundtruth | anchor_arr | id_arr
--------------------------------------------------
P1 | G1 | {1,3,2,..} | {9,8,11,..}
P1 | G2 | {4,3,5,..} | {1,8,23,..}
P1 | G3 | {6,8,9,..} | {12,7,6,..}
P2 | G1 | {6,6,2,..} | {15,2,10,..}
P2 | G4 | {7,9,1,..} | {5,4,3,..}
... | ... | ... | ...
So at the moment I get:
- every distinct inputValue-groundtruth-tuple
- for every tuple I get an array with all
anchor_id
corresponding thegroundtruth
part of the tuple - and an array of all
id
s corresponding thegroundtruth
-anchor_id
relation
Remember:
- two input values can 'select' the same
groundtruth
- a single
groundtruth
value can have multiple identicalanchor_id
s - each
groundtruth
-anchor_id
-tuple has a distinctid
So what's missing for completion?:
- I just need a random row for each
ps.p
- The two arrays belong to each other. Means: the order of the items inside is important!
- Those two arrays need to get filtered (randomly):
- For each
anchor_id
in the array that appears more than once: keep a random one and delete all other. This also means to remove the correspondingid
from theid
-array for every deletedanchor_id
- For each
来源:https://stackoverflow.com/questions/15091363/select-a-random-entry-from-a-group-after-grouping-by-a-value-not-column