I have the following query which fetches the id
of the latest N observations
for each station
:
SELECT id
FROM (
SELE
This is a good anwser only if you are not required to query up-to-date live data.
Preparation (requires postgresql 9.3)
drop materialized view test;
create materialized view test as select * from (
SELECT station_id, id, created_at,
row_number() OVER(
PARTITION BY station_id
ORDER BY created_at DESC
) as rn
FROM (
SELECT
station_id,
id,
created_at
FROM observations
) s
) q WHERE q.rn <= 100 -- use a value that will be your max limit number for further queries
ORDER BY station_id, rn DESC ;
create index idx_test on test(station_id,rn,created_at);
How to query data:
select * from test where rn<10 order by station_id,created_at;
Your original query was 281 ms on my machine and this new was 15 ms.
How to update the view with fresh data:
refresh materialized view test;
I have another solution that does not require materialized view and works with live, up-to-date data. But given that you don't need up-to-date data, this materialized view is much more efficient.
Assuming at least Postgres 9.3.
First, a multicolumn index will help:
CREATE INDEX observations_special_idx
ON observations(station_id, created_at DESC, id)
created_at DESC
is a slightly better fit, but the index would still be scanned backwards at almost the same speed without DESC
.
Assuming created_at
is defined NOT NULL
, else consider DESC NULLS LAST
in index and query:
The last column id
is only useful if you get an index-only scan out of it, which probably won't work if you add lots of new rows constantly. In this case, remove id
from the index.
Simplify your query, the inner subselect doesn't help:
SELECT id
FROM (
SELECT station_id, id, created_at
, row_number() OVER (PARTITION BY station_id
ORDER BY created_at DESC) AS rn
FROM observations
) s
WHERE rn <= #{n} -- your limit here
ORDER BY station_id, created_at DESC;
Should be a bit faster, but still slow.
station_id
id defined as NOT NULL
.To be really fast, you need the equivalent of a loose index scan (not implemented in Postgres, yet). Related answer:
If you have a separate table of stations
(which seems likely), you can emulate this with JOIN LATERAL (Postgres 9.3+):
SELECT o.id
FROM stations s
CROSS JOIN LATERAL (
SELECT o.id
FROM observations o
WHERE o.station_id = s.station_id -- lateral reference
ORDER BY o.created_at DESC
LIMIT #{n} -- your limit here
) o
ORDER BY s.station_id, o.created_at DESC;
If you don't have a table of stations
, the next best thing would be to create and maintain one. Possibly add a foreign key reference to enforce relational integrity.
If that's not an option, you can distill such a table on the fly. Simple options would be:
SELECT DISTINCT station_id FROM observations;
SELECT station_id FROM observations GROUP BY 1;
But either would need a sequential scan and be slow. Make Postgres use above index (or any btree index with station_id
as leading column) with a recursive CTE:
WITH RECURSIVE stations AS (
( -- extra pair of parentheses ...
SELECT station_id
FROM observations
ORDER BY station_id
LIMIT 1
) -- ... is required!
UNION ALL
SELECT (SELECT o.station_id
FROM observations o
WHERE o.station_id > s.station_id
ORDER BY o.station_id
LIMIT 1)
FROM stations s
WHERE s.station_id IS NOT NULL -- serves as break condition
)
SELECT station_id
FROM stations
WHERE station_id IS NOT NULL; -- remove dangling row with NULL
Use that as drop-in replacement for the stations
table in the above simple query:
WITH RECURSIVE stations AS (
(
SELECT station_id
FROM observations
ORDER BY station_id
LIMIT 1
)
UNION ALL
SELECT (SELECT o.station_id
FROM observations o
WHERE o.station_id > s.station_id
ORDER BY o.station_id
LIMIT 1)
FROM stations s
WHERE s.station_id IS NOT NULL
)
SELECT o.id
FROM stations s
CROSS JOIN LATERAL (
SELECT o.id, o.created_at
FROM observations o
WHERE o.station_id = s.station_id
ORDER BY o.created_at DESC
LIMIT #{n} -- your limit here
) o
WHERE s.station_id IS NOT NULL
ORDER BY s.station_id, o.created_at DESC;
This should still be faster than what you had by orders of magnitude.
SQL Fiddle here (9.6)
db<>fiddle here