How do I take a DISTINCT ON subquery that is ordered by a separate column, and make it fast?

问题

(AKA - With a query and data very similar to question "Selecting rows ordered by some column and distinct on another", how can I get it to run fast). Postgres 11.

I have table prediction with (article_id, prediction_date, predicted_as, article_published_date) that represents the output from a classifier over a set of articles.

New articles are frequently added to a separate table (Represented by the FK article_id), and new predictions are added as we tune our classifier.

Sample data:

| id      | article_id |  predicted_as | prediction_date | article_published_date
| 1009381 | 362718     |  negative     | 2018-07-27      | 2018-06-26
| 1009382 | 362718     |  positive     | 2018-08-12      | 2018-06-26
| 1009383 | 362719     |  positive     | 2018-08-13      | 2010-09-22
| 1009384 | 362719     |  positive     | 2018-09-28      | 2010-09-22
| 1009385 | 362719     |  negative     | 2018-10-01      | 2010-09-22

Create table script:

create table prediction
(
    id serial not null
        constraint prediction_pkey
            primary key,
    article_id integer not null
        constraint prediction_article_id_fkey
            references article,
    predicted_as classifiedas not null,
    prediction_date date not null,
    article_published_date date not null
);

create index prediction_article_id_prediction_date_idx
    on prediction (article_id asc, prediction_date desc);

We frequently want to view the most recent classification for each article. In order to do so we use:

SELECT DISTINCT ON (article_id) article_id, id, article_published_date
FROM prediction
ORDER BY article_id, prediction_date desc

which returns something like:

| id     | article_id |  predicted_as | prediction_date | article_published_date
| 120950 | 1          | negative      | 2018-06-29      | 2018-03-25
| 120951 | 2          | negative      | 2018-06-29      | 2018-03-19

With an index on (article_id, prediciton_date desc), this query runs very quickly (~15ms). This is the explain plan:

Unique  (cost=0.56..775374.53 rows=1058394 width=20)
  ->  Index Scan using prediction_article_id_prediction_date_id_idx on prediction  (cost=0.56..756071.98 rows=7721023 width=20)

So far so good.

The problem occurs when I want to sort this result by the article_published_field. E.g:

explain (analyze, buffers)
select *
  from (
         select distinct on (article_id) article_id, id, article_published_date
         from prediction
         order by article_id, prediction_date desc
       ) most_recent_predictions
  order by article_published_date desc
  limit 3;

This works, but the query takes ~3-4 seconds to run, making it too slow to use directly to respond to a web request.

Here is the explain plan:

Limit  (cost=558262.52..558262.53 rows=3 width=12) (actual time=4748.977..4748.979 rows=3 loops=1)
  Buffers: shared hit=7621849 read=9051
  ->  Sort  (cost=558262.52..560851.50 rows=1035593 width=12) (actual time=4748.975..4748.976 rows=3 loops=1)
        Sort Key: most_recent_predictions.article_published_date DESC
        Sort Method: top-N heapsort  Memory: 25kB
        Buffers: shared hit=7621849 read=9051
        ->  Subquery Scan on most_recent_predictions  (cost=0.43..544877.67 rows=1035593 width=12) (actual time=0.092..4508.464 rows=1670807 loops=1)
              Buffers: shared hit=7621849 read=9051
              ->  Result  (cost=0.43..534521.74 rows=1035593 width=16) (actual time=0.092..4312.916 rows=1670807 loops=1)
                    Buffers: shared hit=7621849 read=9051
                    ->  Unique  (cost=0.43..534521.74 rows=1035593 width=16) (actual time=0.090..4056.644 rows=1670807 loops=1)
                          Buffers: shared hit=7621849 read=9051
                          ->  Index Scan using prediction_article_id_prediction_date_idx on prediction  (cost=0.43..515295.09 rows=7690662 width=16) (actual time=0.089..3248.250 rows=7690662 loops=1)
                                Buffers: shared hit=7621849 read=9051
Planning Time: 0.130 ms
Execution Time: 4749.007 ms

Is there any way to make this query run more quickly, or will I have to resort to refreshing a materialized view or setting up a trigger system to get this data quickly?

For reference:

the prediction table has 7.7M rows
there are 1.7M distinct article_ids in the prediction table
there is an index on (article_id, prediciton_date desc) as well as one on article_published_date desc
VACUUM ANALYSE has been run

回答1:

I wonder if you can make this work:

select article_id, id, article_published_date
from prediction p
where p.prediction_date = (select max(p2.prediction_date)
                           from prediction p2
                           where p2.article_id = p.article_id
                          )
order by article_published_date desc;

Then use these two indexes:

(article_published_date desc, prediction_date, article_id, id)
(article_id, prediction_date desc).

回答2:

One thing that you could try is to use window function ROW_NUMBER() OVER(...) instead of DISTINCT ON() (which implies constraints on the ORDER BY clause). This method is functionaly equivalent to your second query, and might be able to take advantage of exising indexes:

SELECT *
FROM (
    SELECT 
        article_id, 
        id, 
        article_published_date,
        ROW_NUMBER() OVER(PARTITION BY article_id ORDER BY prediction_date DESC) rn
    FROM prediction 
) x WHERE rn = 1
ORDER BY article_published_date DESC
LIMIT 3;

Demo on DB Fiddle.

回答3:

While you just want a trivially small number of result rows (LIMIT 3 in your example), and if there is any positive correlation between article_published_date and prediction_date, this query should be radically faster as it only has to scan few tuples from the top of the added index (and recheck with the 2nd index):

Have these two indexes:

CREATE INDEX ON prediction (article_published_date DESC, prediction_date DESC, article_id DESC);

CREATE INDEX ON prediction (article_id, prediction_date DESC);

Recursive Query:

WITH RECURSIVE cte AS (
   (
   SELECT p.article_published_date, p.article_id, p.prediction_date, ARRAY[p.article_id] AS a_ids
   FROM   prediction p
   WHERE  NOT EXISTS (  -- no later row for same article
      SELECT FROM prediction
      WHERE  article_id = p.article_id
      AND    prediction_date > p.prediction_date
      )
   ORDER  BY p.article_published_date DESC, p.prediction_date DESC, p.article_id DESC
   LIMIT  1
   )
   UNION ALL
   SELECT p.article_published_date, p.article_id, p.prediction_date, a_ids || p.article_id
   FROM   cte c, LATERAL (
      SELECT p.article_published_date, p.article_id, p.prediction_date
      FROM   prediction p
      WHERE (p.article_published_date, p.prediction_date, p.article_id)
          < (c.article_published_date, c.prediction_date, c.article_id)
      AND    p.article_id <> ALL(a_ids)   -- different article
      AND    NOT EXISTS (                 -- no later row for same article
         SELECT FROM prediction
         WHERE  article_id = p.article_id
         AND    prediction_date > p.prediction_date
         )
      ORDER  BY p.article_published_date DESC, p.prediction_date DESC, p.article_id DESC
      LIMIT  1
      ) p
   )
SELECT article_published_date, article_id, prediction_date
FROM   cte
LIMIT  3;

Here is a plpgsql solution doing the same, probably slightly faster:

CREATE OR REPLACE FUNCTION f_top_n_predictions(_n int = 3)
  RETURNS TABLE (_article_published_date date, _article_id int, _prediction_date date) AS
$func$
DECLARE
   a_ids int[];
BEGIN
   FOR _article_published_date, _article_id, _prediction_date IN
      SELECT article_published_date, article_id, prediction_date
      FROM   prediction
      ORDER  BY article_published_date DESC, prediction_date DESC, article_id DESC
   LOOP
      IF _article_id = ANY(a_ids)
      OR EXISTS (SELECT FROM prediction p
                 WHERE  p.article_id = _article_id
                 AND    p.prediction_date > _prediction_date) THEN
         -- do nothing         
      ELSE
         RETURN NEXT;
         a_ids := a_ids || _article_id;
         EXIT WHEN cardinality(a_ids) >= _n;
      END IF;
   END LOOP;
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_top_n_predictions();

I'll add explanation if it works for you, since the explanation is more work than the query itself.

Apart from that, with more than a few predictions per article, and with an additional table article, this query becomes a contender:

SELECT p.*
FROM   article a
CROSS  JOIN LATERAL (
   SELECT p.article_published_date, p.article_id, p.prediction_date
   FROM   prediction p
   WHERE  p.article_id = a.id
   ORDER  BY p.prediction_date DESC
   LIMIT  1
   ) p
ORDER  BY p.article_published_date DESC;

But you don't need this if the query above does the job. Gets interesting for a bigger or no LIMIT.

Basics: