Get latest child per parent from big table - query is too slow

问题

I have a query generated by Django's ORM, that is taking hours to run.

The report_rank table (50 million rows) is in a one to many relation to report_profile (100k rows). I'm trying to retrieve the latest report_rank for each report_profile.

I'm running Postgres 9.1 on an extra large Amazon EC2 server with plenty of available RAM (2GB/15GB used). Disk IO is pretty bad of course.

I have indexes on report_rank.created as well as on all foreign key fields.

What can I do to speed this query up? I'd be happy to try a different approach with the query if it will be performant, or to tune any database configuration parameters needed.

EXPLAIN 
SELECT "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created"
     , MAX(T7."created") AS "max" 
FROM "report_rank" 
LEFT OUTER JOIN "report_site" 
  ON ("report_rank"."site_id" = "report_site"."id") 
INNER JOIN "report_profile" 
  ON ("report_site"."id" = "report_profile"."site_id") 
INNER JOIN "crm_client" 
  ON ("report_profile"."client_id" = "crm_client"."id") 
INNER JOIN "auth_user" 
  ON ("crm_client"."user_id" = "auth_user"."id") 
LEFT OUTER JOIN "report_rank" T7 
  ON ("report_site"."id" = T7."site_id") 
WHERE ("auth_user"."is_active" = True  AND "crm_client"."is_deleted" = False ) 
GROUP BY "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created" 
HAVING MAX(T7."created") =  "report_rank"."created";

Output of EXPLAIN:

GroupAggregate  (cost=1136244292.46..1276589375.47 rows=48133327 width=72)
  Filter: (max(t7.created) = report_rank.created)
  ->  Sort  (cost=1136244292.46..1147889577.16 rows=4658113881 width=72)
        Sort Key: report_rank.id, report_rank.keyword_id, report_rank.site_id, report_rank.rank, report_rank.url, report_rank.competition, report_rank.source, report_rank.country, report_rank.created
        ->  Hash Join  (cost=1323766.36..6107863.59 rows=4658113881 width=72)
              Hash Cond: (report_rank.site_id = report_site.id)
              ->  Seq Scan on report_rank  (cost=0.00..1076119.27 rows=48133327 width=64)
              ->  Hash  (cost=1312601.51..1312601.51 rows=893188 width=16)
                    ->  Hash Right Join  (cost=47050.38..1312601.51 rows=893188 width=16)
                          Hash Cond: (t7.site_id = report_site.id)
                          ->  Seq Scan on report_rank t7  (cost=0.00..1076119.27 rows=48133327 width=12)
                          ->  Hash  (cost=46692.28..46692.28 rows=28648 width=8)
                                ->  Nested Loop  (cost=2201.98..46692.28 rows=28648 width=8)
                                      ->  Hash Join  (cost=2201.98..5733.23 rows=28648 width=4)
                                            Hash Cond: (crm_client.user_id = auth_user.id)
                                            ->  Hash Join  (cost=2040.73..5006.71 rows=44606 width=8)
                                                  Hash Cond: (report_profile.client_id = crm_client.id)
                                                  ->  Seq Scan on report_profile  (cost=0.00..1706.09 rows=93009 width=8)
                                                  ->  Hash  (cost=1761.98..1761.98 rows=22300 width=8)
                                                        ->  Seq Scan on crm_client  (cost=0.00..1761.98 rows=22300 width=8)
                                                              Filter: (NOT is_deleted)
                                            ->  Hash  (cost=126.85..126.85 rows=2752 width=4)
                                                  ->  Seq Scan on auth_user  (cost=0.00..126.85 rows=2752 width=4)
                                                        Filter: is_active
                                      ->  Index Scan using report_site_pkey on report_site  (cost=0.00..1.42 rows=1 width=4)
                                            Index Cond: (id = report_profile.site_id)

回答1:

The major point is most likely that you JOIN and GROUP over everything just to get max(created). Get this value separately.

You mentioned all the indexes that are needed here: on report_rank.created and on the foreign keys. You are doing alright there. (If you are interested in better than "alright", keep reading!)

The LEFT JOIN report_site will be forced to a plain JOIN by the WHERE clause. I substituted a plain JOIN. I also simplified your syntax a lot.

Updated July 2015 with simpler, faster queries and smarter functions.

Solution for multiple rows

report_rank.created is not unique and you want all the latest rows.
Using the window function rank() in a subquery.

SELECT r.id, r.keyword_id, r.site_id
     , r.rank, r.url, r.competition
     , r.source, r.country, r.created  -- same as "max"
FROM  (
   SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
   FROM   report_rank r
   WHERE  EXISTS (
      SELECT *
      FROM   report_site    s
      JOIN   report_profile p ON p.site_id = s.id
      JOIN   crm_client     c ON c.id      = p.client_id
      JOIN   auth_user      u ON u.id      = c.user_id
      WHERE  s.id = r.site_id
      AND    u.is_active
      AND    c.is_deleted = FALSE
      )
   ) sub
WHERE  rnk = 1;

Why DESC NULLS LAST?

Why do NULL values come first when ordering DESC in a PostgreSQL query?

Solution for one row

If report_rank.created is unique or you are satisfied with any 1 row with max(created):

SELECT id, keyword_id, site_id
     , rank, url, competition
     , source, country, created  -- same as "max"
FROM   report_rank r
WHERE  EXISTS (
    SELECT 1
    FROM   report_site    s
    JOIN   report_profile p ON p.site_id = s.id
    JOIN   crm_client     c ON c.id      = p.client_id
    JOIN   auth_user      u ON u.id      = c.user_id
    WHERE  s.id = r.site_id
    AND    u.is_active
    AND    c.is_deleted = FALSE
   )
-- AND  r.created > f_report_rank_cap()
ORDER  BY r.created DESC NULLS LAST
LIMIT  1;

Should be faster, still. More options:

Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest record per user

Ultimate Speed with dynamically adjusted partial index

You may have noticed the commented part in the last query:

AND  r.created > f_report_rank_cap()

You mentioned 50 mio. rows, that's a lot. Here is a way to speed things up:

Create a simple IMMUTABLE function returning a timestamp that's guaranteed to be older than rows of interest while being as young as possible.
Create a partial index on younger rows only - based on this function.
Use a WHERE condition in queries that matches the index condition.
Create another function that updates these objects to the latest row with dynamic DDL. (Minus a secure margin in case the newest row(s) get deleted / deactivated - if that can happen)
Invoke this secondary function at off-times with a minimum of concurrent activity per cronjob or on demand. As often as you want, can't do harm, it just needs a short exclusive lock on the table.

Here is a complete working demo.
@erikcw, you'll have to activate the commented part as instructed below.

CREATE TABLE report_rank(created timestamp);
INSERT INTO report_rank VALUES ('2011-11-11 11:11'),(now());

-- initial function
CREATE OR REPLACE FUNCTION f_report_rank_cap()
  RETURNS timestamp LANGUAGE sql COST 1 IMMUTABLE AS
$y$SELECT timestamp '-infinity'$y$;  -- or as high as you can safely bet.

-- initial index; 1st run indexes whole tbl if starting with '-infinity'
CREATE INDEX report_rank_recent_idx ON report_rank (created DESC NULLS LAST)
WHERE  created > f_report_rank_cap();

-- function to update function & reindex
CREATE OR REPLACE FUNCTION f_report_rank_set_cap()
  RETURNS void AS
$func$
DECLARE
   _secure_margin CONSTANT interval := interval '1 day';  -- adjust to your case
   _cap timestamp;  -- exclude older rows than this from partial index
BEGIN
   SELECT max(created) - _secure_margin
   FROM   report_rank
   WHERE  created > f_report_rank_cap() + _secure_margin
   /*  not needed for the demo; @erikcw needs to activate this
   AND    EXISTS (
     SELECT *
     FROM   report_site    s
     JOIN   report_profile p ON p.site_id = s.id
     JOIN   crm_client     c ON c.id      = p.client_id
     JOIN   auth_user      u ON u.id      = c.user_id
     WHERE  s.id = r.site_id
     AND    u.is_active
     AND    c.is_deleted = FALSE)
   */
   INTO   _cap;

   IF FOUND THEN
     -- recreate function
     EXECUTE format('
     CREATE OR REPLACE FUNCTION f_report_rank_cap()
       RETURNS timestamp LANGUAGE sql IMMUTABLE AS
     $y$SELECT %L::timestamp$y$', _cap);

     -- reindex
     REINDEX INDEX report_rank_recent_idx;
   END IF;
END
$func$  LANGUAGE plpgsql;

COMMENT ON FUNCTION f_report_rank_set_cap()
IS 'Dynamically recreate function f_report_rank_cap()
    and reindex partial index on report_rank.';

Call:

SELECT f_report_rank_set_cap();

See:

SELECT f_report_rank_cap();

Uncomment the clause AND r.created > f_report_rank_cap() in the query above and observe the difference. Verify that the index gets used with EXPLAIN ANALYZE.

The manual on concurrency and REINDEX:

To build the index without interfering with production you should drop the index and reissue the CREATE INDEX CONCURRENTLY command.

回答2:

-- modelled after Erwin's version
-- does the x query really return only one row?

SELECT r.id, r.keyword_id, r.site_id
    , r.rank, r.url, r.competition, r.source
    , r.country, r.created, x.max_created
-- UPDATE3: I forgot one, too
FROM report_rank r
LEFT   JOIN report_site s  ON (r.site_id = s.id) 
JOIN   report_profile   p  ON (s.id = p.site_id) 
JOIN   crm_client       c  ON (p.client_id = c.id) 
JOIN   auth_user        u  ON (c.user_id = u.id)
-- UPDATE2: t7 has left the building
WHERE  u.is_active
AND    c.is_deleted = FALSE
AND NOT EXISTS (SELECT * FROM report_rank x
       -- WHERE 1=1 -- uncorrelated subquery ??
       -- UPDATE1: no it's not. Erwin seems to have forgotten the t7 join
       WHERE r.id = x.site_id
       AND x.created > r.created
       ) 
;

回答3:

Alternative interpretation

I was busy optimizing the query you presented and missed a piece of what you wrote:

I'm trying to retrieve the latest report_rank for each report_profile.

Which is something different entirely to what your query is trying to do.

First, let me demonstrate how I distilled the query from what you posted.
I removed the "" and noise words, used aliases and trimmed the format, arriving at this:

SELECT r.id, r.keyword_id, r.site_id, r.rank, r.url, r.competition
      ,r.source, r.country, r.created
      ,MAX(t7.created) AS max 
FROM   report_rank      r
LEFT   JOIN report_site s  ON (s.id      = r.site_id) 
JOIN   report_profile   p  ON (p.site_id = s.id) 
JOIN   crm_client       c  ON (c.id      = p.client_id) 
JOIN   auth_user        u  ON (u.id      = c.user_id) 
LEFT   JOIN report_rank t7 ON (t.site_id = s.id) 
WHERE  u.is_active
AND    c.is_deleted = False
GROUP  BY
       r.id
      ,r.keyword_id
      ,r.site_id
      ,r.rank
      ,r.url, r.competition
      ,r.source
      ,r.country
      ,r.created 
HAVING MAX(t7.created) =  r.created;

What you are trying to do with T7 and HAVING cannot work on principal, I pruned that.
LEFT JOIN will be forced to a plain JOIN in both cases. I substituted accordingly.
From your query I deduced that report_site is in a 1:n relationship to both report_rank and report_profile, and that's how those two are linked. Therefore, report_profile that belong to the same report_site share the same latest report_rank. You might as well group by report_site. But I sticked to the question asked.
I eliminated report_site from the query. It's irrelevant, as long as it exists, which I assert.
Since PostgreSQL 9.1 it's enough to GROUP BY the primary key per table. I simplified accordingly.
For simplification I selected all columns of report_rank

With all that, I arrived at this basic query:

SELECT r.*
FROM   report_rank    r
JOIN   report_profile p USING (site_id) 
JOIN   crm_client     c ON (c.id = p.client_id) 
JOIN   auth_user      u ON (u.id = c.user_id) 
WHERE  u.is_active
AND    c.is_deleted = FALSE
GROUP  BY r.id;

Building on this, I created a solution with the ...

Latest `report_rank` for each `report_profile`

WITH p AS (
    SELECT p.id AS profile_id
          ,p.site_id
    FROM   report_profile p
    WHERE  EXISTS (
        SELECT *
        FROM   crm_client c
        JOIN   auth_user  u ON u.id = c.user_id
        WHERE  c.id = p.client_id
        AND    c.is_deleted = FALSE
        AND    u.is_active
        )
    ) x AS (
    SELECT p.profile_id
          ,r.*
    FROM   p
    JOIN   report_rank r USING (site_id)
    )
SELECT *
FROM   x
WHERE  NOT EXISTS (
    SELECT *
    FROM   x r
    WHERE  r.profile_id = x.profile_id
    AND    r.created > x.created
    );

I assume there is a report_profile.id though you did not mention it.
In the 1st CTE I get a unique SET of valid profiles.
In the 2nd CTE I join with report_rank to produce the resulting rows
In the final query I eliminate all but the latest report_rank per report_profile
Can be one or more row, if created is not unique.
The solution with the partial index in my other answer is not applicable with this variant.

Finally, advice for performance optimization from the PostgreSQL wiki: