Get latest child per parent from big table - query is too slow

落花浮王杯 提交于 2019-12-09 22:20:01

问题


I have a query generated by Django's ORM, that is taking hours to run.

The report_rank table (50 million rows) is in a one to many relation to report_profile (100k rows). I'm trying to retrieve the latest report_rank for each report_profile.

I'm running Postgres 9.1 on an extra large Amazon EC2 server with plenty of available RAM (2GB/15GB used). Disk IO is pretty bad of course.

I have indexes on report_rank.created as well as on all foreign key fields.

What can I do to speed this query up? I'd be happy to try a different approach with the query if it will be performant, or to tune any database configuration parameters needed.

EXPLAIN 
SELECT "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created"
     , MAX(T7."created") AS "max" 
FROM "report_rank" 
LEFT OUTER JOIN "report_site" 
  ON ("report_rank"."site_id" = "report_site"."id") 
INNER JOIN "report_profile" 
  ON ("report_site"."id" = "report_profile"."site_id") 
INNER JOIN "crm_client" 
  ON ("report_profile"."client_id" = "crm_client"."id") 
INNER JOIN "auth_user" 
  ON ("crm_client"."user_id" = "auth_user"."id") 
LEFT OUTER JOIN "report_rank" T7 
  ON ("report_site"."id" = T7."site_id") 
WHERE ("auth_user"."is_active" = True  AND "crm_client"."is_deleted" = False ) 
GROUP BY "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created" 
HAVING MAX(T7."created") =  "report_rank"."created";

Output of EXPLAIN:

GroupAggregate  (cost=1136244292.46..1276589375.47 rows=48133327 width=72)
  Filter: (max(t7.created) = report_rank.created)
  ->  Sort  (cost=1136244292.46..1147889577.16 rows=4658113881 width=72)
        Sort Key: report_rank.id, report_rank.keyword_id, report_rank.site_id, report_rank.rank, report_rank.url, report_rank.competition, report_rank.source, report_rank.country, report_rank.created
        ->  Hash Join  (cost=1323766.36..6107863.59 rows=4658113881 width=72)
              Hash Cond: (report_rank.site_id = report_site.id)
              ->  Seq Scan on report_rank  (cost=0.00..1076119.27 rows=48133327 width=64)
              ->  Hash  (cost=1312601.51..1312601.51 rows=893188 width=16)
                    ->  Hash Right Join  (cost=47050.38..1312601.51 rows=893188 width=16)
                          Hash Cond: (t7.site_id = report_site.id)
                          ->  Seq Scan on report_rank t7  (cost=0.00..1076119.27 rows=48133327 width=12)
                          ->  Hash  (cost=46692.28..46692.28 rows=28648 width=8)
                                ->  Nested Loop  (cost=2201.98..46692.28 rows=28648 width=8)
                                      ->  Hash Join  (cost=2201.98..5733.23 rows=28648 width=4)
                                            Hash Cond: (crm_client.user_id = auth_user.id)
                                            ->  Hash Join  (cost=2040.73..5006.71 rows=44606 width=8)
                                                  Hash Cond: (report_profile.client_id = crm_client.id)
                                                  ->  Seq Scan on report_profile  (cost=0.00..1706.09 rows=93009 width=8)
                                                  ->  Hash  (cost=1761.98..1761.98 rows=22300 width=8)
                                                        ->  Seq Scan on crm_client  (cost=0.00..1761.98 rows=22300 width=8)
                                                              Filter: (NOT is_deleted)
                                            ->  Hash  (cost=126.85..126.85 rows=2752 width=4)
                                                  ->  Seq Scan on auth_user  (cost=0.00..126.85 rows=2752 width=4)
                                                        Filter: is_active
                                      ->  Index Scan using report_site_pkey on report_site  (cost=0.00..1.42 rows=1 width=4)
                                            Index Cond: (id = report_profile.site_id)

回答1:


The major point is most likely that you JOIN and GROUP over everything just to get max(created). Get this value separately.

You mentioned all the indexes that are needed here: on report_rank.created and on the foreign keys. You are doing alright there. (If you are interested in better than "alright", keep reading!)

The LEFT JOIN report_site will be forced to a plain JOIN by the WHERE clause. I substituted a plain JOIN. I also simplified your syntax a lot.

Updated July 2015 with simpler, faster queries and smarter functions.

Solution for multiple rows

report_rank.created is not unique and you want all the latest rows.
Using the window function rank() in a subquery.

SELECT r.id, r.keyword_id, r.site_id
     , r.rank, r.url, r.competition
     , r.source, r.country, r.created  -- same as "max"
FROM  (
   SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
   FROM   report_rank r
   WHERE  EXISTS (
      SELECT *
      FROM   report_site    s
      JOIN   report_profile p ON p.site_id = s.id
      JOIN   crm_client     c ON c.id      = p.client_id
      JOIN   auth_user      u ON u.id      = c.user_id
      WHERE  s.id = r.site_id
      AND    u.is_active
      AND    c.is_deleted = FALSE
      )
   ) sub
WHERE  rnk = 1;

Why DESC NULLS LAST?

  • Why do NULL values come first when ordering DESC in a PostgreSQL query?

Solution for one row

If report_rank.created is unique or you are satisfied with any 1 row with max(created):

SELECT id, keyword_id, site_id
     , rank, url, competition
     , source, country, created  -- same as "max"
FROM   report_rank r
WHERE  EXISTS (
    SELECT 1
    FROM   report_site    s
    JOIN   report_profile p ON p.site_id = s.id
    JOIN   crm_client     c ON c.id      = p.client_id
    JOIN   auth_user      u ON u.id      = c.user_id
    WHERE  s.id = r.site_id
    AND    u.is_active
    AND    c.is_deleted = FALSE
   )
-- AND  r.created > f_report_rank_cap()
ORDER  BY r.created DESC NULLS LAST
LIMIT  1;

Should be faster, still. More options:

  • Select first row in each GROUP BY group?

  • Optimize GROUP BY query to retrieve latest record per user

Ultimate Speed with dynamically adjusted partial index

You may have noticed the commented part in the last query:

AND  r.created > f_report_rank_cap()

You mentioned 50 mio. rows, that's a lot. Here is a way to speed things up:

  • Create a simple IMMUTABLE function returning a timestamp that's guaranteed to be older than rows of interest while being as young as possible.
  • Create a partial index on younger rows only - based on this function.
  • Use a WHERE condition in queries that matches the index condition.
  • Create another function that updates these objects to the latest row with dynamic DDL. (Minus a secure margin in case the newest row(s) get deleted / deactivated - if that can happen)
  • Invoke this secondary function at off-times with a minimum of concurrent activity per cronjob or on demand. As often as you want, can't do harm, it just needs a short exclusive lock on the table.

Here is a complete working demo.
@erikcw, you'll have to activate the commented part as instructed below.

CREATE TABLE report_rank(created timestamp);
INSERT INTO report_rank VALUES ('2011-11-11 11:11'),(now());

-- initial function
CREATE OR REPLACE FUNCTION f_report_rank_cap()
  RETURNS timestamp LANGUAGE sql COST 1 IMMUTABLE AS
$y$SELECT timestamp '-infinity'$y$;  -- or as high as you can safely bet.

-- initial index; 1st run indexes whole tbl if starting with '-infinity'
CREATE INDEX report_rank_recent_idx ON report_rank (created DESC NULLS LAST)
WHERE  created > f_report_rank_cap();

-- function to update function & reindex
CREATE OR REPLACE FUNCTION f_report_rank_set_cap()
  RETURNS void AS
$func$
DECLARE
   _secure_margin CONSTANT interval := interval '1 day';  -- adjust to your case
   _cap timestamp;  -- exclude older rows than this from partial index
BEGIN
   SELECT max(created) - _secure_margin
   FROM   report_rank
   WHERE  created > f_report_rank_cap() + _secure_margin
   /*  not needed for the demo; @erikcw needs to activate this
   AND    EXISTS (
     SELECT *
     FROM   report_site    s
     JOIN   report_profile p ON p.site_id = s.id
     JOIN   crm_client     c ON c.id      = p.client_id
     JOIN   auth_user      u ON u.id      = c.user_id
     WHERE  s.id = r.site_id
     AND    u.is_active
     AND    c.is_deleted = FALSE)
   */
   INTO   _cap;

   IF FOUND THEN
     -- recreate function
     EXECUTE format('
     CREATE OR REPLACE FUNCTION f_report_rank_cap()
       RETURNS timestamp LANGUAGE sql IMMUTABLE AS
     $y$SELECT %L::timestamp$y$', _cap);

     -- reindex
     REINDEX INDEX report_rank_recent_idx;
   END IF;
END
$func$  LANGUAGE plpgsql;

COMMENT ON FUNCTION f_report_rank_set_cap()
IS 'Dynamically recreate function f_report_rank_cap()
    and reindex partial index on report_rank.';

Call:

SELECT f_report_rank_set_cap();

See:

SELECT f_report_rank_cap();

Uncomment the clause AND r.created > f_report_rank_cap() in the query above and observe the difference. Verify that the index gets used with EXPLAIN ANALYZE.

The manual on concurrency and REINDEX:

To build the index without interfering with production you should drop the index and reissue the CREATE INDEX CONCURRENTLY command.




回答2:


-- modelled after Erwin's version
-- does the x query really return only one row?

SELECT r.id, r.keyword_id, r.site_id
    , r.rank, r.url, r.competition, r.source
    , r.country, r.created, x.max_created
-- UPDATE3: I forgot one, too
FROM report_rank r
LEFT   JOIN report_site s  ON (r.site_id = s.id) 
JOIN   report_profile   p  ON (s.id = p.site_id) 
JOIN   crm_client       c  ON (p.client_id = c.id) 
JOIN   auth_user        u  ON (c.user_id = u.id)
-- UPDATE2: t7 has left the building
WHERE  u.is_active
AND    c.is_deleted = FALSE
AND NOT EXISTS (SELECT * FROM report_rank x
       -- WHERE 1=1 -- uncorrelated subquery ??
       -- UPDATE1: no it's not. Erwin seems to have forgotten the t7 join
       WHERE r.id = x.site_id
       AND x.created > r.created
       ) 
;



回答3:


Alternative interpretation

I was busy optimizing the query you presented and missed a piece of what you wrote:

I'm trying to retrieve the latest report_rank for each report_profile.

Which is something different entirely to what your query is trying to do.

First, let me demonstrate how I distilled the query from what you posted.
I removed the "" and noise words, used aliases and trimmed the format, arriving at this:

SELECT r.id, r.keyword_id, r.site_id, r.rank, r.url, r.competition
      ,r.source, r.country, r.created
      ,MAX(t7.created) AS max 
FROM   report_rank      r
LEFT   JOIN report_site s  ON (s.id      = r.site_id) 
JOIN   report_profile   p  ON (p.site_id = s.id) 
JOIN   crm_client       c  ON (c.id      = p.client_id) 
JOIN   auth_user        u  ON (u.id      = c.user_id) 
LEFT   JOIN report_rank t7 ON (t.site_id = s.id) 
WHERE  u.is_active
AND    c.is_deleted = False
GROUP  BY
       r.id
      ,r.keyword_id
      ,r.site_id
      ,r.rank
      ,r.url, r.competition
      ,r.source
      ,r.country
      ,r.created 
HAVING MAX(t7.created) =  r.created;
  • What you are trying to do with T7 and HAVING cannot work on principal, I pruned that.
  • LEFT JOIN will be forced to a plain JOIN in both cases. I substituted accordingly.
  • From your query I deduced that report_site is in a 1:n relationship to both report_rank and report_profile, and that's how those two are linked. Therefore, report_profile that belong to the same report_site share the same latest report_rank. You might as well group by report_site. But I sticked to the question asked.
  • I eliminated report_site from the query. It's irrelevant, as long as it exists, which I assert.
  • Since PostgreSQL 9.1 it's enough to GROUP BY the primary key per table. I simplified accordingly.
  • For simplification I selected all columns of report_rank

With all that, I arrived at this basic query:

SELECT r.*
FROM   report_rank    r
JOIN   report_profile p USING (site_id) 
JOIN   crm_client     c ON (c.id = p.client_id) 
JOIN   auth_user      u ON (u.id = c.user_id) 
WHERE  u.is_active
AND    c.is_deleted = FALSE
GROUP  BY r.id;

Building on this, I created a solution with the ...

Latest report_rank for each report_profile

WITH p AS (
    SELECT p.id AS profile_id
          ,p.site_id
    FROM   report_profile p
    WHERE  EXISTS (
        SELECT *
        FROM   crm_client c
        JOIN   auth_user  u ON u.id = c.user_id
        WHERE  c.id = p.client_id
        AND    c.is_deleted = FALSE
        AND    u.is_active
        )
    ) x AS (
    SELECT p.profile_id
          ,r.*
    FROM   p
    JOIN   report_rank r USING (site_id)
    )
SELECT *
FROM   x
WHERE  NOT EXISTS (
    SELECT *
    FROM   x r
    WHERE  r.profile_id = x.profile_id
    AND    r.created > x.created
    );
  • I assume there is a report_profile.id though you did not mention it.
  • In the 1st CTE I get a unique SET of valid profiles.
  • In the 2nd CTE I join with report_rank to produce the resulting rows
  • In the final query I eliminate all but the latest report_rank per report_profile
  • Can be one or more row, if created is not unique.
  • The solution with the partial index in my other answer is not applicable with this variant.

Finally, advice for performance optimization from the PostgreSQL wiki:

  • Tune the server settings, be sure to allocate a lot of memory in the right place.
  • Check your hardware.


来源:https://stackoverflow.com/questions/8087861/get-latest-child-per-parent-from-big-table-query-is-too-slow

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!