Optimizing multiple joins

悲&欢浪女 2021-02-04 07:14

I'm trying to figure out a way to speed up a particularly cumbersome query which aggregates some data by date across a couple of tables. The full (ugly) query is below along w…

3 Answers
  •  我寻月下人不归
    2021-02-04 07:26

    There are always 2 things to consider when optimising queries:

    • What indexes can be used (you may need to create indexes)
    • How the query is written (you may need to change the query so that the query optimiser is able to find appropriate indexes, and does not re-read data redundantly)
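    For the query in question, the indexes most likely to help cover the date column being ranged over and the join key. A sketch, assuming the table and column names from the original query (the index names themselves are illustrative):

    ```sql
    -- Supports range scans on the date column (the join condition rewritten below)
    CREATE INDEX body_time_idx ON body (body_time);

    -- Supports the join to envelope and the filter on envelope_command
    CREATE INDEX envelope_message_idx ON envelope (message_id, envelope_command);
    ```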

    A few observations:

    • You are performing date manipulations before you join your dates. As a general rule this will prevent a query optimiser from using an index even if one exists. You should try to write your expressions so that indexed columns appear unaltered on one side of the comparison.
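    To illustrate the point above (column names taken from the query in this question): a predicate that wraps the indexed column in a function defeats a plain index on that column, while the rewritten range form allows an index range scan.

    ```sql
    -- Not sargable: body_time is wrapped in a function, so a plain
    -- index on body_time cannot be used for this comparison
    WHERE date_trunc('day', b.body_time) = p.period

    -- Sargable: body_time appears unaltered on one side, so the
    -- optimiser can use an index range scan on body_time
    WHERE b.body_time >= p.period
      AND b.body_time <  p.period + INTERVAL '1 DAY'
    ```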

    • Your subqueries are filtering to the same date range as generate_series. This duplication limits the optimiser's ability to choose the most efficient plan. I suspect it may have been written in to improve performance because the optimiser was unable to use an index on the date column (body_time)?

    • NOTE: We would actually very much like to use an index on Body.body_time

    • ORDER BY within the subqueries is at best redundant. At worst it can force the query optimiser to sort the result set before joining, and that is not necessarily good for the query plan. Rather, apply ordering only once, right at the end, for final display.

    • Use of LEFT JOIN in your subqueries is inappropriate. Assuming you're using ANSI conventions for NULL behaviour (and you should be), any outer joins to envelope would return envelope_command=NULL, and these would consequently be excluded by the condition envelope_command=?.
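    A minimal illustration of why the LEFT JOIN is inappropriate here: the WHERE filter discards exactly the rows the outer join preserved, so the query behaves as an INNER JOIN anyway.

    ```sql
    SELECT b.body_size
    FROM   body b
           LEFT JOIN envelope e
             ON e.message_id = b.message_id
    -- Unmatched body rows get envelope_command = NULL; NULL = 1 is not
    -- true, so those rows are filtered out and the LEFT JOIN
    -- degenerates to an INNER JOIN
    WHERE  e.envelope_command = 1
    ```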

    • Subqueries o and i are almost identical save for the envelope_command value. This forces the optimiser to scan the same underlying tables twice. You can use a pivot table technique to join to the data once, and split the values into 2 columns.

    Try the following which uses the pivot technique:

    SELECT  p.period,
            /*The pivot technique in action...*/
            SUM(
            CASE WHEN envelope_command = 1 THEN body_size
            ELSE 0
            END) AS Outbound,
            SUM(
            CASE WHEN envelope_command = 2 THEN body_size
            ELSE 0
            END) AS Inbound
    FROM    (
            SELECT  date '2009-10-01' + s.day AS period
            FROM    generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
            ) AS p 
            /*The left JOIN is justified to ensure ALL generated dates are returned
              Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
            LEFT OUTER JOIN (
            SELECT  b.body_size,
                    b.body_time,
                    e.envelope_command
            FROM    body AS b 
                    INNER JOIN envelope e 
                      ON e.message_id = b.message_id 
            WHERE   envelope_command IN (1, 2)
            ) d
              /*The expressions below allow the optimser to use an index on body_time if 
                the statistics indicate it would be beneficial*/
              ON d.body_time >= p.period
             AND d.body_time < p.period + INTERVAL '1 DAY'
    GROUP BY p.period
    ORDER BY p.period
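    As an aside: on PostgreSQL 9.4 and later, the same pivot can be expressed with the FILTER clause, which some find clearer than CASE. A sketch of just the two aggregate expressions (note that FILTER yields NULL rather than 0 when no rows match, hence the COALESCE):

    ```sql
    SELECT  p.period,
            COALESCE(SUM(body_size) FILTER (WHERE envelope_command = 1), 0) AS Outbound,
            COALESCE(SUM(body_size) FILTER (WHERE envelope_command = 2), 0) AS Inbound
    ```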
    

    EDIT: Added filter suggested by Tom H.
