I\'m trying to figure out a way to speed up a particularly cumbersome query which aggregates some data by date across a couple of tables. The full (ugly) query is below along w
There are always 2 things to consider when optimising queries:
A few observations:
You are performing date manipulations before you join your dates. As a general rule this will prevent a query optimser from using an index even if it exists. You should try to write your expressions in such a way that indexed columns exist unaltered on one side of the expression.
Your subqueries are filtering to the same date range as generate_series. This is a duplication, and it limits the optimser's ability to choose the most efficient optimisation. I suspect that may have been written in to improve performance because the optimser was unable to use an index on the date column (body_time)?
NOTE: We would actually very much like to use an index on Body.body_time
ORDER BY within the subqueries is at best redundant. At worst it could force the query optimiser to sort the result set before joining; and that is not necessarily good for the query plan. Rather only apply ordering right at the end for final display.
Use of LEFT JOIN in your subqueries is inappropriate. Assuming you're using ANSI conventions for NULL behaviour (and you should be), any outer joins to envelope would return envelope_command=NULL, and these would consequently be excluded by the condition envelope_command=?.
Subqueries o and i are almost identical save for the envelope_command value. This forces the optimser to scan the same underlying tables twice. You can use a pivot table technique to join to the data once, and split the values into 2 columns.
Try the following which uses the pivot technique:
SELECT p.period,
/*The pivot technique in action...*/
SUM(
CASE WHEN envelope_command = 1 THEN body_size
ELSE 0
END) AS Outbound,
SUM(
CASE WHEN envelope_command = 2 THEN body_size
ELSE 0
END) AS Inbound
FROM (
SELECT date '2009-10-01' + s.day AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p
/*The left JOIN is justified to ensure ALL generated dates are returned
Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
LEFT OUTER JOIN (
SELECT b.body_size,
b.body_time,
e.envelope_command
FROM body AS b
INNER JOIN envelope e
ON e.message_id = b.message_id
WHERE envelope_command IN (1, 2)
) d
/*The expressions below allow the optimser to use an index on body_time if
the statistics indicate it would be beneficial*/
ON d.body_time >= p.period
AND d.body_time < p.period + INTERVAL '1 DAY'
GROUP BY p.Period
ORDER BY p.Period
EDIT: Added filter suggested by Tom H.