Question
The query below counts studies grouped by teacher, study year-month, and room for the past 12 months (including the current month). The result is correct, but I would also like to include rows with a zero count where data is missing.
I looked at several related posts but could not get the desired output:
- Postgres - how to return rows with 0 count for missing data?
- Postgresql group month wise with missing values
- Best way to count records by arbitrary time intervals in Rails+Postgres
Here is the query:
SELECT
upper(trim(t.full_name)) AS teacher
, date_trunc('month', s.study_dt)::date AS study_month
, r.room_code AS room
, COUNT(1) AS study_count
FROM
studies AS s
LEFT OUTER JOIN rooms AS r ON r.id = s.room_id
LEFT OUTER JOIN teacher_contacts AS tc ON tc.id = s.teacher_contact_id
LEFT OUTER JOIN teachers AS t ON t.id = tc.teacher_id
WHERE
s.study_dt BETWEEN now() - interval '13 month' AND now()
AND s.study_dt IS NOT NULL
GROUP BY
teacher
, study_month
, room
ORDER BY
teacher
, study_month
, room;
The output I get:
"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-12-01","A2",1
"DOE, JOHN","2016-01-01","B1",1
"SIMPSON, HOMER","2016-05-01","B2",3
"MOUSE, MICKEY","2015-08-01","A2",1
"MOUSE, MICKEY","2015-11-01","B1",1
"MOUSE, MICKEY","2015-11-01","B2",2
But I want a count of 0 to show for all missing year-month and room combinations. For example (just the first rows; there are 4 rooms in all: A1, A2, B1, B2):
"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-07-01","A2",0
"DOE, JOHN","2015-07-01","B1",0
"DOE, JOHN","2015-07-01","B2",0
...
"DOE, JOHN","2015-12-01","A1",1
"DOE, JOHN","2015-12-01","A2",0
"DOE, JOHN","2015-12-01","B1",0
"DOE, JOHN","2015-12-01","B2",0
...
To get the missing year-months, I tried a left outer join against a generated time series, joining on time_range.year_month = study_month, but it didn't work.
SELECT date_trunc('month', time_range)::date AS year_month
FROM generate_series(now() - interval '13 month', now() ,'1 month') AS time_range
So, I'd like to know how to 'fill in the gaps' for
a) both year-month and room and, as a bonus: b) just a year-month.
The reason for this is that the dataset will be fed to a pivot library so that we can get output similar to the following (I could not do this in PG directly):
teacher,room,2015-07,...,2015-12,...,2016-07,total
"DOE, JOHN",A1,1,...,1,...,0,2
"DOE, JOHN",A2,0,...,0,...,0,0
...and so on...
Answer 1:
Based on some assumptions (ambiguities in the question) I suggest:
SELECT upper(trim(t.full_name)) AS teacher
, m.study_month
, r.room_code AS room
, count(s.room_id) AS study_count
FROM teachers t
CROSS JOIN generate_series(date_trunc('month', now() - interval '12 month') -- 12!
, date_trunc('month', now())
, interval '1 month') m(study_month)
CROSS JOIN rooms r
LEFT JOIN ( -- parentheses!
studies s
JOIN teacher_contacts tc ON tc.id = s.teacher_contact_id -- INNER JOIN!
) ON tc.teacher_id = t.id
AND s.study_dt >= m.study_month
AND s.study_dt < m.study_month + interval '1 month' -- sargable!
AND s.room_id = r.id
GROUP BY t.id, m.study_month, r.id -- id is PK of respective tables
ORDER BY t.id, m.study_month, r.id;
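To see the pattern in isolation, here is a minimal sketch on inline sample data. It is not the query above: the CTE contents are made up, the list of months is hard-coded instead of generated, and studies is assumed to reference teachers directly (skipping teacher_contacts) to keep the example short.

```sql
-- All data below is hypothetical sample data.
WITH teachers(id, full_name) AS (
   VALUES (1, 'Doe, John'), (2, 'Mouse, Mickey')
)
, rooms(id, room_code) AS (
   VALUES (1, 'A1'), (2, 'A2')
)
, months(study_month) AS (
   VALUES (date '2015-07-01'), (date '2015-08-01')
)
, studies(teacher_id, room_id, study_dt) AS (   -- assumption: direct link to teachers
   VALUES (1, 1, date '2015-07-15')
        , (2, 2, date '2015-08-03')
        , (2, 2, date '2015-08-20')
)
SELECT t.full_name  AS teacher
     , m.study_month
     , r.room_code  AS room
     , count(s.room_id) AS study_count   -- counts matched rows only; count(*) would report 1 for empty cells
FROM   teachers t
CROSS  JOIN months m                     -- grid: every teacher x month x room
CROSS  JOIN rooms r
LEFT   JOIN studies s ON s.teacher_id = t.id
                     AND s.room_id    = r.id
                     AND s.study_dt  >= m.study_month
                     AND s.study_dt  <  m.study_month + interval '1 month'
GROUP  BY t.id, t.full_name, m.study_month, r.id, r.room_code
ORDER  BY t.id, m.study_month, r.id;
```

The grid has 2 x 2 x 2 = 8 cells, so the query returns 8 rows. Only two cells have matches (Doe, John / 2015-07 / A1 and Mouse, Mickey / 2015-08 / A2); all others show a count of 0.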
Major points
- Build a grid of all desired combinations with CROSS JOIN, and then LEFT JOIN to existing rows. Related:
  - array_agg group by and null
  - Get created as well as deleted entries of last week
- In your case, it's a join of several tables, so I use parentheses in the FROM list to LEFT JOIN to the result of the INNER JOIN within the parentheses. It would be incorrect to LEFT JOIN to each table separately, because you would include hits on partial matches and get potentially incorrect counts.
- Assuming referential integrity and working with PK columns directly, we don't need to include rooms and teachers inside the parentheses a second time. But we still have a join of two tables (studies and teacher_contacts). The role of teacher_contacts is unclear to me. Normally, I would expect a relationship between studies and teachers directly. Might be further simplified ...
- We need to count a non-null column from the LEFT JOINed side to get the desired counts, like count(s.room_id).
- To keep this fast for big tables, make sure your predicates are sargable. And add matching indexes.
- The column teacher is hardly (reliably) unique. Operate with a unique ID, preferably the PK (faster and simpler, too). I am still using teacher for the output to match your desired result. It might be wise to include a unique ID, since names can be duplicates.
You want:
the past 12 months (including current month).
So start with date_trunc('month', now() - interval '12 month') (not 13). That's rounding down the start already and does what you want - more accurately than your original query.
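On the point about sargable predicates and matching indexes, here is a rough sketch. The table definitions are stand-ins (the real ones are not shown in the question), the type of study_dt is assumed, and the index names are hypothetical. Which indexes actually help depends on the real schema and data distribution.

```sql
-- Stand-in temp tables so the sketch runs on its own in a scratch session.
-- They shadow real tables of the same name for the duration of the session.
CREATE TEMP TABLE teacher_contacts (
   id         int PRIMARY KEY
 , teacher_id int NOT NULL
);
CREATE TEMP TABLE studies (
   id                 int PRIMARY KEY
 , teacher_contact_id int
 , room_id            int
 , study_dt           timestamptz        -- assumed type
);

-- Hypothetical index names; columns follow the join and filter conditions.
CREATE INDEX studies_study_dt_idx            ON studies (study_dt);
CREATE INDEX teacher_contacts_teacher_id_idx ON teacher_contacts (teacher_id);

-- Sargable: the bare column is compared to a range,
-- so a plain index on study_dt is applicable.
SELECT count(*)
FROM   studies s
WHERE  s.study_dt >= date_trunc('month', now() - interval '12 month')
AND    s.study_dt <  date_trunc('month', now()) + interval '1 month';

-- Not sargable for that index: the column is wrapped in an expression.
SELECT count(*)
FROM   studies s
WHERE  date_trunc('month', s.study_dt) = date_trunc('month', now());
```

Comparing the output of EXPLAIN for both SELECT statements on realistic data volumes is a simple way to check whether the planner can use the index.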
Since you mentioned slow performance, depending on actual table definitions and data distribution, it's probably faster to aggregate first and join later, like in this related answer:
- Postgres - how to return rows with 0 count for missing data?
SELECT upper(trim(t.full_name)) AS teacher
, m.mon AS study_month
, r.room_code AS room
, COALESCE(s.ct, 0) AS study_count
FROM teachers t
CROSS JOIN generate_series(date_trunc('month', now() - interval '12 month') -- 12!
, date_trunc('month', now())
, interval '1 month') m(mon)  -- table alias m, column mon
CROSS JOIN rooms r
LEFT JOIN ( -- parentheses!
SELECT tc.teacher_id, date_trunc('month', s.study_dt) AS mon, s.room_id, count(*) AS ct
FROM studies s
JOIN teacher_contacts tc ON s.teacher_contact_id = tc.id
WHERE s.study_dt >= date_trunc('month', now() - interval '12 month') -- sargable
GROUP BY 1, 2, 3
) s ON s.teacher_id = t.id
AND s.mon = m.mon
AND s.room_id = r.id
ORDER BY 1, 2, 3;
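For the bonus case b) in the question (fill gaps for year-month only), one possible reading is an overall count per month. Under that assumption, the same aggregate-first approach can be sketched without the grid of teachers and rooms. The studies CTE below is made-up sample data standing in for the real table.

```sql
-- Hypothetical sample data: two studies two months ago, one right now.
WITH studies(study_dt) AS (
   VALUES (now() - interval '2 month')
        , (now() - interval '2 month')
        , (now())
)
SELECT m.mon::date         AS study_month
     , COALESCE(s.ct, 0)   AS study_count   -- 0 for months without any study
FROM   generate_series(date_trunc('month', now() - interval '12 month')
                     , date_trunc('month', now())
                     , interval '1 month') m(mon)
LEFT   JOIN (
   SELECT date_trunc('month', study_dt) AS mon, count(*) AS ct
   FROM   studies
   WHERE  study_dt >= date_trunc('month', now() - interval '12 month')
   GROUP  BY 1
   ) s USING (mon)
ORDER  BY 1;
```

This returns one row per month in the generated range. To fill gaps per teacher and month instead, keep the CROSS JOIN to teachers and the teacher_id in the subquery, and drop only the rooms part.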
About your closing remark:
the dataset would be fed to a pivot library ... (could not do this in PG directly)
Chances are you can use the two-parameter form of crosstab() to produce your desired result directly and with excellent performance, so the above query is not needed to begin with. Consider:
- PostgreSQL Crosstab Query
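As a rough illustration of the two-parameter form, here is a minimal sketch on inline sample data. It assumes the additional module tablefunc can be installed. The composite row_key column, the sample rows, and the two month labels are made up for the example.

```sql
CREATE EXTENSION IF NOT EXISTS tablefunc;   -- provides crosstab()

SELECT teacher, room
     , COALESCE("2015-07", 0) AS "2015-07"  -- crosstab() yields NULL for missing cells
     , COALESCE("2015-12", 0) AS "2015-12"
FROM   crosstab(
   $$
   SELECT teacher || '|' || room AS row_key  -- hypothetical composite key: one output row per teacher and room
        , teacher, room                      -- "extra" columns, passed through
        , mon, ct                            -- category and value come last
   FROM  (VALUES
            ('DOE, JOHN', 'A1', '2015-07', 1)
          , ('DOE, JOHN', 'A1', '2015-12', 1)
          , ('DOE, JOHN', 'A2', '2015-12', 2)
         ) v(teacher, room, mon, ct)
   ORDER  BY 1
   $$
 , $$VALUES ('2015-07'), ('2015-12')$$        -- categories, in output column order
   ) AS pivot (row_key text, teacher text, room text, "2015-07" int, "2015-12" int)
ORDER  BY teacher, room;
```

The column definition list has to be spelled out statically, so for a sliding 12-month window the statement would have to be generated by the client or by dynamic SQL.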
Answer 2:
You need to generate all the rows using a cross join, then join in studies and aggregate to get the count.
The resulting query should look like this:
select upper(trim(t.full_name)) as teacher, d.mon, r.room_code, count(s.teacher_contact_id) as study_count
from teachers t cross join
     rooms r cross join
     generate_series(date_trunc('month', now() - interval '13 month'),
                     date_trunc('month', now()),
                     interval '1 month'
                    ) d(mon) left join
     teacher_contacts tc
     on tc.teacher_id = t.id left join
     studies s
     on tc.id = s.teacher_contact_id and
        s.room_id = r.id and  -- count each study only for its own room
        date_trunc('month', s.study_dt) = d.mon
group by t.id, t.full_name, d.mon, r.room_code;
Source: https://stackoverflow.com/questions/38332433/how-to-include-missing-data-for-multiple-groupings-within-the-time-span