How to include missing data for multiple groupings within the time span?

问题

I have below referenced query which groups studies counts by teacher, study year-month, and room for the past 12 months (including current month). The result I get is correct, however, I would like to include rows with zero counts for when the data is missing.

I looked at several other related posts but could not get desired output:

Postgres - how to return rows with 0 count for missing data?
Postgresql group month wise with missing values
Best way to count records by arbitrary time intervals in Rails+Postgres

Here is the query:

SELECT
    upper(trim(t.full_name))               AS teacher
  , date_trunc('month', s.study_dt)::date  AS study_month
  , r.room_code                            AS room
  , COUNT(1)                               AS study_count
FROM
  studies                           AS s
  LEFT OUTER JOIN rooms             AS r   ON r.id = s.room_id
  LEFT OUTER JOIN teacher_contacts  AS tc  ON tc.id = s.teacher_contact_id
  LEFT OUTER JOIN teachers          AS t   ON t.id = tc.teacher_id 
WHERE
  s.study_dt BETWEEN now() - interval '13 month' AND now()
  AND s.study_dt IS NOT NULL
GROUP BY
    teacher
  , study_month
  , room
ORDER BY 
    teacher  
  , study_month
  , room;

The output I get:

"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-12-01","A2",1
"DOE, JOHN","2016-01-01","B1",1
"SIMPSON, HOMER","2016-05-01","B2",3
"MOUSE, MICKEY","2015-08-01","A2",1
"MOUSE, MICKEY","2015-11-01","B1",1
"MOUSE, MICKEY","2015-11-01","B2",2

But I want count of 0 to show for all missing year-month and room combinations. For example (just first rows, there are 4 rooms in all: A1, A2, B1, B2):

"teacher","study_month","room","study_count"
"DOE, JOHN","2015-07-01","A1",1
"DOE, JOHN","2015-07-01","A2",0
"DOE, JOHN","2015-07-01","B1",0
"DOE, JOHN","2015-07-01","B2",0
...
"DOE, JOHN","2015-12-01","A1",1
"DOE, JOHN","2015-12-01","A2",0
"DOE, JOHN","2015-12-01","B1",0
"DOE, JOHN","2015-12-01","B2",0
...

To get the missing year-months, I tried left outer join on using time series and joining on time_range.year_month = study_month, but it didn't work.

SELECT date_trunc('month', time_range)::date AS year_month
FROM generate_series(now() - interval '13 month', now() ,'1 month') AS time_range

So, I'd like to know how to 'fill in the gaps' for

a) both year-month and room and, as a bonus: b) just a year-month.

The reason for this is that the dataset would be fed to a pivot library to that we can get an output similar to following (could not do this in PG directly):

teacher,room,2015-07,...,2015-12,...,2016-07,total
"DOE, JOHN",A1,1,...,1,...,0,2
"DOE, JOHN",A2,0,...,0,...,0,0
...and so on...

回答1:

Based on some assumptions (ambiguities in the question) I suggest:

SELECT upper(trim(t.full_name)) AS teacher
     , m.study_month
     , r.room_code              AS room
     , count(s.room_id)         AS study_count

FROM   teachers t
CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                          , date_trunc('month', now())
                          , interval '1 month') m(study_month)
CROSS  JOIN rooms r
LEFT   JOIN (                                                  -- parentheses!
          studies s
   JOIN   teacher_contacts tc ON tc.id = s.teacher_contact_id  -- INNER JOIN!
   ) ON tc.teacher_id = t.id
    AND s.study_dt >= m.study_month
    AND s.study_dt <  m.study_month + interval '1 month'      -- sargable!
    AND s.room_id = r.id
GROUP  BY t.id, m.study_month, r.id  -- id is PK of respective tables
ORDER  BY t.id, m.study_month, r.id;

Major points

Build a grid of all desired combinations with CROSS JOIN. And then LEFT JOIN to existing rows. Related:
- array_agg group by and null
- Get created as well as deleted entries of last week
In your case, it's a join of several tables, so I use parentheses in the FROM list to LEFT JOIN to the result of INNER JOIN within the parentheses. It would be incorrect to LEFT JOIN to each table separately, because you would include hits on partial matches and get potentially incorrect counts.
Assuming referential integrity and working with PK columns directly, we don't need to include rooms and teachers on the left side a second time. But we still have a join of two tables (studies and teacher_contacts). The role of teacher_contacts is unclear to me. Normally, I would expect a relationship between studies and teachers directly. Might be further simplified ...
We need to count a non-null column on the left side to get the desired counts. Like count(s.room_id)
To keep this fast for big tables, make sure your predicates are sargable. And add matching indexes.
The column teacher is hardly (reliably) unique. Operate with a unique ID, preferably the PK (faster and simpler, too). I am still using teacher for the output to match your desired result. It might be wise to include a unique ID, since names can be duplicates.
You want:

the past 12 months (including current month).

So start with date_trunc('month', now() - interval '12 month' (not 13). That's rounding down the start already and does what you want - more accurately than your original query.

Since you mentioned slow performance, depending on actual table definitions and data distribution, it's probably faster to aggregate first and join later, like in this related answer:

Postgres - how to return rows with 0 count for missing data?

SELECT upper(trim(t.full_name)) AS teacher
     , m.mon                    AS study_month
     , r.room_code              AS room
     , COALESCE(s.ct, 0)        AS study_count

FROM   teachers t
CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                          , date_trunc('month', now())
                          , interval '1 month') mon
CROSS  JOIN rooms r
LEFT   JOIN (                                                  -- parentheses!
   SELECT tc.teacher_id, date_trunc('month', s.study_dt) AS mon, s.room_id, count(*) AS ct
   FROM   studies s
   JOIN   teacher_contacts tc ON s.teacher_contact_id = tc.id
   WHERE  s.study_dt >= date_trunc('month', now() - interval '12 month')  -- sargable
   GROUP  BY 1, 2, 3
   ) s ON s.teacher_id = t.id
      AND s.mon = m.mon
      AND s.room_id = r.id
ORDER  BY 1, 2, 3;

About your closing remark:

the dataset would be fed to a pivot library ... (could not do this in PG directly)

Chances are you can use the two-parameter form of crosstab() to produce your desired result directly and with excellent performance and the above query is not needed to begin with. Consider:

PostgreSQL Crosstab Query

回答2:

You need to generate all the rows using a cross join and then join in studies and do an aggregation to get the count.

The resulting query should look like this:

select t.teacher, d.mon, r.room_code, count(s.teacher_contact_id)
from teachers t cross join
     rooms r cross join
     generate_series(date_trunc('month', now() - interval '13 month', 
                     date_trunc('month', now()),
                     interval '1 month'
                    ) d(mon) left join
     (select distinct date_trunc('month', s.study_dt)::date as mon) d left join
     teacher_contacts tc
     on tc.teacher_id = t.id left join
     studies s
     on tc.id = s.teacher_contact_id and
        date_trunc('month', s.study_dt) = d.mon
group by t.teacher, d.mon, r.room_code;

来源：https://stackoverflow.com/questions/38332433/how-to-include-missing-data-for-multiple-groupings-within-the-time-span

标签

sql

postgresql

group-by

left-join

aggregate-functions