Handling of generate_series() in queries with date or timestamp with / without time zone

喜你入骨 提交于 2021-01-04 03:16:43

问题


I have a query to generate a report based on a date series that is grouped by date and employee_id. The date should be based on a particular time zone, in this case 'Asia/Kuala_Lumpur'. But this can change depending on where the user's time zone is.


SELECT 
     d::date AT TIME ZONE 'Asia/Kuala_Lumpur' AS created_date,  
     e.id,  
     e.name,
     e.division_id,
     ARRAY_AGG(
       a.id
     ) as rows,        
     MIN(a.created_at) FILTER (WHERE a.activity_type = 1) as min_time_in,
     MAX(a.created_at) FILTER (WHERE a.activity_type = 2) as max_time_out,
     ARRAY_AGG(
       CASE
           WHEN a.activity_type = 1
           THEN a.created_at
           ELSE NULL
       END
     ) as check_ins,
     ARRAY_AGG(
       CASE
           WHEN a.activity_type = 2
           THEN a.created_at
           ELSE NULL
       END
     ) as check_outs        
FROM    (SELECT MIN(created_at), MAX(created_at) FROM attendance) AS r(startdate,enddate)
  , generate_series(
        startdate::timestamp, 
        enddate::timestamp, 
        interval '1 day') g(d)
    CROSS JOIN  employee e
    LEFT JOIN   attendance a ON a.created_at::date = d::date AND e.id = a.employee_id
    where d::date = date '2020-11-20' and division_id = 1
GROUP BY 
    created_date
  , e.id
  , e.name
  , e.division_id
ORDER BY 
    created_date
  , e.id;

Definition and sample data for table attendance:

CREATE TABLE attendance (
    id int,
    employee_id int,
    activity_type int,
    created_at timestamp with time zone NOT NULL
);

INSERT INTO attendance VALUES
( 1, 1, 1,'2020-11-18 07:10:25 +00:00'),
( 2, 2, 1,'2020-11-18 07:30:25 +00:00'),
( 3, 3, 1,'2020-11-18 07:50:25 +00:00'),
( 4, 2, 2,'2020-11-18 19:10:25 +00:00'),
( 5, 3, 2,'2020-11-18 19:22:38 +00:00'),
( 6, 1, 2,'2020-11-18 20:01:05 +00:00'),
( 7, 1, 1,'2020-11-19 07:11:23 +00:00'),
( 8, 1, 2,'2020-11-19 16:21:53 +00:00'), <-- Asia/Kuala_Lumpur +8 should be in 20.11 (refer to the check_outs field in the results output)
( 9, 1, 1,'2020-11-19 19:11:23 +00:00'), <-- Asia/Kuala_Lumpur +8 should be in 20.11 (refer to the check_ins field in the results output)
(10, 1, 2,'2020-11-19 20:21:53 +00:00'), <-- Asia/Kuala_Lumpur +8 should be in 20.11 (refer to the check_outs field in the results output)
(11, 1, 1,'2020-11-20 07:41:38 +00:00'),
(12, 1, 2,'2020-11-20 08:52:01 +00:00');

Here is a fiddle to test.

The query does not include rows 8-10 in the output for the time zone Asia/Kuala_Lumpur +8, though it should. Result shows "rows" field 11,12.

How can I fix the query so that it generates the report based on dates of a given time zone? (Meaning I can change Asia/Kuala_Lumpur to America/New_York etc.)

I was told to do something like this:

where created_at >= timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur'
and   created_at <  timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur' + interval '1 day'

But I am not sure how to apply it. Does not seem to work properly in this fiddle. It should include rows 8,9,10,11,12 but only rows 8,9,10 show up.


回答1:


DB design

Consider some modifications to your setup:

CREATE TABLE employee (
  id           int PRIMARY KEY  -- !
, name         text             -- do NOT use char(n) !
, division_id  int
);

CREATE  TABLE attendance (
  id             int PRIMARY KEY  --!
, employee_id    int NOT NULL REFERENCES employee -- FK!
, activity_type  int
, created_at     timestamptz NOT NULL
);

Defining a PK makes it easier to aggregate rows, because the PK covers the whole row in the GROUP BY clause. See:

  • Why can't I exclude dependent columns from `GROUP BY` when I aggregate by a key?

I wouldn't use "name" as column name. It's not descriptive. Every other column could be named "name". Consider:

  • Any downsides of using data type "text" for storing strings?
  • How to implement a many-to-many relationship in PostgreSQL?

Query

SELECT *
FROM  (        -- complete employee/date grid for division in range
   SELECT g.d::date AS the_date, id AS employee_id, name, division_id
   FROM  (
      SELECT generate_series(MIN(created_at) AT TIME ZONE 'Asia/Kuala_Lumpur'
                           , MAX(created_at) AT TIME ZONE 'Asia/Kuala_Lumpur'
                           , interval '1 day')
      FROM   attendance
      ) g(d)
   CROSS  JOIN employee e
   WHERE  e.division_id = 1
   ) de
LEFT   JOIN (  -- checkins & checkouts per employee/date for division in range
   SELECT employee_id, ts::date AS the_date
        , array_agg(id) as rows
        , min(ts)             FILTER (WHERE activity_type = 1) AS min_check_in
        , max(ts)             FILTER (WHERE activity_type = 2) AS max_check_out
        , array_agg(ts::time) FILTER (WHERE activity_type = 1) AS check_ins
        , array_agg(ts::time) FILTER (WHERE activity_type = 2) AS check_outs
   FROM  (
      SELECT a.id, a.employee_id, a.activity_type, a.created_at AT TIME ZONE 'Asia/Kuala_Lumpur' AS ts  -- convert to timestamp
      FROM   employee   e
      JOIN   attendance a ON a.employee_id = e.id
   -- WHERE  a.created_at >= timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur' -- "sargable" expressions
   -- AND    a.created_at <  timestamp '2020-11-21' AT TIME ZONE 'Asia/Kuala_Lumpur' -- exclusive upper bound (includes all of 2020-11-20);
      AND    e.division_id = 1
      ORDER  BY a.employee_id, a.created_at, a.activity_type  -- optional to guarantee sorted arrays
   ) sub
   GROUP  BY 1, 2
   ) a USING (the_date, employee_id)
ORDER  BY 1, 2;

db<>fiddle here

Note that my query outputs local date and time for Asia/Kuala_Lumpur:

test=> SELECT timestamptz '2020-11-20 08:52:01 +0' AT TIME ZONE 'Asia/Kuala_Lumpur' AS local_ts;
      local_ts       
---------------------
 2020-11-20 16:52:01

Where to start? You need to understand the concepts of time zones and the Postgres data types timestamp with time zone (timestamptz) vs. timestamp without time zone (timestamp). Else, it will be confusion without end. Start here:

  • Ignoring time zones altogether in Rails and PostgreSQL

Most notably, timestamptz does not store a time zone:

  • Time zone storage in data type "timestamp with time zone"

When simply casting timestamptz to date or timestamp, the current time zone setting of the session is assumed. Not what you want. Provide a time zone explicitly with the AT TIME ZONE construct to avoid this pifall. In your fiddle you have both:

  ...
  , generate_series(
        startdate::timestamp AT TIME ZONE 'Asia/Kuala_Lumpur', 
        enddate::timestamp AT TIME ZONE 'Asia/Kuala_Lumpur', 
        interval '1 day') g(d)
   ...

Also not doing what you want. After the (faulty!) cast to timestamp, the AT TIME ZONE construct translates the values back to timestamptz.

Also, your query generates the complete Cartesian Product of all users and and the maximum range of days in the the table attendance, only to reduce it back to a single day with:

    where created_at >= timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur'
    and   created_at <  timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur' + interval '1 day'

The WHERE clause finally does what it's supposed to do. But it makes no sense to first generate the full range of days, only to throw away most of it. (Seems you copied that from my other fiddle in the meantime?)

I commented out the WHERE clause and kept an optimized version of your generate_series() in my query as proof of concept. Further reading:

  • Generating time series between two dates in PostgreSQL


来源:https://stackoverflow.com/questions/65278993/handling-of-generate-series-in-queries-with-date-or-timestamp-with-without-t

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!