Fetch rows based on condition

问题

I am using PostgreSQL on Amazon Redshift.

My table is :

drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100),start timestamp,end1 timestamp);
insert into APP_Tax values('AFH','2018-01-26 00:39:51','2018-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34')  --row y

Expected output:

   'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
   'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'

I had to compare start and endtime between alternate records and if the timedifference < 10 seconds get the next record endtime till last or final record.

I,e datediff(seconds,2018-01-26 00:39:55,2018-01-26 00:39:56) Is <10 seconds

I tried this :

SELECT a.app_nm
    ,min(a.start)
    ,max(b.end1)
FROM APP_Tax a
INNER JOIN APP_Tax b
    ON a.APP_nm = b.APP_nm
        AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
GROUP BY 1

It works but it doesn't return row y when conditions fails.

回答1:

There are two reasons that row y is not returned is due to the condition:

b.start > a.start means that a row will never join with itself
The GROUP BY will return only one record per APP_nm value, yet all rows have the same value.

However, there are further logic errors in the query that will not successfully handle. For example, how does it know when a "new" session begins?

The logic you seek can be achieved in normal PostgreSQL with the help of a DISTINCT ON function, which shows one row per input value in a specific column. However, DISTINCT ON is not supported by Redshift.

Some potential workarounds: DISTINCT ON like functionality for Redshift

The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to apply to an SQL query (which is designed to operate on rows of results). I would recommend extracting the data and running it through a simple script (eg in Python) that could then output the Start & End combinations you seek.

This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.

回答2:

Sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using Windows Functions.

The complete solution might look like this:

SELECT
  start AS session_start,
  session_end
FROM (
       SELECT
         start,
         end1,
         lead(end1, 1)
         OVER (
           ORDER BY end1) AS session_end,
         session_boundary
       FROM (
              SELECT
                start,
                end1,
                CASE WHEN session_switch = 0 AND reverse_session_switch = 1
                  THEN 'start'
                ELSE 'end' END AS session_boundary
              FROM (
                     SELECT
                       start,
                       end1,
                       CASE WHEN datediff(seconds, end1, lead(start, 1)
                       OVER (
                         ORDER BY end1 ASC)) > 10
                         THEN 1
                       ELSE 0 END AS session_switch,
                       CASE WHEN datediff(seconds, lead(end1, 1)
                       OVER (
                         ORDER BY end1 DESC), start) > 10
                         THEN 1
                       ELSE 0 END AS reverse_session_switch
                     FROM app_tax
                   )
                AS sessioned
              WHERE session_switch != 0 OR reverse_session_switch != 0
              UNION
              SELECT
                start,
                end1,
                'start'
              FROM (
                     SELECT
                       start,
                       end1,
                       row_number()
                       OVER (PARTITION BY APP_nm
                         ORDER BY end1 ASC) AS row_num
                     FROM APP_Tax
                   ) AS with_row_number
              WHERE row_num = 1
            ) AS with_boundary
     ) AS with_end
WHERE session_boundary = 'start'
ORDER BY start ASC
;

Here is the breadkdown (by subquery name):

sessioned - we first identify the switch rows (out and in), the rows in which the duration between end and start exceeds limit.
with_row_number - just a patch to extract the first row because there is no switch into it (there is an implicit switch that we record as 'start')
with_boundary - then we identify the rows where specific switches occur. If you run the subquery by itself it is clear that session start when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions so are ignored.
with_end - finally, we combine the end/start of 'start'/'end' rows into (thus defining session duration), and remove the end rows

with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result which is the session duration.

来源：https://stackoverflow.com/questions/35942628/fetch-rows-based-on-condition

标签

sql

postgresql

amazon-redshift