How do I grab the “next” event when the offset is variable for items that can be repeatedly processed?

半世苍凉 提交于 2019-12-25 03:55:40

问题


This question is virtually identical to another I recently asked, with the very important distinction that these transactions are loan transactions and, therefore, items may reappear in the data multiple times. This is why I'm currently using LEAD. With that clarification, I repost my question below.

I have a table of transactions in an Oracle database. I am attempting to pull a report together for a delivery system involving a number of transaction types. The "request" type can actually be one of four sub-types ('A', 'B', 'C', and 'D' for this example), and the "delivery" type can be one of four different sub-types ('PULL', 'PICKUP', 'MAIL'). There can be anywhere from 1 to 5 transactions to get an item from "request" to "delivery, and a number of the "delivery" types are also intermediary transactions. Example:

Item | Transaction | Timestamp
001  | REQ-A       | 2014-07-31T09:51:32Z
002  | REQ-B       | 2014-07-31T09:55:53Z
003  | REQ-C       | 2014-07-31T10:01:15Z
004  | REQ-D       | 2014-07-31T10:02:29Z
005  | REQ-A       | 2014-07-31T10:05:47Z
002  | PULL        | 2014-07-31T10:20:04Z
002  | MAIL        | 2014-07-31T10:20:06Z
001  | PULL        | 2014-07-31T10:22:21Z
001  | TRANSFER    | 2014-07-31T10:22:23Z
003  | PULL        | 2014-07-31T10:24:10Z
003  | TRANSFER    | 2014-07-31T10:24:12Z
004  | PULL        | 2014-07-31T10:26:28Z
005  | PULL        | 2014-07-31T10:28:42Z
005  | TRANSFER    | 2014-07-31T10:28:44Z
001  | ARRIVE      | 2014-07-31T11:45:01Z
001  | PICKUP      | 2014-07-31T11:45:02Z
003  | ARRIVE      | 2014-07-31T11:47:44Z
003  | PICKUP      | 2014-07-31T11:47:45Z
005  | ARRIVE      | 2014-07-31T11:49:45Z
005  | PICKUP      | 2014-07-31T11:49:46Z

What I need is a report like:

Item | Start Tx | End Tx | Time
001  | REQ-A    | PICKUP | 1:53:30
002  | REQ-B    | MAIL   | 0:24:13
003  | REQ-C    | PICKUP | 1:46:30
004  | REQ-D    | PULL   | 0:23:59
005  | REQ-A    | PICKUP | 1:43:59

What I have:

Item | Start Tx | End Tx   | Time
001  | REQ-A    | PULL     | 0:30:49
001  | REQ-A    | TRANSFER | 0:30:51
001  | REQ-A    | ARRIVE   | 1:53:29
001  | REQ-A    | PICKUP   | 1:53:30
002  | REQ-B    | PULL     | 0:24:11
002  | REQ-B    | MAIL     | 0:24:13
003  | REQ-C    | PULL     | 0:22:55
003  | REQ-C    | TRANSFER | 0:22:57
003  | REQ-C    | ARRIVE   | 1:46:29
003  | REQ-C    | PICKUP   | 1:46:30
004  | REQ-D    | PULL     | 0:23:59
005  | REQ-A    | PULL     | 0:22:55
005  | REQ-A    | TRANSFER | 0:22:57
005  | REQ-A    | ARRIVE   | 1:43:58
005  | REQ-A    | PICKUP   | 1:43:59

What I'm doing to get that data:

SELECT Item, Transaction, nextTransaction, nextTimestamp - Timestamp
FROM (
    SELECT Item, Transaction, Timestamp,
      LEAD(Transaction, 5) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTransaction"
      LEAD(Timestamp, 5) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTimestamp"
    FROM Transactions
    UNION ALL 
    SELECT Item, Transaction, Timestamp,
      LEAD(Transaction, 4) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTransaction"
      LEAD(Timestamp, 4) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTimestamp"
    FROM Transactions
    UNION ALL 
    SELECT Item, Transaction, Timestamp,
      LEAD(Transaction, 3) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTransaction"
      LEAD(Timestamp, 3) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTimestamp"
    FROM Transactions
    UNION ALL 
    SELECT Item, Transaction, Timestamp,
      LEAD(Transaction, 2) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTransaction"
      LEAD(Timestamp, 2) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTimestamp"
    FROM Transactions
    UNION ALL 
    SELECT Item, Transaction, Timestamp,
      LEAD(Transaction, 1) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTransaction"
      LEAD(Timestamp, 1) OVER (PARTITION BY Item ORDER BY Timestamp) AS "nextTimestamp"
    FROM Transactions
)
WHERE nextTransaction IS NOT NULL
AND Transaction IN ('REQ-A', 'REQ-B', 'REQ-C', 'REQ-D')

I could manually parse this in a script (and perhaps that's actually the best course of action), but for the sake of learning, I'd like to know if it's possible to actually do this with SQL alone.

To clarify the "loan" bit, there are other transactions in this table for returns and other forms of processing that are irrelevant to this report beyond existing as other transaction types. Once an item is returned, it can go through the request cycle again. As an example, for item 001, it could then follow item 002's cycle (REQ -> MAIL), it could then get a "Not on shelf" transaction, or a non-request loan, or a few other use cases. It could then go back through the REQ -> PICKUP cycle, or the REQ->PULL cycle.


回答1:


This is a gaps-and-islands problem, but the islands being defined by a REQ transaction make it a bit more complicated than some.

You could use nested lead and lag functions and some manipulation to get what you need:

select distinct item,
  coalesce(start_tran,
    lag(start_tran) over (partition by item order by timestamp)) as start_tran,
  coalesce(end_tran,
    lead(end_tran) over (partition by item order by timestamp)) as end_tran,
  coalesce(end_time, 
    lead(end_time) over (partition by item order by timestamp))
    - coalesce(start_time,
        lag(start_time) over (partition by item order by timestamp)) as time
from (
  select item, timestamp, start_tran, start_time, end_tran, end_time
  from (
    select item,
      timestamp,
      case when lag_tran is null or transaction like 'REQ%'
        then transaction end as start_tran,
      case when lag_tran is null or transaction like 'REQ%'
        then timestamp end as start_time,
      case when lead_tran is null or lead_tran like 'REQ%'
        then transaction end as end_tran,
      case when lead_tran is null or lead_tran like 'REQ%'
        then timestamp end as end_time
    from (
      select item, transaction, timestamp,
        lag(transaction)
          over (partition by item order by timestamp) as lag_tran,
        lead(transaction)
          over (partition by item order by timestamp) as lead_tran
      from transactions
    )
  )
  where start_tran is not null or end_tran is not null
)
order by item, start_tran;

With additional records for a second cycle for items 1 and 2 that could give:

      ITEM START_TRAN END_TRAN   TIME      
---------- ---------- ---------- -----------
         1 REQ-A      PICKUP     0 1:53:30.0 
         1 REQ-E      PICKUP     0 1:23:30.0 
         2 REQ-B      MAIL       0 0:24:13.0 
         2 REQ-F      REQ-F      0 0:0:0.0   
         3 REQ-C      PICKUP     0 1:46:30.0 
         4 REQ-D      PULL       0 0:23:59.0 
         5 REQ-A      PICKUP     0 1:43:59.0 

SQL Fiddle showing all the intermediate steps.

It's not quite as scary as it might look at first glance. The innermost query takes the raw data and adds an extra column for the lead and lag transactions. Taking just the first set of item-1 records that would be:

      ITEM TRANSACTION TIMESTAMP                LAG_TRAN   LEAD_TRAN
---------- ----------- ------------------------ ---------- ----------
         1 REQ-A       2014-07-31T09:51:32Z                PULL       
         1 PULL        2014-07-31T10:22:21Z     REQ-A      TRANSFER   
         1 TRANSFER    2014-07-31T10:22:23Z     PULL       ARRIVE     
         1 ARRIVE      2014-07-31T11:45:01Z     TRANSFER   PICKUP     
         1 PICKUP      2014-07-31T11:45:02Z     ARRIVE     REQ-E      

Notice REQ-E popping up as the last lead_tran? That's the first transaction for the second cycle of records for this item, and is going to be useful later. The next level of query uses those lead and lag values and treats REQ values as start and end markers, and uses that information to null out everything except the first and last record for each cycle.

      ITEM TIMESTAMP                START_TRAN START_TIME               END_TRAN   END_TIME               
---------- ------------------------ ---------- ------------------------ ---------- ------------------------
         1 2014-07-31T09:51:32Z     REQ-A      2014-07-31T09:51:32Z                                         
         1 2014-07-31T10:22:21Z                                                                             
         1 2014-07-31T10:22:23Z                                                                             
         1 2014-07-31T11:45:01Z                                                                             
         1 2014-07-31T11:45:02Z                                         PICKUP     2014-07-31T11:45:02Z     

The next level of query removes any rows which are not representing the start or end (or both - see REQ-F in the Fiddle) as we aren't interested in them:

      ITEM TIMESTAMP                START_TRAN START_TIME               END_TRAN   END_TIME               
---------- ------------------------ ---------- ------------------------ ---------- ------------------------
         1 2014-07-31T09:51:32Z     REQ-A      2014-07-31T09:51:32Z                                         
         1 2014-07-31T11:45:02Z                                         PICKUP     2014-07-31T11:45:02Z     

We now have pairs of rows for each cycle (or a single row for REQ-F). The final level uses lead and lag again to fill in the blanks; if the start_tran is null then this is an end-row and we should use the previous row's start data; if end_tran is null then this is a start-row and we should use the next row's end data.

  ITEM START_TRAN START_TIME               END_TRAN   END_TIME                 TIME      

     1 REQ-A      2014-07-31T09:51:32Z     PICKUP     2014-07-31T11:45:02Z     0 1:53:30.0 
     1 REQ-A      2014-07-31T09:51:32Z     PICKUP     2014-07-31T11:45:02Z     0 1:53:30.0 

That makes both rows the same, so the distinct removes the duplicates.




回答2:


This should give you the same result. I am reproducing Gordon's answer and it still holds good

select item,
       min(transaction) keep (dense_rank first order by timestamp) as StartTx, 
       min(transaction) keep (dense_rank last order by timestamp) as EndTx,
       max(timestamp) - min(timestamp)
from transactions t
group by item;

Even though you have duplicates in the txn, it would be handled by the analytic function. See the documentation for dense rank and the keyword keep.



来源:https://stackoverflow.com/questions/25067292/how-do-i-grab-the-next-event-when-the-offset-is-variable-for-items-that-can-be

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!