Aggregation by timestamp

Submitted by 家住魔仙堡 on 2019-12-07 17:04:28

Question


I have a query that returns customer ID numbers, marketing channel, timestamp, and purchase date. The results might look something like this:

id marketingChannel TimeStamp      Transaction_date
1  SEO              5/18 23:11:43  5/18
1  SEO              5/18 24:12:43  5/18
1  Paid             5/18 24:13:43  5/18
2  Paid             5/18 24:12:43  5/18
2  Paid             5/18 24:14:43  5/18
2  Affiliate        5/18 24:20:43  5/18
2  Paid             5/18 24:22:43  5/18
3  SEO              5/18 24:10:43  5/18
3  Affiliate        5/18 24:11:43  5/18

I'm wondering if there is a query that aggregates this information to show the count of each marketing path.

For example:

Marketing Path                  Count
SEO > SEO > Paid                  1
Paid > Paid > Affiliate > Paid    1
SEO > Affiliate                   1

I'm thinking about writing a Python script to get this information, but I'm wondering if there is a simple solution in SQL, as I'm not as familiar with SQL.


Answer 1:


Some years ago I needed a similar result and tested different ways to build a concatenated string in Teradata. Btw, all of them might fail if the number of rows is too high and the concatenated string exceeds 64000 chars.

The most efficient was a User Defined Function (written in C):

SELECT
   PATH
  ,COUNT(*)
FROM
 (
   SELECT 
      DelimitedBuildSorted(MARKETINGCHANNEL
                          ,CAST(CAST(ts AS FORMAT 'yyyymmddhhmiss') AS VARCHAR(14))
                          ,'>') AS PATH
   FROM t
   GROUP BY id
 ) AS dt
GROUP BY 1;

If you need to run that query frequently and/or on a large table, you might ask your DBA whether a UDF is possible (most DBAs don't like them as they're written in a language they don't know, C).
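
For comparison, here is roughly what that query computes, sketched outside the database. This is only a minimal Python sketch, not the UDF itself: the `count_paths` name, the `rows` list, and the sortable timestamp strings are assumptions made up for illustration, and the separator is written as ' > ' to match the question's expected output.

```python
from collections import Counter, defaultdict

def count_paths(rows, sep=" > "):
    """Count marketing paths; rows are (id, channel, timestamp) tuples."""
    per_id = defaultdict(list)
    for cust_id, channel, ts in rows:
        per_id[cust_id].append((ts, channel))
    paths = Counter()
    for events in per_id.values():
        events.sort(key=lambda e: e[0])  # order each customer's events by timestamp
        paths[sep.join(ch for _, ch in events)] += 1
    return paths

# Hypothetical rows shaped like the question's sample (timestamps made sortable).
rows = [
    (1, "SEO", "2014-05-18 23:11:43"),
    (1, "SEO", "2014-05-19 00:12:43"),
    (1, "Paid", "2014-05-19 00:13:43"),
    (2, "Paid", "2014-05-19 00:12:43"),
    (2, "Paid", "2014-05-19 00:14:43"),
    (2, "Affiliate", "2014-05-19 00:20:43"),
    (2, "Paid", "2014-05-19 00:22:43"),
    (3, "SEO", "2014-05-19 00:10:43"),
    (3, "Affiliate", "2014-05-19 00:11:43"),
]

for path, n in sorted(count_paths(rows).items()):
    print(path, n)
```

This pulls every row to the client, so it is probably only reasonable when the result of the base query is small.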

Recursion might be ok if the average number of rows per id is low. Joseph B's version can be simplified a bit, but the most important thing is to create a temporary table instead of using a View or Derived Table for the ROW_NUMBER calculation. This results in a better plan (in SQL Server, too):

CREATE VOLATILE TABLE vt AS 
 (
   SELECT
      id
     ,MarketingChannel
     ,ROW_NUMBER() OVER (PARTITION BY id ORDER BY TS DESC) AS rn
     ,COUNT(*) OVER (PARTITION BY id) AS max_rn
   FROM t
 ) WITH DATA 
PRIMARY INDEX (id) 
ON COMMIT PRESERVE ROWS;

WITH RECURSIVE cte(id, path, rn) AS
 (
   SELECT 
      id, 

      -- modify VARCHAR size to fit your maximum number of rows, that's better than VARCHAR(64000)
      CAST(MarketingChannel AS VARCHAR(10000)) AS PATH, 
      rn
   FROM vt
   WHERE rn = max_rn
   UNION ALL
   SELECT 
      cte.ID, 
      cte.PATH || '>' || vt.MarketingChannel, 
      cte.rn-1
   FROM vt JOIN cte
     ON vt.id = cte.id
    AND vt.rn = cte.rn - 1
 )
SELECT 
   PATH, 
   COUNT(*) 
FROM cte
WHERE rn = 1
GROUP BY path
ORDER BY PATH
;
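
If you want to try the recursive logic without a Teradata system, the same idea can be run against SQLite from Python. This is just a sketch under assumptions: it uses a plain CTE (`numbered`, a name made up here) instead of the volatile table, so it says nothing about the query plan, the sample rows and their timestamps are invented, and it needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, MarketingChannel TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        # Hypothetical rows shaped like the question's sample.
        (1, "SEO", "2014-05-18 23:11:43"),
        (1, "SEO", "2014-05-19 00:12:43"),
        (1, "Paid", "2014-05-19 00:13:43"),
        (2, "Paid", "2014-05-19 00:12:43"),
        (2, "Paid", "2014-05-19 00:14:43"),
        (2, "Affiliate", "2014-05-19 00:20:43"),
        (2, "Paid", "2014-05-19 00:22:43"),
        (3, "SEO", "2014-05-19 00:10:43"),
        (3, "Affiliate", "2014-05-19 00:11:43"),
    ],
)

query = """
WITH RECURSIVE
numbered AS (
    -- rn = 1 is the latest row, rn = max_rn the earliest one
    SELECT id, MarketingChannel,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn,
           COUNT(*)     OVER (PARTITION BY id)                  AS max_rn
    FROM t
),
cte(id, path, rn) AS (
    -- start at the earliest row of each id
    SELECT id, MarketingChannel, rn FROM numbered WHERE rn = max_rn
    UNION ALL
    -- append the next later row until rn = 1 is reached
    SELECT cte.id, cte.path || '>' || numbered.MarketingChannel, cte.rn - 1
    FROM numbered JOIN cte
      ON numbered.id = cte.id AND numbered.rn = cte.rn - 1
)
SELECT path, COUNT(*) FROM cte WHERE rn = 1 GROUP BY path ORDER BY path
"""

path_counts = conn.execute(query).fetchall()
for path, n in path_counts:
    print(path, n)
```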

You might also try old-school MAX(CASE), which handles a fixed maximum number of rows per id (32 here):

SELECT
   PATH
  ,COUNT(*)
FROM
 (
   SELECT
      id
     ,MAX(CASE WHEN rnk =  0 THEN MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  1 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  2 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  3 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  4 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  5 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  6 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  7 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  8 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  9 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 10 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 11 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 12 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 13 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 14 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 15 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 16 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 17 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 18 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 19 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 20 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 21 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 22 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 23 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 24 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 25 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 26 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 27 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 28 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 29 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 30 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 31 THEN '>' || MarketingChannel ELSE '' END) AS PATH
   FROM
    (
     SELECT
        id
       ,TRIM(MarketingChannel) AS MarketingChannel
       ,RANK() OVER (PARTITION BY id
                     ORDER BY TS) -1 AS rnk
     FROM t
    ) dt
   GROUP BY 1
 ) AS dt
GROUP BY 1;
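
Nobody wants to type those 32 lines by hand. A small generator can write the expression for any number of slots; this is a hypothetical helper (the `max_case_concat` name and its parameters are made up for illustration), and it only produces the SQL text, it does not run anything.

```python
def max_case_concat(slots, column="MarketingChannel", rank="rnk", sep=">"):
    """Build the '||'-joined MAX(CASE ...) expression for ranks 0..slots-1."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    width = len(str(slots - 1))  # right-align the rank numbers
    terms = []
    for i in range(slots):
        # the first slot has no leading separator
        value = column if i == 0 else f"'{sep}' || {column}"
        terms.append(
            f"MAX(CASE WHEN {rank} = {i:>{width}} THEN {value} ELSE '' END)"
        )
    return " ||\n      ".join(terms)

print(max_case_concat(3))
```

Calling `max_case_concat(32)` should reproduce the list of terms used in the query above; append `AS PATH` yourself.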

I once had to concatenate up to 2048 rows with 30 chars each :-) That works by nesting the same pattern (16 x 16 x 8 = 2048):

SELECT
   PATH
  ,COUNT(*)
FROM
 (
   SELECT
      id
     ,MAX(CASE WHEN rnk MOD 16 = 0 THEN path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 1 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 2 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 3 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 4 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 5 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 6 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 7 THEN '>' || path ELSE '' END) AS PATH
   FROM
    (
     SELECT
        id
       ,rnk / 16 AS rnk
       ,MAX(CASE WHEN rnk MOD 16 =  0 THEN path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  1 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  2 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  3 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  4 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  5 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  6 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  7 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  8 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  9 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 10 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 11 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 12 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 13 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 14 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 15 THEN '>' || path ELSE '' END) AS path
     FROM
      (
       SELECT
          id
         ,rnk / 16 AS rnk
         ,MAX(CASE WHEN rnk MOD 16 =  0 THEN path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  1 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  2 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  3 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  4 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  5 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  6 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  7 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  8 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  9 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 10 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 11 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 12 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 13 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 14 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 15 THEN '>' || path ELSE '' END) AS path
       FROM
        (
         SELECT
            id
           ,TRIM(MarketingChannel) AS PATH
           ,RANK() OVER (PARTITION BY id
                         ORDER BY TS) -1 AS rnk
         FROM t
        ) dt
       GROUP BY 1,2
      ) dt
     GROUP BY 1,2
    ) dt
   GROUP BY 1
 ) dt
GROUP BY 1
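
The nesting works because each level splits the rank into a quotient (the new group) and a remainder (the position within the group), so the order survives every merge. A minimal Python sketch of that arithmetic, with made-up names (`merge_level`, `concat_nested`) and made-up input labels; note that the SQL's outermost level only has 8 slots, so the query stops at 2048 rows per id while this sketch would accept more:

```python
from collections import defaultdict

def merge_level(items, fanout=16, sep=">"):
    """One nesting level: group by rnk // fanout, order by rnk % fanout, join."""
    groups = defaultdict(dict)
    for rnk, path in items:
        groups[rnk // fanout][rnk % fanout] = path
    return [
        (group, sep.join(slots[pos] for pos in sorted(slots)))
        for group, slots in sorted(groups.items())
    ]

def concat_nested(channels, levels=3, fanout=16):
    """Apply merge_level repeatedly, mimicking the nested derived tables."""
    items = list(enumerate(channels))  # rnk starts at 0, as with RANK() - 1
    for _ in range(levels):
        items = merge_level(items, fanout)
    return items

# Hypothetical input: 2048 distinct labels for a single id (16 * 16 * 8).
channels = [f"c{i}" for i in range(2048)]
result = concat_nested(channels)
print(len(result), result[0][1][:20])
```

After the first level there are 128 partial strings, after the second 8, and the third level joins those into one string per id.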



Answer 2:


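Here is a query that has been tested with SQL Server. The same approach should work with Teradata as well, but the syntax needs small adjustments: SQL Server does not accept the RECURSIVE keyword, while Teradata uses || rather than + for concatenation and needs a sized VARCHAR instead of VARCHAR(MAX).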

EDIT:

Converted multiple CTEs to a single CTE:

WITH RECURSIVE Single_Path (CURRENT_ID, CURRENT_PATH, CURRENT_TS, rn) AS
(
  SELECT 
    ID CURRENT_ID, 
    CAST(MARKETINGCHANNEL AS VARCHAR(MAX)) CURRENT_PATH, 
    TIMESTAMP CURRENT_TS, 
    1 RN
  FROM 
  (
    SELECT 
      id, 
      marketingChannel, 
      TimeStamp, 
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY TimeStamp DESC) rn
    FROM T
  ) Ordered_Data
  WHERE RN = 1
  UNION ALL
  SELECT 
    ID, 
    CAST(MARKETINGCHANNEL + ' > ' + CURRENT_PATH AS VARCHAR(MAX)), 
    TIMESTAMP, 
    sp.rn+1
  FROM 
  (
    SELECT 
      id, 
      marketingChannel, 
      TimeStamp, 
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY TimeStamp DESC) rn
    FROM T
  ) od, Single_Path sp
  WHERE od.id = sp.Current_id
  AND od.rn = sp.rn + 1
)
SELECT 
  sp2.CURRENT_PATH MARKETING_PATH, 
  COUNT(*) COUNT
FROM Single_Path sp2
INNER JOIN 
(
  SELECT 
    ID, 
    COUNT(*) max_rn  -- the row count per id equals the highest rn; Ordered_Data is not visible here
  FROM T
  GROUP BY ID
) MR
ON SP2.CURRENT_ID = MR.ID AND SP2.RN = MR.MAX_RN
GROUP BY SP2.CURRENT_PATH
ORDER BY sp2.CURRENT_PATH;

SQL Fiddle demo

References:

Fun with Recursive SQL (Part 1) on Sharpening Stones blog




Answer 3:


Assuming MySQL:

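select
   path, count(*)
from (
   select
      id,
      -- order by the timestamp, otherwise the order inside the path is not guaranteed
      group_concat(marketingChannel order by `TimeStamp` separator ' > ') as path
   from t
   group by id
) sq
group by path;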
  • see it working live in an sqlfiddle


Source: https://stackoverflow.com/questions/23741925/aggregation-by-timestamp
