Multiple Self-Join based on GROUP BY results

Posted by 牧云@^-^@ on 2020-01-09 08:17:57

Question


I'm attempting to collect details about backup activity from a PostgreSQL DB table on a backup appliance (Avamar). The table has several columns, including client_name, dataset, plugin_name, type, completed_ts, status_code, bytes_modified and more. Simplified example:

| session_id | client_name | dataset |         plugin_name |             type |         completed_ts | status_code | bytes_modified |
|------------|-------------|---------|---------------------|------------------|----------------------|-------------|----------------|
|          1 |    server01 | Windows | Windows File System | Scheduled Backup | 2017-12-05T01:00:00Z |       30900 |       11111111 |
|          2 |    server01 | Windows | Windows File System | Scheduled Backup | 2017-12-04T01:00:00Z |       30000 |       22222222 |
|          3 |    server01 | Windows | Windows File System | Scheduled Backup | 2017-12-03T01:00:00Z |       30000 |       22222222 |
|          4 |    server01 | Windows | Windows File System | Scheduled Backup | 2017-12-02T01:00:00Z |       30000 |       22222222 |
|          5 |    server01 | Windows |         Windows VSS | Scheduled Backup | 2017-12-01T01:00:00Z |       30000 |       33333333 |
|          6 |    server02 | Windows | Windows File System | Scheduled Backup | 2017-12-05T02:00:00Z |       30000 |       44444444 |
|          7 |    server02 | Windows | Windows File System | Scheduled Backup | 2017-12-04T02:00:00Z |       30900 |       55555555 |
|          8 |    server03 | Windows | Windows File System | On-Demand Backup | 2017-12-05T03:00:00Z |       30000 |       66666666 |
|          9 |    server04 | Windows | Windows File System |         Validate | 2017-12-05T03:00:00Z |       30000 |       66666666 |

Each client_name (server) can have multiple datasets, and each dataset can have multiple plugin_names. So I have created a SQL statement that does a GROUP BY on these three columns to get a list of "job" activity over time. (http://sqlfiddle.com/#!15/f15556/1)

select
  client_name,
  dataset,
  plugin_name
from v_activities_2
where
  type like '%Backup%'
group by
  client_name, dataset, plugin_name

Each of these jobs can succeed or fail, as recorded in the status_code column. Using a self-join with a subquery, I'm able to get the Last Good backup along with its completed_ts (completed time), bytes_modified and more: (http://sqlfiddle.com/#!15/f15556/16)

select
  a2.client_name,
  a2.dataset,
  a2.plugin_name,
  a2.LastGood,
  a3.status_code,
  a3.bytes_modified as LastGood_bytes
from v_activities_2 a3

join (
  select
    client_name,
    dataset,
    plugin_name,
    max(completed_ts) as LastGood
  from v_activities_2 a2
  where
    type like '%Backup%'
    and status_code in (30000,30005)   -- Successful (Good) Status codes
  group by
    client_name, dataset, plugin_name
) as a2
on a3.client_name  = a2.client_name and
   a3.dataset      = a2.dataset and
   a3.plugin_name  = a2.plugin_name and
   a3.completed_ts = a2.LastGood

I can do the same thing separately to get the Last Attempt details by removing the status_code line from the WHERE clause: http://sqlfiddle.com/#!15/f15556/3. Note that most of the time LastGood and LastAttempt are the same row, but sometimes they are not, depending on whether the last backup was successful.

What I'm having problems with is merging these two statements together (if possible), so that I get this result:

| client_name | dataset |         plugin_name |             lastgood |  lastgood_bytes |          lastattempt | lastattempt_bytes |
|-------------|---------|---------------------|----------------------|-----------------|----------------------|-------------------|
|    server01 | Windows | Windows File System | 2017-12-04T01:00:00Z |        22222222 | 2017-12-05T01:00:00Z |          11111111 |
|    server01 | Windows |         Windows VSS | 2017-12-01T01:00:00Z |        33333333 | 2017-12-01T01:00:00Z |          33333333 |
|    server02 | Windows | Windows File System | 2017-12-05T02:00:00Z |        44444444 | 2017-12-05T02:00:00Z |          44444444 |
|    server03 | Windows | Windows File System | 2017-12-05T03:00:00Z |        66666666 | 2017-12-05T03:00:00Z |          66666666 |

I attempted just adding another RIGHT JOIN to the end (http://sqlfiddle.com/#!15/f15556/4) but got NULL rows. After doing some reading, I see that the first two JOINs run first, creating a temporary table before the 2nd join occurs, but at that point the data I need is lost, so I get NULL rows.

Using PostgreSQL 8 via groovy scripting. I also only have read-only access to the DB.


Answer 1:


You apparently have two intermediate inner join output tables and you want to get columns from each about some things identified by a common key. So inner join them on the key.

select
  g.client_name,
  g.dataset,
  g.plugin_name,
  LastGood,
  g.status_code,
  LastGood_bytes,
  LastAttempt,
  l.status_code,
  LastAttempt_bytes
from
( -- cut & pasted Last Good http://sqlfiddle.com/#!15/f15556/16
    select
      a2.client_name,
      a2.dataset,
      a2.plugin_name,
      a2.LastGood,
      a3.status_code,
      a3.bytes_modified as LastGood_bytes
    from v_activities_2 a3
    join (
      select
        client_name,
        dataset,
        plugin_name,
        max(completed_ts) as LastGood
      from v_activities_2 a2
      where
        type like '%Backup%'
        and status_code in (30000,30005)   -- Successful (Good) Status codes
      group by
        client_name, dataset, plugin_name
    ) as a2
    on a3.client_name  = a2.client_name and
       a3.dataset      = a2.dataset and
       a3.plugin_name  = a2.plugin_name and
       a3.completed_ts = a2.LastGood
) as g
join 
( -- cut & pasted Last Attempt http://sqlfiddle.com/#!15/f15556/3
    select
      a1.client_name,
      a1.dataset,
      a1.plugin_name,
      a1.LastAttempt,
      a3.status_code,
      a3.bytes_modified as LastAttempt_bytes
    from v_activities_2 a3
    join (
      select
        client_name,
        dataset,
        plugin_name,
        max(completed_ts) as LastAttempt
      from v_activities_2 a2
      where
        type like '%Backup%'
      group by
        client_name, dataset, plugin_name
    ) as a1
    on a3.client_name  = a1.client_name and
       a3.dataset      = a1.dataset and
       a3.plugin_name  = a1.plugin_name and
       a3.completed_ts = a1.LastAttempt
) as l
on l.client_name  = g.client_name and
   l.dataset      = g.dataset and
   l.plugin_name  = g.plugin_name
order by client_name, dataset, plugin_name
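
As a sanity check, here is a minimal sketch that loads the sample rows from the question into an in-memory SQLite database and runs the same "inner join the two intermediate results on the key" idea. SQLite stands in for PostgreSQL purely so the example is self-contained. The helper names make_db, last_per_job and merged_report are made up for illustration, and the status columns are left out for brevity.

```python
import sqlite3

# Sample rows copied from the question's table:
# (session_id, client_name, dataset, plugin_name, type, completed_ts, status_code, bytes_modified)
ROWS = [
    (1, "server01", "Windows", "Windows File System", "Scheduled Backup", "2017-12-05T01:00:00Z", 30900, 11111111),
    (2, "server01", "Windows", "Windows File System", "Scheduled Backup", "2017-12-04T01:00:00Z", 30000, 22222222),
    (3, "server01", "Windows", "Windows File System", "Scheduled Backup", "2017-12-03T01:00:00Z", 30000, 22222222),
    (4, "server01", "Windows", "Windows File System", "Scheduled Backup", "2017-12-02T01:00:00Z", 30000, 22222222),
    (5, "server01", "Windows", "Windows VSS", "Scheduled Backup", "2017-12-01T01:00:00Z", 30000, 33333333),
    (6, "server02", "Windows", "Windows File System", "Scheduled Backup", "2017-12-05T02:00:00Z", 30000, 44444444),
    (7, "server02", "Windows", "Windows File System", "Scheduled Backup", "2017-12-04T02:00:00Z", 30900, 55555555),
    (8, "server03", "Windows", "Windows File System", "On-Demand Backup", "2017-12-05T03:00:00Z", 30000, 66666666),
    (9, "server04", "Windows", "Windows File System", "Validate", "2017-12-05T03:00:00Z", 30000, 66666666),
]


def make_db():
    """Create an in-memory table shaped like the question's v_activities_2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table v_activities_2 (session_id integer, client_name text,"
        " dataset text, plugin_name text, type text, completed_ts text,"
        " status_code integer, bytes_modified integer)"
    )
    conn.executemany("insert into v_activities_2 values (?,?,?,?,?,?,?,?)", ROWS)
    return conn


def last_per_job(extra_filter=""):
    """SQL text for the newest backup row per (client_name, dataset, plugin_name)."""
    return f"""
    select a.client_name as client_name, a.dataset as dataset,
           a.plugin_name as plugin_name, a.completed_ts as completed_ts,
           a.bytes_modified as bytes_modified
    from v_activities_2 a
    join (
      select client_name, dataset, plugin_name, max(completed_ts) as last_ts
      from v_activities_2
      where type like '%Backup%' {extra_filter}
      group by client_name, dataset, plugin_name
    ) m
      on a.client_name = m.client_name and a.dataset = m.dataset
     and a.plugin_name = m.plugin_name and a.completed_ts = m.last_ts
    """


def merged_report(conn):
    """Inner join 'last good' (g) and 'last attempt' (l) on the job key."""
    good = last_per_job("and status_code in (30000, 30005)")
    attempt = last_per_job()
    sql = f"""
    select g.client_name, g.dataset, g.plugin_name,
           g.completed_ts as lastgood, g.bytes_modified as lastgood_bytes,
           l.completed_ts as lastattempt, l.bytes_modified as lastattempt_bytes
    from ({good}) as g
    join ({attempt}) as l
      on l.client_name = g.client_name and l.dataset = g.dataset
     and l.plugin_name = g.plugin_name
    order by g.client_name, g.dataset, g.plugin_name
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    for row in merged_report(make_db()):
        print(row)
```

On the sample data this should give the four rows from the desired result table in the question. server04 is absent because its only row is a Validate, which the type filter excludes.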

The combined query uses one of the applicable approaches from "Strange duplicate behavior from GROUP_CONCAT of two LEFT JOINs of GROUP_BYs". The correspondence between the two pieces of code might not be obvious, though: that answer's intermediate joins are LEFT JOINs where yours are inner joins, and its GROUP_CONCAT plays the role of your max. (That answer lists more approaches because of particulars of GROUP_CONCAT and its query.) Two of them, quoted from that answer:

A correct symmetrical INNER JOIN approach: LEFT JOIN q1 & q2--1:many--then GROUP BY & GROUP_CONCAT (which is what your first query did); then separately similarly LEFT JOIN q1 & q3--1:many--then GROUP BY & GROUP_CONCAT; then INNER JOIN the two results ON user_id--1:1.

A correct cumulative LEFT JOIN approach: JOIN q1 & q2--1:many--then GROUP BY & GROUP_CONCAT; then left join that & q3--1:many--then GROUP BY & GROUP_CONCAT.

Whether this actually serves your purpose in general depends on your actual specification and constraints. Even if the two joins you link are what you want you need to explain exactly what you mean by "merge". You don't say what you want if the joins have different sets of values for the grouped columns. Force yourself to use the English language to say what rows go in the result based on what rows are in the input.

PS 1 You have undocumented/undeclared/unenforced constraints. Please declare when possible. Otherwise enforce by triggers. Document in question text if not in code. Constraints are fundamental to multiple subrow value instances in join & to group by.
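
One constraint the queries above appear to rely on (this is my reading, not something stated in the question) is that a job never has two rows with the same completed_ts. If it did, joining back on max(completed_ts) would return more than one row per job. With read-only access you can at least test for that. A minimal sketch, using an in-memory SQLite table in place of PostgreSQL, with a hypothetical session 10 added to create a duplicate:

```python
import sqlite3

# Two rows from the question plus one HYPOTHETICAL row (session 10) that
# shares a job key and completed_ts with session 2.
ROWS = [
    (1, "server01", "Windows", "Windows File System", "Scheduled Backup", "2017-12-05T01:00:00Z", 30900, 11111111),
    (2, "server01", "Windows", "Windows File System", "Scheduled Backup", "2017-12-04T01:00:00Z", 30000, 22222222),
    (10, "server01", "Windows", "Windows File System", "On-Demand Backup", "2017-12-04T01:00:00Z", 30000, 99999999),
]


def make_db(rows):
    """Create an in-memory table shaped like the question's v_activities_2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table v_activities_2 (session_id integer, client_name text,"
        " dataset text, plugin_name text, type text, completed_ts text,"
        " status_code integer, bytes_modified integer)"
    )
    conn.executemany("insert into v_activities_2 values (?,?,?,?,?,?,?,?)", rows)
    return conn


def find_duplicate_job_timestamps(conn):
    """Return (job key, completed_ts, count) groups that occur more than once."""
    sql = """
    select client_name, dataset, plugin_name, completed_ts, count(*) as n
    from v_activities_2
    group by client_name, dataset, plugin_name, completed_ts
    having count(*) > 1
    order by client_name, dataset, plugin_name, completed_ts
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(find_duplicate_job_timestamps(make_db(ROWS)))
```

An empty result means the assumed constraint holds for the current data. It does not mean the constraint is enforced.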

PS 2 Learn the syntax/semantics for select. Learn what left/right outer join on returns: what inner join on does, plus unmatched left/right table rows extended by nulls.
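
To make that concrete for this question: if a job has attempts but no good backup, an inner join between "last attempt" and "last good" drops the job, while a left join keeps it with nulls in the last-good columns. A minimal sketch, again using in-memory SQLite in place of PostgreSQL. server05 is a hypothetical client that has only failed, and last_per_job and report are made-up helper names:

```python
import sqlite3

# Two rows from the question plus one HYPOTHETICAL job (server05) whose only
# backup attempt failed (status_code 30900).
ROWS = [
    (1, "server01", "Windows", "Windows File System", "Scheduled Backup", "2017-12-05T01:00:00Z", 30900, 11111111),
    (2, "server01", "Windows", "Windows File System", "Scheduled Backup", "2017-12-04T01:00:00Z", 30000, 22222222),
    (11, "server05", "Windows", "Windows File System", "Scheduled Backup", "2017-12-05T04:00:00Z", 30900, 77777777),
]


def make_db():
    """Create an in-memory table shaped like the question's v_activities_2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table v_activities_2 (session_id integer, client_name text,"
        " dataset text, plugin_name text, type text, completed_ts text,"
        " status_code integer, bytes_modified integer)"
    )
    conn.executemany("insert into v_activities_2 values (?,?,?,?,?,?,?,?)", ROWS)
    return conn


def last_per_job(extra_filter=""):
    """SQL text for the newest backup row per (client_name, dataset, plugin_name)."""
    return f"""
    select a.client_name as client_name, a.dataset as dataset,
           a.plugin_name as plugin_name, a.completed_ts as completed_ts,
           a.bytes_modified as bytes_modified
    from v_activities_2 a
    join (
      select client_name, dataset, plugin_name, max(completed_ts) as last_ts
      from v_activities_2
      where type like '%Backup%' {extra_filter}
      group by client_name, dataset, plugin_name
    ) m
      on a.client_name = m.client_name and a.dataset = m.dataset
     and a.plugin_name = m.plugin_name and a.completed_ts = m.last_ts
    """


def report(conn, join_kind):
    """Join last attempt (l) to last good (g); join_kind is 'join' or 'left join'."""
    if join_kind not in ("join", "left join"):
        raise ValueError("join_kind must be 'join' or 'left join'")
    good = last_per_job("and status_code in (30000, 30005)")
    attempt = last_per_job()
    sql = f"""
    select l.client_name, l.dataset, l.plugin_name,
           g.completed_ts as lastgood, g.bytes_modified as lastgood_bytes,
           l.completed_ts as lastattempt, l.bytes_modified as lastattempt_bytes
    from ({attempt}) as l
    {join_kind} ({good}) as g
      on g.client_name = l.client_name and g.dataset = l.dataset
     and g.plugin_name = l.plugin_name
    order by l.client_name, l.dataset, l.plugin_name
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = make_db()
    print(report(conn, "join"))
    print(report(conn, "left join"))
```

With the inner join only server01 comes back. With the left join server05 also appears, with lastgood and lastgood_bytes as NULL (None in Python). Which of the two you want is exactly the kind of thing the specification has to say.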

PS 3 See "Is there any rule of thumb to construct SQL query from a human-readable description?"




Answer 2:


Here is an alternate way that also works, but it is harder to follow and likely more particular to my dataset: http://sqlfiddle.com/#!15/f15556/114

select
  Actvty.client_name,
  Actvty.dataset,
  Actvty.plugin_name,
  ActvtyGood.LastGood,
  ActvtyGood.status_code as LastGood_status,
  ActvtyGood.bytes_modified as LastGood_bytes,
  ActvtyOnly.LastAttempt,
  Actvty.status_code as LastAttempt_status,
  Actvty.bytes_modified as LastAttempt_bytes
from v_activities_2 Actvty

-- 1. Get last attempt of each job (which may or may not match last good)
join (
  select
    client_name,
    dataset,
    plugin_name,
    max(completed_ts) as LastAttempt
  from v_activities_2
  where
    type like '%Backup%'
  group by
    client_name, dataset, plugin_name
) as ActvtyOnly
on Actvty.client_name  = ActvtyOnly.client_name and
   Actvty.dataset      = ActvtyOnly.dataset and
   Actvty.plugin_name  = ActvtyOnly.plugin_name and
   Actvty.completed_ts = ActvtyOnly.LastAttempt

-- 4. join the list of good runs with the table of last attempts, there would never be a job that has a last good without also a last attempt.
join (

  -- 3. join last good runs with the full table to get the additional details of each
  select
    ActvtyGoodSub.client_name,
    ActvtyGoodSub.dataset,
    ActvtyGoodSub.plugin_name,
    ActvtyGoodSub.LastGood,
    ActvtyAll.status_code,
    ActvtyAll.bytes_modified
  from v_activities_2 ActvtyAll

  -- 2. Get last Good run of each job
  join (
    select
      client_name,
      dataset,
      plugin_name,
      max(completed_ts) as LastGood
    from v_activities_2
    where
      type like '%Backup%'
      and status_code in (30000,30005)   -- Successful (Good) Status codes
    group by
      client_name, dataset, plugin_name
  ) as ActvtyGoodSub
  on ActvtyAll.client_name  = ActvtyGoodSub.client_name and
     ActvtyAll.dataset      = ActvtyGoodSub.dataset and
     ActvtyAll.plugin_name  = ActvtyGoodSub.plugin_name and
     ActvtyAll.completed_ts = ActvtyGoodSub.LastGood

) as ActvtyGood
on Actvty.client_name  = ActvtyGood.client_name and
   Actvty.dataset      = ActvtyGood.dataset and
   Actvty.plugin_name  = ActvtyGood.plugin_name


Source: https://stackoverflow.com/questions/47758492/multiple-self-join-based-on-group-by-results
