SQL - Recursive Tree Hierarchy with Record at Each Level

Question

I'm trying to build a classic hierarchy tree in SQL, using SAS (which does not support WITH RECURSIVE, so far as I know).

Here's the simplified data structure of the existing table:

|USER_ID|SUPERVISOR_ID|

So, to build the hierarchy, you self-join the table repeatedly, once per level, on SUPERVISOR_ID = USER_ID until you have the data you are looking for. In my company, that is 16 levels.

The issue comes when trying to get a branch to terminate for each user. For example, suppose User A at level 1 has Users B, C, D, and E under them at level 2. Using a repeated LEFT JOIN, you would get:

| -- Level 1 -- | -- Level 2 -- |
     User A          User B
     User A          User C
     User A          User D
     User A          User E

The issue is that User A does not have their own terminating branch. The end result needed is:

| -- Level 1 -- | -- Level 2 -- |
     User A           NULL         
     User A          User B
     User A          User C
     User A          User D
     User A          User E

My first thought is to create a temp table at each level and then UNION ALL the results together. However, that seems terribly inefficient given the depth (16 levels), and I'm hoping I'm missing a cleaner solution.
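One way to read the desired result is that every user gets exactly one row: the chain of supervisors from the top of the tree down to that user, padded with NULLs. The sketch below illustrates that reading in Python rather than SAS, purely to show the logic. The `supervisor_of` dict, the `path_rows` helper and the sample IDs are hypothetical, and it assumes each user has at most one supervisor and that the data contains no cycles.

```python
def path_rows(supervisor_of, depth):
    """Return one row per user: the chain from the top of the tree down to
    that user, right-padded with None up to `depth` columns."""
    users = set(supervisor_of) | {s for s in supervisor_of.values() if s is not None}
    rows = []
    for user in users:
        chain = [user]
        # Walk upward until we reach a user with no supervisor on record.
        while supervisor_of.get(chain[-1]) is not None:
            chain.append(supervisor_of[chain[-1]])
        chain.reverse()  # top-level supervisor first
        rows.append(tuple(chain + [None] * (depth - len(chain))))
    # Sort so that a None (terminating) entry comes before any real value.
    return sorted(rows, key=lambda r: [(v is not None, v) for v in r])


# Hypothetical sample: User A supervises B, C, D and E.
supervisor_of = {"A": None, "B": "A", "C": "A", "D": "A", "E": "A"}
rows = path_rows(supervisor_of, depth=2)
for row in rows:
    print(row)
```

Under this reading the "terminating" row for User A is simply User A's own path, so no per-level temp tables are needed; whether that maps cleanly onto a SAS implementation depends on the real data.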

Answer 1:

I'm not quite sure I understand the question, but if you're trying to generate a full listing of all employees under each supervisor then this is one way of doing it, assuming that each employee has a unique ID, which can appear in either the user or supervisor column:

data employees;
input SUPERVISOR_ID USER_ID;
cards;
1 2
1 3
1 4
2 5
2 6
2 7
7 8
;
run;

proc sql;
  create view distinct_employees as 
  select distinct SUPERVISOR_ID as USER_ID from employees
  union
  select distinct USER_ID from employees;
quit;

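data hierarchy;
  /* Define SUPERVISOR_ID and USER_ID in the PDV without reading any rows */
  if 0 then set employees;
  set distinct_employees;
  if _n_ = 1 then do;
    /* Load the lookup once: USER_ID -> SUPERVISOR_ID */
    declare hash h(dataset:'employees');
    rc = h.definekey('USER_ID');
    rc = h.definedata('SUPERVISOR_ID');
    rc = h.definedone();
  end;
  T_USER_ID = USER_ID;
  /* Walk up the tree, writing one row per ancestor of the starting user */
  do while(h.find() = 0);
    USER_ID = T_USER_ID;
    output;
    USER_ID = SUPERVISOR_ID;
  end;
  drop rc T_USER_ID;
run;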

proc sort data = hierarchy;
  by SUPERVISOR_ID USER_ID;
run;
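For readers less familiar with the hash object, the same walk-up-the-tree logic can be sketched in Python. This is only an illustration of the idea, not a replacement for the SAS code: a dict stands in for the hash, and the variable names are my own. Like the DATA step, it assumes each user has a single supervisor and would loop forever on cyclic data.

```python
# (SUPERVISOR_ID, USER_ID) pairs, same values as the employees data set.
employees = [(1, 2), (1, 3), (1, 4), (2, 5), (2, 6), (2, 7), (7, 8)]

# Lookup table playing the role of the hash object: USER_ID -> SUPERVISOR_ID.
supervisor_of = {user: sup for sup, user in employees}

# Every ID that appears in either column, like the distinct_employees view.
all_ids = sorted({i for pair in employees for i in pair})

hierarchy = []
for user in all_ids:
    current = user
    # Keep climbing while the current ID has a supervisor on record.
    while current in supervisor_of:
        current = supervisor_of[current]
        hierarchy.append((current, user))  # (SUPERVISOR_ID, USER_ID)

hierarchy.sort()  # same ordering as the PROC SORT step
for row in hierarchy:
    print(row)
```

Each employee ends up paired with every supervisor above them, direct or indirect, which is the "full listing of all employees under each supervisor" described above.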

Answer 2:

Consider some simple process P that creates your rectangle of possible paths from a set of (super_id, user_id).

A path of length N is N levels deep and links up (N-1) relationships.

Are the values at each level distinct to that level?

No? P will find cycles, cross-over paths and wrap-around paths in addition to the actual paths. A wrap-around is when a node at actual path level > 1 is 'found' to be a level = 1 node.
Yes? P will find the actual paths, cross-over paths and wrap-around paths. Additional data restrictions or rules can help eliminate the false ones.

Consider 4 simple paths with indistinct level values:

data path(keep=L1-L4) rels(keep=row_id super_id user_id);
  array L(4);
  input L(*);
  output path;
  super_id = L(1);
  do i = 2 to dim(L);
    user_id = L(i);
    row_id + 1; /* assumed: unique id per relation; the later row_id query needs it */
    output rels;
    super_id = user_id;
  end;
datalines;
1 3 1 4
1 5 1 4
2 3 2 3
1 2 3 4
;
run;

There are 12 relationship-only records. Neither the path a pair lives in nor the level at which it sits is known.

What follows is an explicit two-stage query for assembling the four-level paths from the relations. If the code works, it can be abstracted into a macro.

proc sql;

  * RELS cross RELS, extensive i/o;
  * get on the induction ladder;

  create table ITER_1 as
  select distinct
    S.super_id as L3 /* parent^2 */
  , S.user_id as L2 /* parent */ 
  , U.user_id as L1 /* leaf */
  from RELS U
  cross join RELS S 
  where S.user_id = U.super_id
  order by L3, L2, L1
  ;

  * ITER_1 cross RELS, little less extensive i/o;
  * if you see the inductive variation you can macroize it;

  create table ITER_2 as
  select distinct
    S.super_id as L4 /* parent^3 */
  , U.L3 /* parent^2 */
  , U.L2 /* parent */
  , U.L1 /* leaf */
  from ITER_1 U
  cross join RELS S
  where S.user_id = U.L3
  order by L4, L3, L2, L1
  ;
quit;

The above assembler has no knowledge of pair identity and cannot restrict itself to paths made of distinct pairs. So there will be cycles, cross-overs and wraps.

Found paths (with some explanations):

 1 : 1 2 3 1   path 4 L3 xover to path 1 L2
 2 : 1 2 3 2   path 4 L3 xover to path 3 L2
 3 : 1 2 3 4   actual
 4 : 1 3 1 2   path 1 L3 xover to path 4 L1
 5 : 1 3 1 3
 6 : 1 3 1 4   actual
 7 : 1 3 1 5
 8 : 1 3 2 3
 9 : 1 5 1 2
10 : 1 5 1 3
11 : 1 5 1 4   actual
12 : 1 5 1 5
13 : 2 3 1 2
14 : 2 3 1 3
15 : 2 3 1 4
16 : 2 3 1 5
17 : 2 3 2 3   actual, but also a cycle
18 : 3 1 2 3
19 : 3 1 3 1
20 : 3 1 3 2
21 : 3 1 3 4
22 : 3 1 5 1
23 : 3 2 3 1
24 : 3 2 3 2
25 : 3 2 3 4
26 : 5 1 2 3
27 : 5 1 3 1
28 : 5 1 3 2
29 : 5 1 3 4
30 : 5 1 5 1   path 2 L3 cycled to path 2 L1
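The found paths listed above can be cross-checked with a small Python sketch. It is not a translation of the PROC SQL: it walks from the top level down rather than from the leaf up, and the helper names are my own. It does follow the same rule, though, of chaining any relation whose super_id matches the previous user_id, with no knowledge of which path a relation came from.

```python
paths = [(1, 3, 1, 4), (1, 5, 1, 4), (2, 3, 2, 3), (1, 2, 3, 4)]

# Break each path into (super_id, user_id) pairs, as the DATA step does.
rels = [(p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)]

# super_id -> set of user_ids directly beneath it (duplicates collapse).
children = {}
for sup, user in set(rels):
    children.setdefault(sup, set()).add(user)


def extend(found):
    """Append one more level to every partial path by following a relation."""
    return {p + (c,) for p in found for c in children.get(p[-1], ())}


found = {(sup,) for sup in children}
for _ in range(3):  # three relations make a four-level path
    found = extend(found)

found = sorted(found)
for n, p in enumerate(found, start=1):
    print(n, ":", *p)
```

With the four sample paths this yields 30 distinct four-level paths, of which only four are the actual ones.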

If the ids at each relationship level are not found in any other level then cycles are implicitly eliminated. Cross-overs cannot be eliminated because there is no path identity information. The same goes for wrap-arounds.

A more complicated SQL query can ensure that each relation in the found 'paths' appears only once and that the contents of the paths are distinct. Depending on the actual data you may still have a large number of false paths.

The highly regular code is suitable for macro-izing; however, the actual SQL run-times are highly dependent on the actual data and on how the RELS data set is indexed.

proc sql;

create table ITER_1 as
select 
  L3 /* parent^2 */
, L2 /* parent */ 
, L1 /* leaf */
, R1
, R2
from 
(
  select distinct
    S.super_id as L3 /* parent^2 */
  , S.user_id as L2 /* parent */ 
  , U.user_id as L1 /* leaf */
  , U.row_id as R1
  , S.row_id as R2
  , monotonic() as seq
  from RELS U
  cross join RELS S 
  where S.user_id = U.super_id
    and S.row_id < U.row_id  /* triangular constraint allowed due to symmetry */
)
group by L3, L2, L1
having seq = min(seq)
order by L3, L2, L1
;

create table ITER_2 as
select
  L4 /* parent^3 */ format=6.
, L3 /* parent^2 */ format=6.
, L2 /* parent */ format=6.
, L1 /* leaf */ format=6.
, R1 format=6.
, R2 format=6.
, R3 format=6.
from
(
  select distinct
    S.super_id as L4 /* parent^3 */ format=6.
  , U.L3 /* parent^2 */ format=6.
  , U.L2 /* parent */ format=6.
  , U.L1 /* leaf */ format=6.
  , U.R1 format=6.
  , U.R2 format=6.
  , S.row_id as R3 format=6.
  , monotonic() as seq
  from ITER_1 U
  cross join RELS S
  where S.user_id = U.L3
    and S.row_id ne R1
    and S.row_id ne R2
)
group by L4, L3, L2, L1
having seq = min(seq)
order by L4, L3, L2, L1
;

quit;
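The core restriction, that no relation row may be used twice within one path, can be sketched in Python as follows. This is an illustration under assumptions rather than a port of the query: row_id is taken to be a simple running number over the relation rows, and the sketch does not reproduce the triangular row_id constraint or the min(seq) de-duplication, so its output need not match the SQL result row for row.

```python
paths = [(1, 3, 1, 4), (1, 5, 1, 4), (2, 3, 2, 3), (1, 2, 3, 4)]

# One (row_id, super_id, user_id) record per relation; row_id is assumed
# to be a running number in the order the relations are written out.
rels = []
for p in paths:
    for i in range(len(p) - 1):
        rels.append((len(rels) + 1, p[i], p[i + 1]))


def grow(partials):
    """Extend each (path, used_row_ids) by one relation row not used before."""
    out = []
    for path, used in partials:
        for row_id, sup, user in rels:
            if sup == path[-1] and row_id not in used:
                out.append((path + (user,), used | {row_id}))
    return out


# Start with every single relation, then add two more to reach four levels.
partials = [((sup, user), frozenset([row_id])) for row_id, sup, user in rels]
for _ in range(2):
    partials = grow(partials)

restricted = sorted({path for path, _ in partials})
for p in restricted:
    print(*p)
```

A path such as 1 3 1 3 disappears because it would need the single (1, 3) row twice, while 2 3 2 3 survives because (2, 3) occurs in more than one row. Many false paths still remain, which is the point made above.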

The final tweak to add a NULL item would require even more SQL.

Is it possible to process the discovered hierarchies without needing the NULL? A DATA step SET statement with BY-group processing can detect the end of a level using the LAST.variable automatic variable.
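As a rough analogy for what LAST.variable provides, the Python sketch below flags the final row of each group in data already sorted by the grouping key. The sample rows and the `flagged` name are hypothetical; in SAS the flag comes for free from BY-group processing rather than from explicit grouping code.

```python
from itertools import groupby

# (supervisor, user) rows, already sorted by supervisor as BY processing requires.
rows = [(1, 2), (1, 3), (1, 4), (2, 5), (2, 6), (7, 8)]

flagged = []
for sup, group in groupby(rows, key=lambda r: r[0]):
    members = list(group)
    for i, (s, u) in enumerate(members):
        is_last = i == len(members) - 1  # plays the role of LAST.supervisor
        flagged.append((s, u, is_last))

for row in flagged:
    print(row)
```

Downstream logic can then act on the flag where it would otherwise have looked for the NULL row.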

Source: https://stackoverflow.com/questions/47804104/sql-recursive-tree-hierarchy-with-record-at-each-level

Tags: sql, recursion, sas, hierarchy