Joining 2 data sets via intersection

Question

I posted a version of this question previously, but I am struggling to adapt that answer to this slightly different data format, so I am asking again.

I have the following data set. Read it as: ID 1, Ford, has the attribute/value pairs A:B, B:C and C:D.

+------------------------------------------------+
| ID     NAME     Attribute      Attribute Value |
+------------------------------------------------+
| 1      Ford         A                  B       |
| 1      Ford         B                  C       |
| 1      Ford         C                  D       |
| 2      BMW          A                  B       |
| 2      BMW          C                  D       |
| 2      BMW          F                  G       |
| 3      TESLA        Z                  Y       |
| 3      TESLA        E                  F       |
| 3      TESLA        A                  B       |
+------------------------------------------------+

I would like to compare each ID in the table against all the others and output the result. The first comparison would check ID 1 against IDs 2 and 3 and show which attribute/value pairs match and which do not.

Output (first comparison only, against a single record):

+----------------------------------------------------------------------------+
| BaseID  BaseNAME   Target ID   TargetName    MatchedOn    Baseonly Tgtonly |
+----------------------------------------------------------------------------+
| 1        Ford         2          BMW           A:B;C:D     B:C     F:G     |
+----------------------------------------------------------------------------+

Previously a kind individual helped me implement a Cartesian product, but the data was in a slightly different format and the query was a bit too slow. Does anyone have ideas on the best way to get the desired result?

Answer 1:

This may be faster:

with 
  t1 as (select distinct a.id ia, a.name na, b.id ib, b.name nb 
           from t a join t b on a.id < b.id),
  t2 as (
    select ia, na, ib, nb, 
           cast(multiset(select attr||':'||val from t where id = ia intersect 
                         select attr||':'||val from t where id = ib ) 
                as sys.odcivarchar2list) a1, 
           cast(multiset(select attr||':'||val from t where id = ia minus 
                         select attr||':'||val from t where id = ib ) 
                as sys.odcivarchar2list) a2, 
           cast(multiset(select attr||':'||val from t where id = ib minus 
                         select attr||':'||val from t where id = ia ) 
                as sys.odcivarchar2list) a3 
      from t1)
select ia, na, ib, nb, 
       -- order by column_value keeps the order of the list items deterministic
       (select listagg(column_value, ';') within group (order by column_value) from table(t2.a1)) l1,
       (select listagg(column_value, ';') within group (order by column_value) from table(t2.a2)) l2,
       (select listagg(column_value, ';') within group (order by column_value) from table(t2.a3)) l3
  from t2
  order by ia, ib

dbfiddle demo

Subquery t1 builds the pairs of "cars" to compare.
Subquery t2 gathers, for each pair, collections of the common and the differing attributes. sys.odcivarchar2list is a built-in collection type, simply a table of strings.

The final query turns the collections into delimited strings. Result:

IA NA            IB NB    L1        L2           L3
-- ------------ --- ----- --------- ------------ -----------
 1 Ford           2 BMW   A:B;C:D   B:C          F:G
 1 Ford           3 TESLA A:B       B:C;C:D      E:F;Z:Y
 2 BMW            3 TESLA A:B       C:D;F:G      E:F;Z:Y

I expect this to be faster because it uses no user-defined function and keeps the number of operations small.
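
To cross-check the result table above outside the database, the same pairwise logic can be sketched in Python. This is only an illustration of the set operations (intersect, minus in both directions), not part of the Oracle solution; the names ROWS and compare_pairs are made up for this sketch, and the rows are copied from the question.

```python
from itertools import combinations

# Sample rows copied from the question: (id, name, attr, val).
ROWS = [
    (1, "Ford", "A", "B"), (1, "Ford", "B", "C"), (1, "Ford", "C", "D"),
    (2, "BMW", "A", "B"), (2, "BMW", "C", "D"), (2, "BMW", "F", "G"),
    (3, "TESLA", "Z", "Y"), (3, "TESLA", "E", "F"), (3, "TESLA", "A", "B"),
]


def compare_pairs(rows):
    """Return one (ia, na, ib, nb, common, a_only, b_only) tuple per id pair."""
    names, attrs = {}, {}
    for id_, name, attr, val in rows:
        names[id_] = name
        # Store each attribute/value pair as 'attr:val', like attr||':'||val.
        attrs.setdefault(id_, set()).add(f"{attr}:{val}")

    def join(items):
        # Sort before joining so the output is deterministic.
        return ";".join(sorted(items))

    result = []
    # combinations() over sorted ids yields the same pairs as a.id < b.id.
    for ia, ib in combinations(sorted(attrs), 2):
        a, b = attrs[ia], attrs[ib]
        result.append((ia, names[ia], ib, names[ib],
                       join(a & b),   # intersect
                       join(a - b),   # base minus target
                       join(b - a)))  # target minus base
    return result


for row in compare_pairs(ROWS):
    print(row)
```

For the sample data this yields the same three rows as the SQL result, which makes it a convenient reference when testing the query against larger inputs.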

The alternative is to use something like this function:

-- find different or common attributes
create or replace function dca(i1 in number, i2 in number, op in char) 
  return varchar2 is 
  ret varchar2(1000);
begin 
  case op 
    when 'M' then -- minus
      select listagg(attr||':'||val, ';') within group (order by null) into ret
        from (select attr, val from t where id = i1 minus 
              select attr, val from t where id = i2 );
    when 'I' then -- intersect
      select listagg(attr||':'||val, ';') within group (order by null) into ret
        from (select attr, val from t where id = i1 intersect 
              select attr, val from t where id = i2 );
  end case;
  return ret;
end;

in this query:

select ia, na, ib, nb, 
       dca(ia, ib, 'I') ab, dca(ia, ib, 'M') a_b, dca(ib, ia, 'M') b_a 
  from (select distinct a.id ia, a.name na, b.id ib, b.name nb 
          from t a join t b on a.id < b.id)
  order by ia, ib;

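It works too, but it relies on a UDF, and calling a UDF from SQL usually performs worse.

Answer 2:

Works in Oracle 12c and later, where a PL/SQL function can be declared inline in the WITH clause.

In 11g you can concatenate the collection elements using listagg or a standalone UDF instead.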

with
function collagg(p in sys.ku$_vcnt) return varchar2 is
result varchar2(4000);
begin
  -- prepend '; ' to every element, then strip the leading 2-character separator
  for i in 1..p.count loop result := result || '; ' || p(i); end loop;
  return(substr(result,3));
end;
t(id, name, attr, val) as
( select 1, 'Ford',  'A', 'B' from dual union all
  select 1, 'Ford',  'B', 'C' from dual union all
  select 1, 'Ford',  'C', 'D' from dual union all
  select 2, 'BMW',   'A', 'B' from dual union all
  select 2, 'BMW',   'C', 'D' from dual union all
  select 2, 'BMW',   'F', 'G' from dual union all
  select 3, 'TESLA', 'Z', 'Y' from dual union all
  select 3, 'TESLA', 'E', 'F' from dual union all
  select 3, 'TESLA', 'A', 'B' from dual)
, t0 as
(select id, name, 
        cast(collect(cast(attr||':'||val as varchar2(4000))) as sys.ku$_vcnt) c
   from t t1
  group by id, name)
select t1.id baseid,
       t1.name basename,
       t2.id tgtid,
       t2.name tgtname,
       collagg(t1.c multiset intersect t2.c) matchedon,
       collagg(t1.c multiset except t2.c) baseonly,
       collagg(t2.c multiset except t1.c) tgtonly
  from t0 t1 join t0 t2 on t1.id < t2.id;

Source: https://stackoverflow.com/questions/54476625/joining-2-data-sets-via-intersection

Tags

sql

Oracle

intersection