Group All Related Records in Many to Many Relationship, SQL graph connected components

后端 未结 6 878
遥遥无期
遥遥无期 2020-12-08 21:48

Hopefully I\'m missing a simple solution to this.

I have two tables. One contains a list of companies. The second contains a list of publishers. The mapping between

相关标签:
6条回答
  • 2020-12-08 22:25

    Your problem is a graph-walking problem of finding connected subgraphs. It is a little more challenging because your data structure has two types of nodes ("companies" and "pubishers") rather than one type.

    You can solve this with a single recursive CTE. The logic is as follows.

    First, convert the problem into a graph with only one type of node. I do this by making the nodes companies and the edges linkes between companies, using the publisher information. This is just a join:

          select t1.company as node1, t2.company as node2
          from table1 t1 join
               table1 t2
               on t1.publisher = t2.publisher
         )
    

    (For efficiency sake, you could also add t1.company <> t2.company but that is not strictly necessary.)

    Now, this is a "simple" graph walking problem, where a recursive CTE is used to create all connections between two nodes. The recursive CTE walks through the graph using join. Along the way, it keeps a list of all nodes visited. In SQL Server, this needs to be stored in a string.

    The code needs to ensure that it doesn't visit a node twice for a given path, because this can result in infinite recursion (and an error). If the above is called edges, the CTE that generates all pairs of connected nodes looks like:

         cte as (
          select e.node1, e.node2, cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
                 1 as level
          from edges e
          union all
          select c.node1, e.node2, c.nodes+e.node2+'|', 1+c.level
          from cte c join
               edges e
               on c.node2 = e.node1 and
                  c.nodes not like '|%'+e.node2+'%|'
         )
    

    Now, with this list of connected nodes, assign each node the minimum of all the nodes it is connected to, including itself. This serves as an identifier of connected subgraphs. That is, all companies connected to each other via the publishers will have the same minimum.

    The final two steps are to enumerate this minimum (as the GroupId) and to join the GroupId back to the original data.

    The full (and I might add tested) query looks like:

    with edges as (
          select t1.company as node1, t2.company as node2
          from table1 t1 join
               table1 t2
               on t1.publisher = t2.publisher
         ),
         cte as (
          select e.node1, e.node2,
                 cast('|'+e.node1+'|'+e.node2+'|' as varchar(max)) as nodes,
                 1 as level
          from edges e
          union all
          select c.node1, e.node2,
                 c.nodes+e.node2+'|',
                 1+c.level
          from cte c join
               edges e
               on c.node2 = e.node1 and
                  c.nodes not like '|%'+e.node2+'%|'
         ),
         nodes as (
           select node1,
                  (case when min(node2) < node1 then min(node2) else node1 end
                  ) as grp
           from cte
           group by node1
          )
    select t.company, t.publisher, grp.GroupId
    from table1 t join
         (select n.node1, dense_rank() over (order by grp) as GroupId
          from nodes n
         ) grp
         on t.company = grp.node1;
    

    Note that this works on finding any connected subgraphs. It does not assume that any particular number of levels.

    EDIT:

    The question of performance for this is vexing. At a minimum, the above query will run better with an index on Publisher. Better yet is to take @MikaelEriksson's suggestion, and put the edges in a separate table.

    Another question is whether you look for equivalency classes among the Companies or the Publishers. I took the approach of using Companies, because I think that has better "explanability" (my inclination to respond was based on numerous comments that this could not be done with CTEs).

    I am guessing that you could get reasonable performance from this, although that requires more knowledge of your data and system than provided in the OP. It is quite likely, though, that the best performance will come from a multiple query approach.

    0 讨论(0)
  • 2020-12-08 22:32

    Bit late to the challenge, and since SQLFiddle seems to be down ATM I'll have to guess your data-structures. Nevertheless, it seemed like a fun challenge (and it was =) so here's what I made from it :

    Setup:

    IF OBJECT_ID('t_link') IS NOT NULL DROP TABLE t_link
    IF OBJECT_ID('t_company') IS NOT NULL DROP TABLE t_company
    IF OBJECT_ID('t_publisher') IS NOT NULL DROP TABLE t_publisher
    IF OBJECT_ID('tempdb..#link_A') IS NOT NULL DROP TABLE #link_A
    IF OBJECT_ID('tempdb..#link_B') IS NOT NULL DROP TABLE #link_B
    GO
    
    CREATE TABLE t_company ( company_id     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
                             company_name   varchar(100) NOT NULL)
    
    GO 
    
    CREATE TABLE t_publisher (publisher_id     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
                              publisher_name   varchar(100) NOT NULL)
    
    CREATE TABLE t_link (company_id int NOT NULL FOREIGN KEY (company_id) REFERENCES t_company (company_id),
                         publisher_id int NOT NULL FOREIGN KEY (publisher_id) REFERENCES t_publisher (publisher_id),
                                    PRIMARY KEY (company_id, publisher_id),
                         group_id int NULL
                                 )
    GO
    
    -- example content
    
    
    -- ROW   GROUPID     Company     Publisher
    --1     1           A           Y
    --2     1           A           X
    --3     1           B           Y
    --4     1           B           Z
    --5     2           C           W
    --6     2           C           P
    --7     2           D           W
    
    
    INSERT t_company (company_name) VALUES ('A'), ('B'), ('C'), ('D')
    INSERT t_publisher (publisher_name) VALUES ('X'), ('Y'), ('Z'), ('W'), ('P')
    
    INSERT t_link (company_id, publisher_id)
    SELECT company_id, publisher_id
      FROM t_company, t_publisher
     WHERE (company_name = 'A' AND publisher_name = 'Y')
        OR (company_name = 'A' AND publisher_name = 'X')
        OR (company_name = 'B' AND publisher_name = 'Y')
        OR (company_name = 'B' AND publisher_name = 'Z')
        OR (company_name = 'C' AND publisher_name = 'W')
        OR (company_name = 'C' AND publisher_name = 'P')
        OR (company_name = 'D' AND publisher_name = 'W')
    
    
    
    
    GO
    
    /*
    -- volume testing
    
    TRUNCATE TABLE t_link
    DELETE t_company
    DELETE t_publisher
    
    
    DECLARE @company_count   int = 1000,
            @publisher_count int = 450,
            @links_count     int = 800
    
    
    INSERT t_company (company_name)
    SELECT company_name    = Convert(varchar(100), NewID())
      FROM master.dbo.fn_int_list(1, @company_count) 
    
    UPDATE STATISTICS t_company
    
    INSERT t_publisher (publisher_name)
    SELECT publisher_name  = Convert(varchar(100), NewID())
      FROM master.dbo.fn_int_list(1, @publisher_count) 
    
    UPDATE STATISTICS t_publisher
    
    -- Random links between the companies & publishers
    
    DECLARE @count int
    SELECT @count = 0
    
    WHILE @count < @links_count
        BEGIN
    
            SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), company_id = company_id + 0
              INTO #link_A
              FROM t_company
             ORDER BY NewID()
    
            SELECT TOP 30 PERCENT row_id = IDENTITY(int, 1, 1), publisher_id = publisher_id + 0
              INTO #link_B
              FROM t_publisher
             ORDER BY NewID()
    
            INSERT TOP (@links_count - @count) t_link (company_id, publisher_id)
            SELECT A.company_id,
                   B.publisher_id
              FROM #link_A A
              JOIN #link_B B
                ON A.row_id = B.row_id
             WHERE NOT EXISTS ( SELECT *
                                  FROM t_link old
                                 WHERE old.company_id   = A.company_id
                                   AND old.publisher_id = B.publisher_id)
    
            SELECT @count = @count + @@ROWCOUNT
    
            DROP TABLE #link_A
            DROP TABLE #link_B    
        END
    
    */
    

    Actual grouping:

    IF OBJECT_ID('tempdb..#links') IS NOT NULL DROP TABLE #links
    GO
    
    -- apply grouping
    
    -- init
    SELECT row_id = IDENTITY(int, 1, 1), 
           company_id,
           publisher_id,
           group_id = 0
      INTO #links
      FROM t_link
    
    -- don't see an index that would be actually helpful here right-away, using row_id to avoid HEAP
    CREATE CLUSTERED INDEX idx0 ON #links (row_id)
    --CREATE INDEX idx1 ON #links (company_id)   
    --CREATE INDEX idx2 ON #links (publisher_id)
    
    UPDATE #links
       SET group_id = row_id
    
    
    -- start grouping
    WHILE @@ROWCOUNT > 0
        BEGIN  
            UPDATE #links
               SET group_id = new_group_id
              FROM #links upd
              CROSS APPLY (SELECT new_group_id = Min(group_id)
                             FROM #links new
                            WHERE new.company_id   = upd.company_id
                               OR new.publisher_id = upd.publisher_id 
                                         ) x
            WHERE upd.group_id > new_group_id
    
            -- select * from #links
        END
    
    
    -- remove 'holes'
    UPDATE #links
       SET group_id = (SELECT COUNT(DISTINCT o.group_id) 
                              FROM #links o
                             WHERE o.group_id <= upd.group_id)
      FROM #links upd
    
    GO
    
    UPDATE t_link
       SET group_id = new.group_id
      FROM t_link upd
      LEFT OUTER JOIN #links new
                   ON new.company_id = upd.company_id
                  AND new.publisher_id = upd.publisher_id
    
    GO    
    SELECT row = ROW_NUMBER() OVER (ORDER BY group_id, company_name, publisher_name),
           l.group_id,
           c.company_name, -- c.company_id,
           p.publisher_name -- , p.publisher_id
     from t_link l
     JOIN t_company c
       ON l.company_id = c.company_id
     JOIN t_publisher p 
       ON p.publisher_id = l.publisher_id
     ORDER BY 1
    

    At first sight this approach hasn't been tried yet by anyone else, interesting to see how this can be done in a variety of ways... (preferred not to read them upfront as it would spoil the puzzle =)

    Results look as expected (as far as I understand the requirements and the example) and performance isn't too shabby either although there is no real indication on the amount of records this should work on; not sure how it would scale but don't expect too many problems either...

    0 讨论(0)
  • 2020-12-08 22:35

    I thought about using recursive CTE, but, as far as I know, it's not possible in SQL Server to use UNION to connect anchor member and a recursive member of recursive CTE (I think it's possible to do in PostgreSQL), so it's not possible to eliminate duplicates.

    declare @i int
    
    with cte as (
         select
             GroupID,
             row_number() over(order by Company) as rn
         from Table1
    )
    update cte set GroupID = rn
    
    select @i = @@rowcount
    
    -- while some rows updated
    while @i > 0
    begin
        update T1 set
            GroupID = T2.GroupID
        from Table1 as T1
            inner join (
                select T2.Company, min(T2.GroupID) as GroupID
                from Table1 as T2
                group by T2.Company
            ) as T2 on T2.Company = T1.Company
        where T1.GroupID > T2.GroupID
    
        select @i = @@rowcount
    
        update T1 set
            GroupID = T2.GroupID
        from Table1 as T1
            inner join (
                select T2.Publisher, min(T2.GroupID) as GroupID
                from Table1 as T2
                group by T2.Publisher
            ) as T2 on T2.Publisher = T1.Publisher
        where T1.GroupID > T2.GroupID
    
        -- will be > 0 if any rows updated
        select @i = @i + @@rowcount
    end
    
    ;with cte as (
         select
             GroupID,
             dense_rank() over(order by GroupID) as rn
         from Table1
    )
    update cte set GroupID = rn
    

    sql fiddle demo

    I've also tried a breadth first search algorithm. I thought it could be faster (it's better in terms of complexity), so I'll provide a solution here. I've found that it's not faster than SQL approach, though:

    declare @Company nvarchar(2), @Publisher nvarchar(2), @GroupID int
    
    declare @Queue table (
        Company nvarchar(2), Publisher nvarchar(2), ID int identity(1, 1),
        primary key(Company, Publisher)
    )
    
    select @GroupID = 0
    
    while 1 = 1
    begin
        select top 1 @Company = Company, @Publisher = Publisher
        from Table1
        where GroupID is null
    
        if @@rowcount = 0 break
    
        select @GroupID = @GroupID + 1
    
        insert into @Queue(Company, Publisher)
        select @Company, @Publisher
    
        while 1 = 1
        begin
            select top 1 @Company = Company, @Publisher = Publisher
            from @Queue
            order by ID asc
    
            if @@rowcount = 0 break
    
            update Table1 set
                GroupID = @GroupID
            where Company = @Company and Publisher = @Publisher
    
            delete from @Queue where Company = @Company and Publisher = @Publisher
    
            ;with cte as (
                select Company, Publisher from Table1 where Company = @Company and GroupID is null
                union all
                select Company, Publisher from Table1 where Publisher = @Publisher and GroupID is null
            )
            insert into @Queue(Company, Publisher)
            select distinct c.Company, c.Publisher
            from cte as c
            where not exists (select * from @Queue as q where q.Company = c.Company and q.Publisher = c.Publisher)
       end
    end
    

    sql fiddle demo

    I've tested my version and Gordon Linoff's to check how it's perform. It looks like CTE is much worse, I couldn't wait while it's complete on more than 1000 rows.

    Here's sql fiddle demo with random data. My results were:
    128 rows:
    my RBAR solution: 190ms
    my SQL solution: 27ms
    Gordon Linoff's solution: 958ms
    256 rows:
    my RBAR solution: 560ms
    my SQL solution: 1226ms
    Gordon Linoff's solution: 45371ms

    It's random data, so results may be not very consistent. I think timing could be changed by indexes, but don't think it could change a whole picture.

    old version - using temporary table, just calculating GroupID without touching initial table:

    declare @i int
    
    -- creating table to gather all possible GroupID for each row
    create table #Temp
    (
        Company varchar(1), Publisher varchar(1), GroupID varchar(1),
        primary key (Company, Publisher, GroupID)
    )
    
    -- initializing it with data
    insert into #Temp (Company, Publisher, GroupID)
    select Company, Publisher, Company
    from Table1
    
    select @i = @@rowcount
    
    -- while some rows inserted into #Temp
    while @i > 0
    begin
        -- expand #Temp in both directions
        ;with cte as (
            select
                T2.Company, T1.Publisher,
                T1.GroupID as GroupID1, T2.GroupID as GroupID2
            from #Temp as T1
                inner join #Temp as T2 on T2.Company = T1.Company
            union
            select
                T1.Company, T2.Publisher,
                T1.GroupID as GroupID1, T2.GroupID as GroupID2
            from #Temp as T1
                inner join #Temp as T2 on T2.Publisher = T1.Publisher        
        ), cte2 as (
            select
                Company, Publisher,
                case when GroupID1 < GroupID2 then GroupID1 else GroupID2 end as GroupID
            from cte
        )
        insert into #Temp
        select Company, Publisher, GroupID
        from cte2
        -- don't insert duplicates
        except
        select Company, Publisher, GroupID
        from #Temp
    
        -- will be > 0 if any row inserted
        select @i = @@rowcount
    end
    
    select
        Company, Publisher,
        dense_rank() over(order by min(GroupID)) as GroupID
    from #Temp
    group by Company, Publisher
    

    => sql fiddle example

    0 讨论(0)
  • 2020-12-08 22:36

    You are trying to find all of the connected components of your graph, which can only be done iteratively. If you know the maximum width of any connected component (i.e. the maximum number of links you will have to take from one company/publisher to another), you could in principle do it something like this:

    SELECT
        MIN(x2.groupID) AS groupID,
        x1.Company,
        x1.Publisher
    FROM Table1 AS x1
        INNER JOIN (
            SELECT
                MIN(x2.Company) AS groupID,
                x1.Company,
                x1.Publisher
            FROM Table1 AS x1
                INNER JOIN Table1 AS x2
                ON x1.Publisher = x2.Publisher
            GROUP BY
                x1.Publisher,
                x1.Company
        ) AS x2
        ON x1.Company = x2.Company
    GROUP BY
        x1.Publisher,
        x1.Company;
    

    You have to keep nesting the subquery (alternating joins on Company and Publisher, and with the deepest subquery saying MIN(Company) rather than MIN(groupID)) to the maximum iteration depth.

    I don't really recommend this, though; it would be cleaner to do this outside of SQL.

    Disclaimer: I don't know anything about SQL Server 2012 (or any other version); it may have some kind of additional scripting ability to let you do this iteration dynamically.

    0 讨论(0)
  • 2020-12-08 22:42

    Here is my solution SQL Fiddle

    The nature of the relationships require looping as I figure.


    Here is the SQL:

    --drop TABLE Table1
    
    CREATE TABLE Table1
        ([row] int identity (1,1),GroupID INT NULL,[Company] varchar(2), [Publisher] varchar(2))
    ;
    
    INSERT INTO Table1
        (Company, Publisher)
    select
        left(newid(), 2), left(newid(), 2)
    
    declare @i int = 1
    
    while @i < 8
    begin
        ;with cte(Company, Publisher) as (
            select
                left(newid(), 2), left(newid(), 2)
            from Table1
        )
        insert into Table1(Company, Publisher)
        select distinct c.Company, c.Publisher
        from cte as c
        where not exists (select * from Table1 as t where t.Company = c.Company and t.Publisher = c.Publisher)
    
        set @i = @i + 1
    end;
    
    
    CREATE NONCLUSTERED INDEX IX_Temp1 on Table1 (Company)
    CREATE NONCLUSTERED INDEX IX_Temp2 on Table1 (Publisher)
    
    declare @counter int=0
    declare @row int=0
    declare @lastnullcount int=0
    declare @currentnullcount int=0
    
    WHILE EXISTS (
      SELECT *
      FROM Table1
      where GroupID is null
      )
    BEGIN
        SET @counter=@counter+1
        SET @lastnullcount =0
    
        SELECT TOP 1
            @row=[row]
        FROM Table1
        where GroupID is null
        order by [row] asc
    
        SELECT @currentnullcount=count(*) from table1 where groupid is null
        WHILE @lastnullcount <> @currentnullcount
        BEGIN
            SELECT @lastnullcount=count(*)
            from table1
            where groupid is null 
    
            UPDATE Table1
            SET GroupID=@counter
            WHERE [row]=@row
    
            UPDATE t2
            SET t2.GroupID=@counter
            FROM Table1 t1
            INNER JOIN Table1 t2 on t1.Company=t2.Company
            WHERE t1.GroupID=@counter
            AND t2.GroupID IS NULL
    
            UPDATE t2
            SET t2.GroupID=@counter
            FROM Table1 t1
            INNER JOIN Table1 t2 on t1.publisher=t2.publisher
            WHERE t1.GroupID=@counter
            AND t2.GroupID IS NULL
    
            SELECT @currentnullcount=count(*)
            from table1
            where groupid is null
        END
    END
    
    SELECT * FROM Table1
    

    Edit: Added indexes as I would expect on the real table and be more in line with the other data sets Roman is using.

    0 讨论(0)
  • 2020-12-08 22:47

    This is a recursive solution, using XML:

    with a as ( -- recursive result, containing shorter subsets and duplicates
        select cast('<c>' + company + '</c>' as xml) as companies
              ,cast('<p>' + publisher + '</p>' as xml) as publishers
          from Table1
    
        union all
    
        select a.companies.query('for $c in distinct-values((for $i in /c return string($i),
                                                            sql:column("t.company")))
                              order by $c
                              return <c>{$c}</c>')
              ,a.publishers.query('for $p in distinct-values((for $i in /p return string($i),
                                                             sql:column("t.publisher")))
                              order by $p
                              return <p>{$p}</p>')
        from a join Table1 t
          on (   a.companies.exist('/c[text() = sql:column("t.company")]') = 0 
              or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 0)
         and (   a.companies.exist('/c[text() = sql:column("t.company")]') = 1
              or a.publishers.exist('/p[text() = sql:column("t.publisher")]') = 1)
    ), b as ( -- remove the shorter versions from earlier steps of the recursion and the duplicates
        select distinct -- distinct cannot work on xml types, hence cast to nvarchar
               cast(companies as nvarchar) as companies
              ,cast(publishers as nvarchar) as publishers
              ,DENSE_RANK() over(order by cast(companies as nvarchar), cast(publishers as nvarchar)) as groupid
         from a
        where not exists (select 1 from a as s -- s is a proper subset of a
                           where (cast('<s>' + cast(s.companies as varchar)
                                     + '</s><a>' + cast(a.companies as varchar) + '</a>' as xml)
                                 ).value('if((count(/s/c) > count(/a/c))
                                             and (some $s in /s/c/text() satisfies
                                                 (some $a in /a/c/text() satisfies $s = $a))
                                          ) then 1 else 0', 'int') = 1
                         )
          and not exists (select 1 from a as s -- s is a proper subset of a
                           where (cast('<s>' + cast(s.publishers as nvarchar)
                                     + '</s><a>' + cast(a.publishers as nvarchar) + '</a>' as xml)
                                 ).value('if((count(/s/p) > count(/a/p))
                                             and (some $s in /s/p/text() satisfies
                                                 (some $a in /a/p/text() satisfies $s = $a))
                                          ) then 1 else 0', 'int') = 1
                         )
    ), c as (  -- cast back to xml
        select cast(companies as xml) as companies
              ,cast(publishers as xml) as publishers
              ,groupid
          from b
    )
    select Co.company.value('(./text())[1]', 'varchar') as company
          ,Pu.publisher.value('(./text())[1]', 'varchar') as publisher
          ,c.groupid
      from c
           cross apply companies.nodes('/c') as Co(company)
           cross apply publishers.nodes('/p') as Pu(publisher)
     where exists(select 1 from Table1 t -- restrict to only the combinations that exist in the source
                   where t.company = Co.company.value('(./text())[1]', 'varchar')
                     and t.publisher = Pu.publisher.value('(./text())[1]', 'varchar')
                 )
    

    The set of companies and the set of publishers are kept in XML fields in the intermediate steps, and there is some casting between xml and nvarchar necessary due to some limitations of SQL Server (like not being able to group or use distinct on XML columns.

    0 讨论(0)
提交回复
热议问题