SQL query to GROUP BY a comma-joined column

孤独总比滥情好 2020-12-10 22:57

My table structure is like below; the "Mail" column can contain multiple email addresses joined by commas, and I want to count how many rows each individual address appears in.

Data (int)

Mail (varchar)

6 answers
  • 2020-12-10 23:13

    SQL Server: using a recursive CTE.

    declare @Mail table (ID int, Mail varchar(200))
    
    insert into @Mail values
    (1, 'm1@gmail.com,m2@hotmail.com'),
    (2, 'm2@hotmail.com,m3@test.com'),
    (3, 'm2@hotmail.com')
    
    ;with cte1 as
    (
      select Mail+',' as Mail
      from @Mail
    ),
    cte2
    as
    (
      select
        left(Mail, charindex(',', Mail)-1) as Mail1,
        right(Mail, len(Mail)-charindex(',', Mail)) as Mail
      from cte1
      union all
      select
        left(Mail, charindex(',', Mail)-1) as Mail1,
        right(Mail, len(Mail)-charindex(',', Mail)) as Mail
      from cte2
      where charindex(',', Mail) > 1
    )
    select
      Mail1 as Mail,
      count(*) as [Count]
    from cte2
    group by Mail1
    

    Edit 1: same as before, but it now also handles the case where Mail contains only a single email — the trailing comma appended in cte1 guarantees every row has at least one delimiter to split on.
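    As a portable sanity check (not part of the original answer): the same recursive-CTE split can be reproduced in SQLite, driven here from Python with the sample data above; substr/instr stand in for LEFT/RIGHT/CHARINDEX.

```python
import sqlite3

# Recreate the sample table and split the comma-joined Mail column with a
# recursive CTE, mirroring the T-SQL above (SQLite syntax: instr/substr).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Mail (ID INTEGER, Mail TEXT);
INSERT INTO Mail VALUES
  (1, 'm1@gmail.com,m2@hotmail.com'),
  (2, 'm2@hotmail.com,m3@test.com'),
  (3, 'm2@hotmail.com');
""")

rows = con.execute("""
WITH RECURSIVE cte1 AS (
  SELECT Mail || ',' AS Mail FROM Mail        -- append the trailing comma
),
cte2 AS (
  SELECT substr(Mail, 1, instr(Mail, ',') - 1) AS Mail1,  -- before first comma
         substr(Mail, instr(Mail, ',') + 1)   AS Mail     -- remainder after it
  FROM cte1
  UNION ALL
  SELECT substr(Mail, 1, instr(Mail, ',') - 1),
         substr(Mail, instr(Mail, ',') + 1)
  FROM cte2
  WHERE instr(Mail, ',') > 1                  -- stop once the remainder is empty
)
SELECT Mail1, COUNT(*) FROM cte2 GROUP BY Mail1 ORDER BY Mail1
""").fetchall()
print(rows)
```

    Run against the three sample rows, it yields one count per distinct address, with m2@hotmail.com counted three times.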

  • 2020-12-10 23:13

    The correct thing to do would be to add a related table to store the multiple emails. Storing things in a comma-delimited list is virtually always a bad design decision, as you have found in trying to query it. It generally means you have a one-to-many relationship and need a related table. The task you want to do is trivial once the tables are properly related.

    I don't buy "I can't change the table structure" as an excuse. Unless this is a commercial product that your company doesn't own, you can change the structure; you just need to show management why it is necessary. Someone at your organization can change the database structure: find out who, and convince them why it needs to change. If it is a commercial database, consider creating a trigger on the table that populates a related table of your own every time the email field is inserted, updated, or deleted. Then you only go through the splitting process once per record change, rather than every time the query runs.
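    To make the suggestion concrete (a sketch only — SQLite via Python for illustration, and the Record/RecordMail names are invented): with a one-to-many child table holding one address per row, the asker's count becomes a plain GROUP BY.

```python
import sqlite3

# Normalized design: one row per (ID, address) in a related child table,
# so counting addresses needs no string splitting at all.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Record (ID INTEGER PRIMARY KEY);
CREATE TABLE RecordMail (              -- one-to-many child table
  ID   INTEGER REFERENCES Record(ID),
  Mail TEXT NOT NULL
);
INSERT INTO Record VALUES (1), (2), (3);
INSERT INTO RecordMail VALUES          -- the question's data, already split
  (1, 'm1@gmail.com'), (1, 'm2@hotmail.com'),
  (2, 'm2@hotmail.com'), (2, 'm3@test.com'),
  (3, 'm2@hotmail.com');
""")

counts = con.execute("""
SELECT Mail, COUNT(*) FROM RecordMail GROUP BY Mail ORDER BY Mail
""").fetchall()
print(counts)
```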

  • 2020-12-10 23:15

    This is a supplementary answer to show the performance of the various options:

    Fill a sample table with some data

    create table tmp1 ([Data] int, [Mail] varchar(200))
    insert tmp1 SELECT 1,'m1@gmail.com,m2@hotmail.com,other, longer@test, fifth'
    UNION ALL   SELECT 2,'m2@hotmail.com,m3@test.com'
    UNION ALL   SELECT 3,'m3@single.com'
    UNION ALL   SELECT 4,''
    UNION ALL   SELECT 5,null
    
    insert tmp1
    select data*10000 + number, mail
    from tmp1, master..spt_values v
    where v.type='P'
    
    -- total rows: 10245
    

    Test query:

    set statistics io on
    set statistics time on
    
    dbcc dropcleanbuffers
    dbcc freeproccache
    
    select single, count(*) [Count]
    from
    (
        select ltrim(rtrim(substring(t.mail, v.number+1,
            isnull(nullif(charindex(',',t.mail,v.number+1),0)-v.number-1,200)))) single
        from tmp1 t
        inner join master..spt_values v on v.type='p'
            and v.number <= len(t.Mail)
            and (substring(t.mail,v.number,1) = ',' or v.number=0)
    ) X
    group by single
    
    dbcc dropcleanbuffers
    dbcc freeproccache
    
    ;with cte1 as
    (
      select Mail+',' as Mail
      from tmp1
    ),
    cte2
    as
    (
      select
        left(Mail, charindex(',', Mail)-1) as Mail1,
        right(Mail, len(Mail)-charindex(',', Mail)) as Mail
      from cte1
      union all
      select
        left(Mail, charindex(',', Mail)-1) as Mail1,
        right(Mail, len(Mail)-charindex(',', Mail)) as Mail
      from cte2
      where charindex(',', Mail) > 1
    )
    select
      Mail1 as Mail,
      count(*) as [Count]
    from cte2
    group by Mail1
    
    dbcc dropcleanbuffers
    dbcc freeproccache
    
    --SET ANSI_DEFAULTS ON
    --SET ANSI_NULLS ON
    ;
    SELECT address  AS Mail,
           COUNT(*) AS [Count]
    FROM   tmp1
           CROSS APPLY (SELECT CAST('<m>' + REPLACE([Mail], ',', '</m><m>') + '</m>'
                                    AS XML
                               ) AS x) ca1
           CROSS APPLY (SELECT T.split.value('.', 'varchar(200)') AS address
                        FROM   x.nodes('/m') T(split)) ca
    GROUP  BY address  
    

    Statistics

    Run each a few times to get a feel for the averages.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'spt_values'. Scan count 8196, logical reads 26637, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'tmp1'. Scan count 3, logical reads 43, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    
     SQL Server Execution Times:
       CPU time = 641 ms,  elapsed time = 412 ms.
    
    Table 'Worktable'. Scan count 2, logical reads 103271, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'tmp1'. Scan count 1, logical reads 43, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    
     SQL Server Execution Times:
       CPU time = 609 ms,  elapsed time = 614 ms.
    
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'tmp1'. Scan count 3, logical reads 43, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    
     SQL Server Execution Times:
       CPU time = 2798 ms,  elapsed time = 1421 ms.
    
    Table 'Worktable'. Scan count 2, logical reads 103334, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'tmp1'. Scan count 1, logical reads 43, physical reads 0, read-ahead reads 14, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    
     SQL Server Execution Times:
       CPU time = 734 ms,  elapsed time = 742 ms.
    

    Summary

    First (CHARINDEX) : CPU time = 344 ms, elapsed time = 198 ms.
    Second (CTE) : CPU time = 594 ms, elapsed time = 613 ms.
    Third (XML) : CPU time = 2812 ms, elapsed time = 1418 ms.
    Fourth (CTE2) : CPU time = 719 ms, elapsed time = 750 ms.

  • 2020-12-10 23:19

    A SQL Server Solution

    WITH T ([Data], [Mail])
         AS (SELECT 1,'m1@gmail.com,m2@hotmail.com' UNION ALL
             SELECT 2,'m2@hotmail.com,m3@test.com')
    SELECT address  AS Mail,
           COUNT(*) AS [Count]
    FROM   T
           CROSS APPLY (SELECT CAST('<m>' + REPLACE([Mail], ',', '</m><m>') + '</m>'
                                    AS XML
                               ) AS x) ca1
           CROSS APPLY (SELECT T.split.value('.', 'varchar(200)') AS address
                        FROM   x.nodes('/m') T(split)) ca
    GROUP  BY address  
    
  • 2020-12-10 23:23

    String splitting is faster using only CHARINDEX, without XML or a CTE.

    Sample table

    create table #tmp ([Data] int, [Mail] varchar(200))
    insert #tmp SELECT 1,'m1@gmail.com,m2@hotmail.com,other, longer@test, fifth'
    UNION ALL   SELECT 2,'m2@hotmail.com,m3@test.com'
    UNION ALL   SELECT 3,'m3@single.com'
    UNION ALL   SELECT 4,''
    UNION ALL   SELECT 5,null
    

    The query

    select single, count(*) [Count]
    from
    (
        select ltrim(rtrim(substring(t.mail, v.number+1,
            isnull(nullif(charindex(',',t.mail,v.number+1),0)-v.number-1,200)))) single
        from #tmp t
        inner join master..spt_values v on v.type='p'
            and v.number <= len(t.Mail)
            and (substring(t.mail,v.number,1) = ',' or v.number=0)
    ) X
    group by single
    

    The only parts you supply are:

    • #tmp: your table name
    • Mail: the column name
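    For intuition about what the query is doing (not part of the answer): a rough Python model of the CHARINDEX arithmetic, where a plain range plays the role of the spt_values numbers table — a token starts at position 0 and just after every comma, and runs to the next comma.

```python
from collections import Counter

def split_by_position(mail):
    # Model of the spt_values/CHARINDEX split above: a token starts at
    # position 0 and just after every comma, and runs to the next comma
    # (or end of string); tokens are trimmed like LTRIM(RTRIM(...)).
    if mail is None:                   # NULL rows drop out of the JOIN entirely
        return []
    tokens = []
    for i in range(len(mail) + 1):     # the "numbers table"
        if i == 0 or mail[i - 1] == ',':
            j = mail.find(',', i)
            if j == -1:
                j = len(mail)
            tokens.append(mail[i:j].strip())
    return tokens

# The sample rows from the answer above.
rows = [
    (1, 'm1@gmail.com,m2@hotmail.com,other, longer@test, fifth'),
    (2, 'm2@hotmail.com,m3@test.com'),
    (3, 'm3@single.com'),
    (4, ''),
    (5, None),
]
counts = Counter(s for _, mail in rows for s in split_by_position(mail))
print(counts.most_common())
```

    Note that, like the SQL version, the empty-string row contributes one empty token while the NULL row contributes nothing.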
  • 2020-12-10 23:28

    Very similar to Mikael's answer, with minor tweaks:
    - Carry a 'cached' LEN field along so the length doesn't have to be recounted on every recursion
    - Use only one UNION branch per recursion by replacing a 0 from CHARINDEX with NULL

    These differences will only show noticeably for long lists, and hence over several levels of recursion.

    The CROSS APPLY business is just to keep the SELECT tidy, rather than repeating the NULLIF(CHARINDEX(...)) loads of times.


    WITH
      source (
        Data,
        Mail
      )
    AS
    (
      SELECT 1,'m1@gmail.com,m2@hotmail.com' UNION ALL
      SELECT 2,'m2@hotmail.com,m3@test.com'
    )
    ,
      split_cte
    AS
    (
      SELECT
        LEFT (mail, ISNULL(comma - 1, LEN(mail)))     AS "current_mail",
        RIGHT(mail, ISNULL(LEN(mail) - comma, 0))     AS "mail_data",
        ISNULL(LEN(mail) - comma, 0)                  AS "chars"
      FROM
        source
      CROSS APPLY
        (SELECT NULLIF(CHARINDEX(',', mail), 0) AS "comma") AS search
    
      UNION ALL
    
      SELECT
        LEFT (mail_data, ISNULL(comma - 1, chars))    AS "current_mail",
        RIGHT(mail_data, ISNULL(chars - comma, 0))    AS "mail_data",
        ISNULL(chars - comma, 0)                      AS "chars"
      FROM
        split_cte
      CROSS APPLY
        (SELECT NULLIF(CHARINDEX(',', mail_data), 0) AS "comma") AS search
      WHERE
        chars > 0
    )
    
    SELECT
      current_mail     AS "Mail",
      COUNT(*)         AS "Count"
    FROM
      split_cte
    GROUP BY
      current_mail
    