Select distinct values from multiple columns in same table

[愿得一人] 2021-01-01 10:54

I am trying to construct a single SQL statement that returns unique, non-null values from multiple columns all located in the same table.

 SELECT distinct tb         


        
4 Answers
  •  無奈伤痛
    2021-01-01 11:18

    It's better to include code in your question, rather than ambiguous text data, so that we are all working with the same data. Here is the sample schema and data I have assumed:

    CREATE TABLE tbl_data (
      id INT NOT NULL,
      code_1 CHAR(2),
      code_2 CHAR(2)
    );
    
    INSERT INTO tbl_data (
      id,
      code_1,
      code_2
    )
    VALUES
      (1, 'AB', 'BC'),
      (2, 'BC', NULL),
      (3, 'DE', 'EF'),
      (4, NULL, 'BC');
    

    As Blorgbeard commented, the DISTINCT clause in your solution is unnecessary because the UNION operator already eliminates duplicate rows. There is also a UNION ALL operator, which does not eliminate duplicates, but it is not appropriate here.
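
    To make the difference concrete, here is a small sketch against the sample data assumed above. It uses only the tbl_data table from this answer; the column alias is just for readability, and row order is not guaranteed without an ORDER BY.

    ```sql
    -- UNION ALL keeps duplicates. With the four sample rows above this
    -- returns six values, and 'BC' appears three times
    -- (once from code_1 and twice from code_2).
    SELECT code_1 AS code
    FROM tbl_data
    WHERE code_1 IS NOT NULL
    UNION ALL
    SELECT code_2
    FROM tbl_data
    WHERE code_2 IS NOT NULL;
    ```

    Replacing UNION ALL with UNION collapses those six values to the four distinct ones: AB, BC, DE and EF.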

    Rewriting your query without the DISTINCT clause is a fine solution to this problem:

    SELECT code_1
    FROM tbl_data
    WHERE code_1 IS NOT NULL
    UNION
    SELECT code_2
    FROM tbl_data
    WHERE code_2 IS NOT NULL;
    

    It doesn't matter that the two columns are in the same table. The solution would be the same even if the columns were in different tables.
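
    As a sketch of that point, the same pattern works across two separate tables. The tables tbl_a and tbl_b are hypothetical names I made up for illustration; they are not part of the question.

    ```sql
    -- Hypothetical tables, for illustration only.
    CREATE TABLE tbl_a (code CHAR(2));
    CREATE TABLE tbl_b (code CHAR(2));

    -- Same shape as the single-table query: filter each branch,
    -- then let UNION remove the duplicates.
    SELECT code
    FROM tbl_a
    WHERE code IS NOT NULL
    UNION
    SELECT code
    FROM tbl_b
    WHERE code IS NOT NULL;
    ```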

    If you don't like the redundancy of specifying the same filter clause twice, you can encapsulate the union query in a virtual table before filtering that:

    SELECT code
    FROM (
      SELECT code_1
      FROM tbl_data
      UNION
      SELECT code_2
      FROM tbl_data
    ) AS DistinctCodes (code)
    WHERE code IS NOT NULL;
    

    I find the syntax of the second query uglier, but it is logically neater. But which one performs better?

    I created a SQL Fiddle that demonstrates that the query optimizer of SQL Server 2005 produces the same execution plan for both queries: two table scans, a concatenation, a distinct sort, and a select.

    If SQL Server generates the same execution plan for two queries, then they are practically as well as logically equivalent.
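
    If you would rather check the plans yourself than rely on a fiddle, one option in SQL Server is SET SHOWPLAN_TEXT, which returns the estimated plan as text instead of running the query. This is a rough sketch, assuming a client such as sqlcmd or Management Studio that understands the GO batch separator.

    ```sql
    -- SET SHOWPLAN_TEXT must be the only statement in its batch.
    SET SHOWPLAN_TEXT ON;
    GO

    -- Plan for the query that filters in each branch.
    SELECT code_1
    FROM tbl_data
    WHERE code_1 IS NOT NULL
    UNION
    SELECT code_2
    FROM tbl_data
    WHERE code_2 IS NOT NULL;
    GO

    -- Plan for the query that filters once, outside the derived table.
    SELECT code
    FROM (
      SELECT code_1
      FROM tbl_data
      UNION
      SELECT code_2
      FROM tbl_data
    ) AS DistinctCodes (code)
    WHERE code IS NOT NULL;
    GO

    SET SHOWPLAN_TEXT OFF;
    GO
    ```

    While the setting is on, the statements are not executed; each one returns its plan text, so the two outputs can be compared side by side.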

    Compare the above to the execution plan for the query in your question:

    The DISTINCT clause makes SQL Server 2005 perform a redundant sort operation, because the query optimizer does not know that any duplicates filtered out by the DISTINCT in the first query would be filtered out by the UNION later anyway.

    This query is logically equivalent to the other two, but the redundant operation makes it less efficient. On a large data set, I would expect your query to take longer to return a result set than the two here. Don't take my word for it; experiment in your own environment to be sure!
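
    One way to run that experiment in SQL Server, as a rough sketch: turn on the I/O and timing statistics, run the candidates back to back, and compare the reported reads and elapsed times. The first query below is only my assumption about the shape of the query in your question (a DISTINCT in the first branch of the UNION), so substitute your real one.

    ```sql
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Candidate 1: assumed shape of the query from the question.
    SELECT DISTINCT code_1
    FROM tbl_data
    WHERE code_1 IS NOT NULL
    UNION
    SELECT code_2
    FROM tbl_data
    WHERE code_2 IS NOT NULL;

    -- Candidate 2: the same query without the redundant DISTINCT.
    SELECT code_1
    FROM tbl_data
    WHERE code_1 IS NOT NULL
    UNION
    SELECT code_2
    FROM tbl_data
    WHERE code_2 IS NOT NULL;

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;
    ```

    On the four-row sample table any difference will be too small to see, so run this against a table with a realistic number of rows.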
