Duplicates removal using Group By, Rank, Row_Number

问题

I have two tables. One is CustomerOrders and the other is OrderCustomerRef - lookup table.

Both tables have one-to-many relationship - one customer may be associated with multiple orders.

CustomerOrders table has duplicate Customers (same LName, FName, Email). But they have different Cust_IDs.

I need to merge all duplicate contacts in the base Customer table (one-to-one). (this table is not shown here).

Step 1:

Need to find out which Cust_ID should be merged into which corresponding duplicate Customer(s) (same LName, FName, Email). A Contact with latest Order_Date should win over it's corresponding duplicate counterpart (Customer). An exception will be for VIP Customers - they should always be the winning ones regardless of an Order_Date.

Step 2: Updated OrderCustomerRef table: replace all losing duplicate Cust_IDs with the winning Cust_IDs.

Step 3: Delete all losing Contacts from the base Customer table (no in the current scope. I will do it myself).

IF OBJECT_ID('tempdb..#table') IS NOT NULL
DROP TABLE #table;

IF OBJECT_ID('tempdb..#CustomerOrders') IS NOT NULL
DROP TABLE #CustomerOrders;

IF OBJECT_ID('tempdb..#OrderCustomerRef') IS NOT NULL
DROP TABLE #OrderCustomerRef;

CREATE TABLE #CustomerOrders 
(
[PK_ID] INT NOT NULL PRIMARY KEY IDENTITY(1,1),
Cust_ID INT NOT NULL, 
LName VARCHAR(100) NULL, 
FName VARCHAR(100) NULL, 
[Customer_E-mail] VARCHAR(100) NULL,
Order_Date DATETIME NULL,
Customer_Source VARCHAR(100) NULL,
CustomerType VARCHAR(100) NULL
)

INSERT INTO #CustomerOrders (Cust_ID, LName, FName, [Customer_E-mail], Order_Date, Customer_Source, CustomerType)
VALUES 
(1, 'John', 'Smith', 'JSmith@email.com', '2018-11-10 01:40:55.150', 'XYZ Company', 'Regular'),
(2, 'John', 'Smith', 'JSmith@email.com', '2018-10-10 05:05:55.150', 'Internet', 'VIP'),
(3, 'Adam', 'Burns', 'ABurns@email.com', '2017-05-05 00:00:00.000', 'XYZ Company','Regular'),
(3, 'Adam', 'Burns', 'ABurns@email.com', '2017-05-05 00:00:00.000', 'XYZ Company','VIP'),
(4, 'Adam', 'Burns', 'ABurns@email.com', '2017-05-05 00:00:00.000', 'Internet','Regular'),
(5, 'Adam', 'Burns', 'ABurns@email.com', '2017-05-05 00:00:00.000', 'Internet','VIP'),
(6, 'James', 'Snatcher', 'JSnatcher@email.com', '2019-07-07 00:00:00.000', 'XYZ Company', 'Regular'),
(7, 'James', 'Snatcher', 'JSnatcher@email.com', '2019-07-07 00:00:00.000', 'Internet','Regular'),
(9, 'Thomas', 'Johnson', 'TJohnson@email.com', '2016-05-01 00:00:00.000', 'Internet','Regular'),
(9, 'Thomas', 'Johnson', 'TJohnson@email.com', '2015-04-01 00:00:00.000', 'Internet','Regular'),
(10, 'Thomas', 'Johnson', 'TJohnson@email.com', '2014-03-01 00:00:00.000', 'Internet','Regular'),
(11, 'Thomas', 'Johnson', 'TJohnson@email.com', '2013-02-01 00:00:00.000', 'XYZ Company','Regular'),
(12, 'Peter', 'McDonald', 'PMcDonald@email.com', '2013-02-01 00:00:00.000', 'XYZ Company','Regular'),
(13, 'Jose', 'Mainster', 'JMainster@email.com', '2013-02-01 00:00:00.000', 'Internet','Regular'),
(14, 'Kevin', 'Digginton', 'KDigginton@email.com', '2013-02-01 00:00:00.000', 'Internet','Regular'),
(14, 'Kevin', 'Digginton', 'KDigginton@email.com', '2015-09-03 00:00:00.000', 'Internet','Regular')

CREATE TABLE #OrderCustomerRef
(
    Raw_PK INT NOT NULL PRIMARY KEY IDENTITY(1,1),
    OrderID INT NOT NULL, 
    Cust_ID INT NULL, 
    OrderType VARCHAR(100) NULL
)

    INSERT INTO #OrderCustomerRef (OrderID, Cust_ID, OrderType)
    VALUES 
    (1,1,'Online'),
    (2,2,'Online'),
    (3,3,'Online'),
    (4,3,'Online'),
    (5,4,'In Store'),
    (6,5,'Online'),
    (7,6,'Online'),
    (8,7,'In Store'),
    (9,9,'Online'),
    (10,9,'Online'),
    (11,10,'In Store'),
    (12,11,'Online'),
    (13,12,'Online'),
    (14,13,'Online'),
    (15,14,'Online'),
    (16,14,'In Store')

    -- SELECT * FROM #OrderCustomerRef

    SELECT *,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail], Customer_Source ORDER BY Order_Date DESC) AS Rank_1,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail], Customer_Source ORDER BY Order_Date, CustomerType DESC ) AS Rank_CustType,
    RANK() OVER (PARTITION BY Cust_ID, FName, LName, [Customer_E-mail], Customer_Source ORDER BY Order_Date, CustomerType DESC ) AS Rank_CustID,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail] ORDER BY Order_Date DESC) AS Rank_2,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail] ORDER BY Cust_ID) AS Rank_3
    FROM #CustomerOrders

DESIRED OUTPUT SHOULD LOOK LIKE:

*exception: - losing Customer IDs 1, 3 (should be winning, but since there is a duplicate counterpart it's a VIP it's losing) - winning Customer IDs 2, 5 (because it's a VIP, subject to exception)

Eg.: All occurences of Cust_ID of John Smith with Cust_ID of 1 in the ##OrderCustomerRef should be replaced with John Smith with Cust_ID of 2, all occurances of Cust_ID of Adam Burns with Cust_ID of 3 should be replaced with Adam Burns with Cust_ID of 5

general rule: - losing Customer IDs 7, 10, 11, 4 - winning Customer IDs 6, 9, 12, 13, 14

Eg.: All occurences of Cust_ID of 7 in the ##OrderCustomerRef should be replaced with 6, all occurances of Cust_ID of 10 should be replaced with 9*

Eventually I should have only Customer IDs 6, 9, 12, 13, 14, 2, 5 in the ##OrderCustomerRef table

Using Rank_CustType_1, column_1, column_2 I can figure out Step 1. But I still have a problem with Step 2 - updating OrderCustomerRef table as such: all losing Cust_IDs should be replaced with corresponding duplicate winning Cust_IDs.

I've tried this. But that still does not replace losing Cust_ID.

SELECT *,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail] ORDER BY Order_Date, CustomerType DESC) AS Rank_CustType_1,
    RANK() OVER (PARTITION BY FName, LName, [Customer_E-mail] ORDER BY Cust_ID) AS Rank_3
INTO #table
FROM #CustomerOrders

; with cte as (
    select Cust_ID, FName, LName, [Customer_E-mail], max(t.Rank_CustType_1) as Rank_CustType_1
    ,(select distinct Cust_ID from #table a where a.Cust_ID = t.Cust_ID and Rank_3 = 1) column_1
    ,(select distinct Cust_ID from #table a where a.Cust_ID = t.Cust_ID and Rank_3 <> 1) column_2

from #table t
group by Cust_ID, FName, LName, [Customer_E-mail]
    )

    update b
    set Cust_ID = case  
    when b.Cust_ID = cte.Cust_ID and
     b.Cust_ID = ISNULL(cte.column_1,'') and Rank_CustType_1 != 1 then b.Cust_ID 
    when b.Cust_ID = cte.Cust_ID and
     b.Cust_ID = ISNULL(cte.column_2,'') and Rank_CustType_1 != 1 then cte.column_2     
    when b.Cust_ID = cte.Cust_ID and Rank_CustType_1 = 1 and cte.column_1 is null and cte.column_2 is not null then cte.column_2
    when b.Cust_ID = cte.Cust_ID and Rank_CustType_1 = 1 and cte.column_1 is not null and cte.column_2 is null then cte.column_1 
    end  
    from #OrderCustomerRef b
    inner join cte on b.Cust_ID = cte.Cust_ID;

    select * from #OrderCustomerRef;

回答1:

Based on what information you provided, I used the following CTE to show the results that look to get what you want:

WITH DaCTE -- To rank the existing rows
AS  (
        SELECT pk_ID
            , cust_ID
            , fname
            , lname
            , [customer_e-mail]
            , Order_Date
            , Customer_Source
            , customertype
            , ROW_NUMBER() OVER (PARTITION BY fname, lname, [customer_e-mail] ORDER BY customertype DESC, order_date DESC, cust_id) as RankYo -- Orders by the criteria provided but while you suggested 3 should lose to 5, they have the same criteria so either one could win based on ordering
        FROM #customerorders
    )
, NewSource -- To show winning Customer ID next to Original ID
AS  (
        SELECT co.pk_ID
            , DaCTE.cust_ID as NewCustomerID
            , co.cust_ID as OriginalCustomerID
            , co.fname
            , co.lname
            , co.[customer_e-mail]
            , co.Order_Date
            , co.Customer_Source
            , co.customertype
        FROM DaCTE
        INNER JOIN #CustomerOrders as co
            ON co.fname = DaCTE.FName
            AND co.lname = DaCTE.LName
            AND co.[customer_e-mail] = DaCTE.[Customer_E-mail]
        WHERE DaCTE.RankYo = 1 -- filter to show only the winning IDs based on resulting rank from previous CTE
    )
SELECT *
/*UPDATE ocr --commented out so you can see the results before running update
SET ocr.Cust_ID = ns.NewCustomerID*/
FROM #OrderCustomerRef as ocr
INNER JOIN NewSource as ns
    ON ns.OriginalCustomerID = ocr.Cust_ID

回答2:

you can try this using CTE(Common Table Expression) as explained Antoine Hernandez in the above answer and you also can remove duplicates from the table using UNION And EXCEPT Operator.

i.g Using EXCEPT Operator

SELECT * FROM  #customerorders
EXCEPT
SELECT * FROM  #customerorders WHERE 1=0

i.g Using UNION Operator

SELECT * FROM  #customerorders
UNION
SELECT * FROM  #customerorders WHERE 1=0

For More Information about How to Remove Duplicates Using CTE Please follow this link : Remove Duplicates Using CTE

For More Information about How to Remove Duplicates Using UNION And EXCEPT Operator Please follow this link : Remove Duplicates Using UNION And EXCEPT Operator

来源：https://stackoverflow.com/questions/58775282/duplicates-removal-using-group-by-rank-row-number

标签

sql-server

group-by

duplicates

ranking

row-number