Remove duplicates from a table and re-link referencing rows to the new master

Question

I have a table transcription which contains passages of transcribed text and their citations with columns:

text, transcription_id(PK), t_notes, citation

and the second table town_transcription being the relationship table that links places (from another table) referenced in the text to that transcription record. This table has the columns:

town_id(FK), transcription_id(FK), confidence_interval

Many of these passages of text reference multiple towns, but stupidly I just duplicated records and linked them individually to each town. I have identified the duplicate rows of text using the following SQL query:

SELECT * FROM transcription aa
WHERE (select count(*) from transcription bb
WHERE (bb.text = aa.text) AND (bb.citation = aa.citation)) > 1
ORDER BY text ASC;

I now have about 2000 rows (2 to 6 duplicates of some text passages). I need to delete the extra rows from the transcription table and change the transcription_id in the relationship table, town_transcription, to point to the remaining, now unique, transcription record. From reading other questions, I think UPDATE ... FROM and INNER JOIN might be necessary, but I really don't know how to implement this. I'm just a beginner; thanks for any help.

Answer 1:

Use row_number() over(...) to identify rows that repeat information. A partition by text, citation in the over clause will force the row number series to re-start at 1 for each unique set of those values:

select
     *
from (
       select
              text, transcription_id, t_notes, citation
            , row_number() over(partition by text, citation 
                                order by transcription_id) as rn
       from transcription 
     ) d
where rn > 1

Once you have verified that those are the unwanted rows, use the same logic for a delete statement.
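
A minimal sketch of what such a delete could look like in PostgreSQL, reusing the same row_number() logic. It assumes that rows in town_transcription pointing at the doomed IDs have already been redirected; otherwise a default foreign key constraint would reject the delete.

```sql
-- Sketch only: remove every row that is not the first of its (text, citation) group.
-- Assumption: referencing rows in town_transcription were re-linked beforehand.
DELETE FROM transcription t
USING (
   SELECT transcription_id
        , row_number() OVER (PARTITION BY text, citation
                             ORDER BY transcription_id) AS rn
   FROM   transcription
   ) d
WHERE  d.transcription_id = t.transcription_id
AND    d.rn > 1;
```

Running it inside a transaction (BEGIN; ... ROLLBACK;) first is one way to see the affected row count without committing anything.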

However, you may lose information held in the t_notes column. Are you willing to do that?
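
One way to gauge that risk is to list the duplicates whose t_notes differ from the notes of the row that would survive. This is only a sketch; it assumes the row with the smallest transcription_id per group is the one being kept.

```sql
-- Sketch: duplicates whose notes would be lost when only the first row per group is kept.
SELECT transcription_id, text, citation, t_notes, master_notes
FROM  (
   SELECT transcription_id, text, citation, t_notes
        , first_value(t_notes) OVER (PARTITION BY text, citation
                                     ORDER BY transcription_id) AS master_notes
        , row_number()         OVER (PARTITION BY text, citation
                                     ORDER BY transcription_id) AS rn
   FROM   transcription
   ) d
WHERE  rn > 1
AND    t_notes IS DISTINCT FROM master_notes;  -- NULL-safe comparison
```

If this returns no rows, the duplicates carry no notes beyond what the surviving row already has.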

Answer 2:

This single command should do it all:

WITH blacklist AS (  -- identify duplicate IDs and their master
   SELECT *
   FROM  (
      SELECT transcription_id
           , min(transcription_id) OVER (PARTITION BY text, citation) AS master_id
      FROM   transcription
      ) sub
   WHERE  transcription_id <> master_id
   )
, upd AS (  -- redirect referencing rows
   UPDATE town_transcription tt
   SET    transcription_id = b.master_id
   FROM   blacklist b
   WHERE  b.transcription_id = tt.transcription_id
   )
DELETE FROM transcription t  -- kill dupes (now without reference)
USING  blacklist b
WHERE  b.transcription_id = t.transcription_id;

Since the question does not define which row should survive, I chose the row with the smallest ID per group as the surviving master row.
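
Before running the command, it may be worth checking whether redirecting the links would make two rows in town_transcription identical. The query below is a sketch under an assumption the question does not confirm: that town_transcription has a primary key or unique constraint on (town_id, transcription_id). If so, any row returned here would make the UPDATE fail with a unique violation.

```sql
-- Sketch: (town, master) pairs that would end up linked more than once after re-linking.
SELECT tt.town_id, m.master_id, count(*) AS links
FROM   town_transcription tt
JOIN  (
   SELECT transcription_id
        , min(transcription_id) OVER (PARTITION BY text, citation) AS master_id
   FROM   transcription
   ) m USING (transcription_id)
GROUP  BY tt.town_id, m.master_id
HAVING count(*) > 1;
```

An empty result means every town would still be linked to each surviving transcription at most once.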

FK constraints don't get in the way unless you have non-default settings. Detailed explanation:

How to remove duplicate rows with foreign keys dependencies?
Delete duplicates and reroute referencing rows to new master

After removing the dupes you might now want to add a UNIQUE constraint to prevent the same error from reoccurring:

ALTER TABLE transcription
ADD CONSTRAINT transcription_uni UNIQUE (text, citation);
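
Adding the constraint fails if any duplicate group remains, so a quick check beforehand can help. A minimal sketch:

```sql
-- Sketch: any (text, citation) group that still has more than one row.
SELECT text, citation, count(*) AS copies
FROM   transcription
GROUP  BY text, citation
HAVING count(*) > 1;
```

Note that GROUP BY puts NULL citations into one group, while a default UNIQUE constraint treats NULL values as distinct, so rows with a NULL citation could still be duplicated after the constraint is in place.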

Source: https://stackoverflow.com/questions/53366008/remove-duplicates-from-a-table-and-re-link-referencing-rows-to-the-new-master

Tags

sql

postgresql

duplicates

common-table-expression