Question
I have a table transcription, which contains passages of transcribed text and their citations, with columns:
text, transcription_id(PK), t_notes, citation
and a second table, town_transcription, which is the relationship table that links places (from another table) referenced in the text to that transcription record. This table has the columns:
town_id(FK), transcription_id(FK), confidence_interval
Many of these passages of text reference multiple towns, but stupidly I just duplicated records and linked them individually to each town. I have identified the duplicate rows of text using the following SQL query:
SELECT * FROM transcription aa
WHERE (select count(*) from transcription bb
WHERE (bb.text = aa.text) AND (bb.citation = aa.citation)) > 1
ORDER BY text ASC;
I now have about 2000 rows (2 to 6 duplicates of some text passages) where I need to delete the extra transcription_ids from the transcription table and change the transcription_id in the relationship table, town_transcription, to point to the remaining, now unique, transcription record. From reading other questions, I think using UPDATE ... FROM and INNER JOIN might be necessary, but I really don't know how to implement this. I'm just a beginner; thanks for any help.
Answer 1:
Use row_number() over(...) to identify rows that repeat information. A partition by text, citation in the over clause will force the row number series to restart at 1 for each unique set of those values:
select
*
from (
select
text, transcription_id, t_notes, citation
, row_number() over(partition by text, citation
order by transcription_id) as rn
from transcription
) d
where rn > 1
Once you have verified those as the unwanted rows, use the same logic for a delete statement.
However, you may lose information held in the t_notes column - are you willing to do that?
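As a rough sketch of what "the same logic for a delete statement" could look like in PostgreSQL: the statement below reuses the row_number() subquery and joins it back to the table by transcription_id. It assumes the rows in town_transcription that point at the duplicates have already been re-pointed to the surviving row; with a default foreign key, the delete would otherwise be rejected. The first query is an optional check for groups whose copies carry different notes; it assumes t_notes is a text column and treats NULL and an empty string as the same.

```sql
-- Optional check: duplicate groups whose copies do not all carry the same note.
-- Assumption: t_notes is a text column; NULL is treated like an empty string.
SELECT text, citation,
       count(*) AS copies,
       count(DISTINCT coalesce(t_notes, '')) AS distinct_notes
FROM transcription
GROUP BY text, citation
HAVING count(*) > 1
   AND count(DISTINCT coalesce(t_notes, '')) > 1;

-- Delete every row except the one with the smallest transcription_id per group.
-- Assumption: town_transcription no longer references the rows being deleted.
DELETE FROM transcription t
USING (
    SELECT transcription_id,
           row_number() OVER (PARTITION BY text, citation
                              ORDER BY transcription_id) AS rn
    FROM transcription
) d
WHERE d.transcription_id = t.transcription_id
  AND d.rn > 1;
```

Running the delete inside a transaction (BEGIN ... ROLLBACK) on a copy of the data first is a cheap way to see how many rows it would remove.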
Answer 2:
This single command should do it all:
WITH blacklist AS ( -- identify duplicate IDs and their master
SELECT *
FROM (
SELECT transcription_id
, min(transcription_id) OVER (PARTITION BY text, citation) AS master_id
FROM transcription
) sub
WHERE transcription_id <> master_id
)
, upd AS ( -- redirect referencing rows
UPDATE town_transcription tt
SET transcription_id = b.master_id
FROM blacklist b
WHERE b.transcription_id = tt.transcription_id
)
DELETE FROM transcription t -- kill dupes (now without reference)
USING blacklist b
WHERE b.transcription_id = t.transcription_id;
For lack of definition I chose the row with the smallest ID per group as the surviving master row.
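Before running the command, it may help to look at the mapping it will act on. The query below is just the blacklist subquery from the command above run on its own as a read-only SELECT, with an ORDER BY added for readability; it changes nothing.

```sql
-- Read-only preview: each duplicate ID next to the master ID it would be merged into.
SELECT transcription_id, master_id
FROM (
    SELECT transcription_id,
           min(transcription_id) OVER (PARTITION BY text, citation) AS master_id
    FROM transcription
) sub
WHERE transcription_id <> master_id
ORDER BY master_id, transcription_id;
```

The number of rows returned should match the number of rows the final DELETE removes.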
FK constraints don't get in the way unless you have non-default settings. Detailed explanation:
- How to remove duplicate rows with foreign keys dependencies?
- Delete duplicates and reroute referencing rows to new master
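One case worth checking first, under an assumption the question does not confirm: if town_transcription has a primary key or unique constraint on (town_id, transcription_id), and the same town was linked to two copies of the same passage, the UPDATE would produce two identical pairs and fail with a unique violation. A minimal sketch of a read-only check for such collisions:

```sql
-- Read-only check: (town, master) pairs that would appear more than once
-- after all links are redirected to the master row.
-- Assumption: a unique constraint on (town_id, transcription_id) may exist.
WITH mapped AS (
    SELECT tt.town_id, m.master_id
    FROM town_transcription tt
    JOIN (
        SELECT transcription_id,
               min(transcription_id) OVER (PARTITION BY text, citation) AS master_id
        FROM transcription
    ) m ON m.transcription_id = tt.transcription_id
)
SELECT town_id, master_id, count(*) AS links
FROM mapped
GROUP BY town_id, master_id
HAVING count(*) > 1;
```

If this returns rows, the redundant links would need to be removed before (or as part of) the redirect. If there is no such constraint, the UPDATE succeeds but leaves repeated links behind.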
After removing the dupes you might now want to add a UNIQUE constraint to prevent the same error from reoccurring:
ALTER TABLE transcription
ADD CONSTRAINT transcription_uni UNIQUE (text, citation);
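Adding the constraint fails if any duplicate (text, citation) pair is still present, so a quick check beforehand can save a confusing error. This is the plain GROUP BY form of the duplicate search; it should return no rows once the cleanup is done.

```sql
-- Should return zero rows before the UNIQUE constraint is added.
SELECT text, citation, count(*) AS copies
FROM transcription
GROUP BY text, citation
HAVING count(*) > 1;
```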
Source: https://stackoverflow.com/questions/53366008/remove-duplicates-from-a-table-and-re-link-referencing-rows-to-the-new-master