I am using `bulk_create` to load thousands of rows into a PostgreSQL DB. Unfortunately, some of the rows are causing IntegrityError and stopping the `bulk_create` process. I was wondering if there is a way to tell Django to ignore such rows and save as much of the batch as possible.
(Note: I don't use Django, so there may be more suitable framework-specific answers)
It is not possible for Django to do this by simply ignoring `INSERT` failures because PostgreSQL aborts the whole transaction on the first error.
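You can reproduce this by hand; assuming an illustrative table `realtable(id, val)` with a primary key on `id`:

```sql
BEGIN;
INSERT INTO realtable VALUES (1, 'a');  -- ok
INSERT INTO realtable VALUES (1, 'b');  -- unique violation: the transaction is now aborted
INSERT INTO realtable VALUES (2, 'c');  -- rejected: "current transaction is aborted"
COMMIT;  -- PostgreSQL rolls back instead of committing
```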
Django would need one of these approaches:
1. `INSERT` each row in a separate transaction and ignore errors (very slow);
2. create a `SAVEPOINT` before each insert and roll back to it on error (can have scaling problems);
3. use a procedure or query to insert only if the row doesn't already exist (complicated and slow); or
4. bulk-insert or (better) `COPY` the data into a `TEMPORARY` table, then merge that into the main table server-side.

The upsert-like approach (3) seems like a good idea, but upsert and insert-if-not-exists are surprisingly complicated.
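To see why, here is a minimal sketch of the insert-if-not-exists query for approach (3), using the same illustrative `realtable(id, val)` schema. It looks correct, but two concurrent sessions can both pass the `NOT EXISTS` check, and one will still fail with a unique violation:

```sql
-- Naive insert-if-not-exists: racy under concurrent writers.
INSERT INTO realtable (id, val)
SELECT 42, 'example'
WHERE NOT EXISTS (SELECT 1 FROM realtable WHERE id = 42);
```

(On PostgreSQL 9.5 and later, `INSERT ... ON CONFLICT DO NOTHING` handles this particular case natively, but the merge approach below also works on older versions.)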
Personally, I'd take (4): I'd bulk-insert into a new separate table, probably `UNLOGGED` or `TEMPORARY`, then I'd run some manual SQL to:
```sql
BEGIN;

-- Blocks concurrent writers, but not concurrent readers.
LOCK TABLE realtable IN EXCLUSIVE MODE;

INSERT INTO realtable
SELECT * FROM temptable WHERE NOT EXISTS (
    SELECT 1 FROM realtable WHERE temptable.id = realtable.id
);

COMMIT;
```
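For completeness, a sketch of creating and loading the staging table first; the CSV path is illustrative, and from a client you would typically use psql's `\copy` rather than server-side `COPY`:

```sql
-- Same columns as the target; UNLOGGED skips WAL for a faster load
-- (a TEMPORARY table works too, but vanishes at the end of the session).
CREATE UNLOGGED TABLE temptable (LIKE realtable);

COPY temptable FROM '/tmp/rows.csv' WITH (FORMAT csv);  -- path is illustrative
```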
The `LOCK TABLE ... IN EXCLUSIVE MODE` prevents a concurrent insert from creating a conflicting row and making the statement above fail. It does not block concurrent `SELECT`s, only `SELECT ... FOR UPDATE`, `INSERT`, `UPDATE`, and `DELETE`, so reads from the table carry on as normal.
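The lock is released at the end of the transaction that takes it, which is why the block above wraps everything in a single `BEGIN ... COMMIT`. To illustrate the concurrency behaviour, while that transaction holds the lock a second session sees:

```sql
SELECT count(*) FROM realtable;          -- runs immediately
INSERT INTO realtable VALUES (99, 'x');  -- waits until the merge commits
```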
If you can't afford to block concurrent writes for too long, you could instead use a writable CTE to copy ranges of rows from `temptable` into `realtable`, retrying each block if it fails.
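A sketch of that variant, assuming the illustrative `realtable(id, val)` schema again. Because the `DELETE` and the `INSERT` form a single statement, a unique violation rolls the whole block back, leaving the rows in `temptable` so the block can simply be retried:

```sql
-- Move one block of rows (ids 1..1000 here; the range is illustrative)
-- from the staging table into the real table, skipping existing ids.
WITH block AS (
    DELETE FROM temptable
    WHERE id >= 1 AND id < 1001
    RETURNING *
)
INSERT INTO realtable
SELECT * FROM block b
WHERE NOT EXISTS (SELECT 1 FROM realtable r WHERE r.id = b.id);
```

Each block holds only short-lived row locks, so concurrent writers are blocked far more briefly than with the `LOCK TABLE` approach.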