Postgresql batch insert or ignore

I have the responsibility of switching our code from sqlite to postgres. One of the queries I am having trouble with is copied below.

INSERT INTO group_phones(group_id, phone_name)
SELECT g.id, p.name 
FROM phones AS p, groups as g
WHERE g.id IN ($add_groups) AND p.name IN ($phones);

The problem arises when there is a duplicate record. In this table the combination of both values must be unique. I have used a few plpgsql functions in other places to do update-or-insert operations, but in this case I can do several inserts at once. I am not sure how to write a stored routine for this. Thanks for all the help from all the sql gurus out there!

Erwin Brandstetter

There are 3 challenges.

Your query has no JOIN condition between the tables phones and groups, making this effectively a limited CROSS JOIN - which you most probably do not intend. I.e. every phone that qualifies is combined with every group that qualifies. If you have 100 phones and 100 groups that's already 10,000 combinations.
Insert distinct combinations of (group_id, phone_name)
Avoid inserting rows that are already there in table group_phones .

All things considered it could look like this:

INSERT INTO group_phones(group_id, phone_name)
SELECT i.id, i.name
FROM  (
    SELECT DISTINCT g.id, p.name -- get distinct combinations
    FROM   phones p
    JOIN   groups g ON ??how are p & g connected??
    WHERE  g.id IN ($add_groups)
    AND    p.name IN ($phones)
    ) i
LEFT   JOIN group_phones gp ON (gp.group_id, gp.phone_name) = (i.id, i.name)
WHERE  gp.group_id IS NULL  -- avoid duping existing rows

Concurrency

This form minimizes the chance of a race condition with concurrent write operations. If your table has heavy concurrent write load, you may want to lock the table exclusively or use serializable transaction isolation, This safeguard against the extremely unlikely case that a row is altered by a concurrent transaction in the tiny time slot between the constraint verification (row isn't there) and the write operation in the query.

BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT ...
COMMIT;

Be prepared to repeat the transaction if it rolls back with a serialization error. For more on that topic good starting points could be this blog post by @depesz or this related question on SO.

Normally, though, you needn't even bother with any of this.

Performance

LEFT JOIN tbl ON right_col = left_col WHERE right_col IS NULL

is generally the fastest method with distinct columns in the right table. If you have dupes in the column (especially if there are many),

WHERE NOT EXISTS (SELECT 1 FROM tbl WHERE right_col = left_col)

May be faster because it can stop to scan as soon as the first row is found.

You can also use IN, like @dezso demonstrates, but it is usually slower in PostgreSQL.

Try the following:

INSERT INTO group_phones(group_id, phone_name)
SELECT DISTINCT g.id, p.name 
FROM phones AS p, groups as g
WHERE 
    g.id IN ($add_groups) 
    AND p.name IN ($phones)
    AND (g.id, p.name) NOT IN (
        SELECT group_id, phone_name
        FROM group_phones
    )
;

With DISTINCT you can be sure that unique rows will be inserted, and with the NOT IN clause you exclude already existing rows.

NOTE While this solution is probably easier to understand, in most cases Erwin's will perform better.

来源：https://stackoverflow.com/questions/11890599/postgresql-batch-insert-or-ignore

标签

sql

postgresql

plpgsql

postgresql-9.1

duplicate-removal