Postgresql batch insert or ignore

被刻印的时光 ゝ 提交于 2019-11-30 14:43:49
Erwin Brandstetter

There are 3 challenges.

  1. Your query has no JOIN condition between the tables phones and groups, making this effectively a limited CROSS JOIN - which you most probably do not intend. I.e. every phone that qualifies is combined with every group that qualifies. If you have 100 phones and 100 groups that's already 10,000 combinations.

  2. Insert distinct combinations of (group_id, phone_name)

  3. Avoid inserting rows that are already there in table group_phones .

All things considered it could look like this:

INSERT INTO group_phones(group_id, phone_name)
SELECT i.id, i.name
FROM  (
    SELECT DISTINCT g.id, p.name -- get distinct combinations
    FROM   phones p
    JOIN   groups g ON ??how are p & g connected??
    WHERE  g.id IN ($add_groups)
    AND    p.name IN ($phones)
    ) i
LEFT   JOIN group_phones gp ON (gp.group_id, gp.phone_name) = (i.id, i.name)
WHERE  gp.group_id IS NULL  -- avoid duping existing rows

Concurrency

This form minimizes the chance of a race condition with concurrent write operations. If your table has heavy concurrent write load, you may want to lock the table exclusively or use serializable transaction isolation, This safeguard against the extremely unlikely case that a row is altered by a concurrent transaction in the tiny time slot between the constraint verification (row isn't there) and the write operation in the query.

BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT ...
COMMIT;

Be prepared to repeat the transaction if it rolls back with a serialization error. For more on that topic good starting points could be this blog post by @depesz or this related question on SO.

Normally, though, you needn't even bother with any of this.

Performance

LEFT JOIN tbl ON right_col = left_col WHERE right_col IS NULL

is generally the fastest method with distinct columns in the right table. If you have dupes in the column (especially if there are many),

WHERE NOT EXISTS (SELECT 1 FROM tbl WHERE right_col = left_col)

May be faster because it can stop to scan as soon as the first row is found.

You can also use IN, like @dezso demonstrates, but it is usually slower in PostgreSQL.

Try the following:

INSERT INTO group_phones(group_id, phone_name)
SELECT DISTINCT g.id, p.name 
FROM phones AS p, groups as g
WHERE 
    g.id IN ($add_groups) 
    AND p.name IN ($phones)
    AND (g.id, p.name) NOT IN (
        SELECT group_id, phone_name
        FROM group_phones
    )
;

With DISTINCT you can be sure that unique rows will be inserted, and with the NOT IN clause you exclude already existing rows.

NOTE While this solution is probably easier to understand, in most cases Erwin's will perform better.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!