Check if records exist in a Postgres table

北荒 2021-01-07 10:39

I have to read a CSV file every 20 seconds. Each CSV contains between 500 and 60,000 lines. I have to insert the data into a Postgres table, but before that I need to check if t

2 Answers
  • 2021-01-07 10:47

    First, just for completeness, I changed Erwin's code to use EXCEPT:

    CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
    COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
    
    INSERT INTO tbl
    SELECT tmp.*
    FROM   tmp
    EXCEPT
    SELECT *
    FROM   tbl;
    
    DROP TABLE tmp;
    

    Then I decided to test it myself. I tested on PostgreSQL 9.1 with a mostly untouched postgresql.conf. The target table contains 10 million rows and the source table about 30 thousand, of which about 15 thousand already exist in the target table.

    create table tbl (id integer primary key)
    ;
    insert into tbl
    select generate_series(1, 10000000)
    ;
    create temp table tmp as select * from tbl limit 0
    ;
    insert into tmp
    select generate_series(9985000, 10015000)
    ;
    

    I ran EXPLAIN on the SELECT part only. The EXCEPT version:

    explain
    select *
    from tmp
    except
    select *
    from tbl
    ;
                                           QUERY PLAN                                       
    ----------------------------------------------------------------------------------------
     HashSetOp Except  (cost=0.00..270098.68 rows=200 width=4)
       ->  Append  (cost=0.00..245018.94 rows=10031897 width=4)
             ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..771.40 rows=31920 width=4)
                   ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
             ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..244247.54 rows=9999977 width=4)
                   ->  Seq Scan on tbl  (cost=0.00..144247.77 rows=9999977 width=4)
    (6 rows)
    

    The outer join version:

    explain
    select *
    from 
        tmp
        left join
        tbl using (id)
    where tbl.id is null
    ;
                                    QUERY PLAN                                
    --------------------------------------------------------------------------
     Nested Loop Anti Join  (cost=0.00..208142.58 rows=15960 width=4)
       ->  Seq Scan on tmp  (cost=0.00..452.20 rows=31920 width=4)
       ->  Index Scan using tbl_pkey on tbl  (cost=0.00..7.80 rows=1 width=4)
             Index Cond: (tmp.id = id)
    (4 rows)
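
    The plans above show estimated costs only. As a hedged sketch, actual run times could be compared by running both statements under EXPLAIN ANALYZE on the same test tables. The ANALYZE step is my own addition: autovacuum does not process temporary tables, so statistics for tmp have to be gathered by hand. Also keep in mind that EXCEPT compares whole rows and removes duplicates, while the join compares only id.

    ```sql
    -- Gather statistics for the temp table first (autovacuum skips temp tables).
    analyze tmp
    ;
    -- EXCEPT version: executes the query and reports actual times.
    explain analyze
    select *
    from tmp
    except
    select *
    from tbl
    ;
    -- Outer join version, measured the same way.
    explain analyze
    select *
    from
        tmp
        left join
        tbl using (id)
    where tbl.id is null
    ;
    ```

    Both statements are plain SELECTs, so running them under EXPLAIN ANALYZE does not modify tbl.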
    
  • 2021-01-07 11:04

    This should perform well:

    CREATE TEMP TABLE tmp AS SELECT * FROM tbl LIMIT 0; -- copy layout, but no data
    
    COPY tmp FROM '/absolute/path/to/file' (FORMAT csv);
    
    INSERT INTO tbl
    SELECT tmp.*
    FROM   tmp
    LEFT   JOIN tbl USING (tbl_id)
    WHERE  tbl.tbl_id IS NULL;
    
    DROP TABLE tmp; -- else dropped at end of session automatically
    

    Closely related to this answer.
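
    For comparison, the same anti-join can be written with NOT EXISTS. This is only a sketch of an equivalent form, assuming tbl_id is the unique key of tbl and that tmp has already been filled by the COPY step; the alias t is introduced here for illustration.

    ```sql
    -- Insert only those staged rows whose key is not yet present in tbl.
    INSERT INTO tbl
    SELECT t.*
    FROM   tmp t
    WHERE  NOT EXISTS (
       SELECT 1
       FROM   tbl
       WHERE  tbl.tbl_id = t.tbl_id
       );
    ```

    Like the LEFT JOIN form, this compares only the key column, so it does not filter out duplicate keys inside the CSV itself.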
