Python-PostgreSQL psycopg2 interface --> executemany

廉价感情. 提交于 2019-12-03 14:46:34

just copy all the data into a scratch table with the psql \copy command, or use the psycopg cursor.copy_in() method. Then:

insert into mytable
select * from (
    select distinct * 
    from scratch
) uniq
where not exists (
    select 1 
    from mytable 
    where mytable.mykey = uniq.mykey
);

This will dedup and runs much faster than any combination of inserts.

-dg

I had the same problem and searched here for many days to collect a lot of hints to form a complete solution. Even if the question outdated, I hope this will be useful to others.

1) Forget things about removing indexes/constraints & recreating them later, benefits are marginal or worse.

2) executemany is better than execute as it makes for you the prepare statement. You can get the same results yourself with a command like the following to gain 300% speed:

# To run only once:
sqlCmd = """PREPARE myInsert (int, timestamp, real, text) AS
   INSERT INTO myBigTable (idNumber, date_obs, result, user)
     SELECT $1, $2, $3, $4 WHERE NOT EXISTS
     (SELECT 1 FROM myBigTable WHERE (idNumber, date_obs, user)=($1, $2, $4));"""
curPG.execute(sqlCmd)
cptInsert = 0   # To let you commit from time to time

#... inside the big loop:
curPG.execute("EXECUTE myInsert(%s,%s,%s,%s);", myNewRecord)
allreadyExists = (curPG.rowcount < 1)
if not allreadyExists:
   cptInsert += 1
   if cptInsert % 10000 == 0:
      conPG.commit()

This dummy table example has an unique constraint on (idNumber, date_obs, user).

3) The best solution is to use COPY_FROM and a TRIGGER to manage the unique key BEFORE INSERT. This gave me 36x more speed. I started with normal inserts at 500 records/sec. and with "copy", I got over 18,000 records/sec. Sample code in Python with Psycopg2:

ioResult = StringIO.StringIO() #To use a virtual file as a buffer
cptInsert = 0 # To let you commit from time to time - Memory has limitations
#... inside the big loop:
   print >> ioResult, "\t".join(map(str, myNewRecord))
   cptInsert += 1
   if cptInsert % 10000 == 0:
      ioResult = flushCopyBuffer(ioResult, curPG)
#... after the loop:
ioResult = flushCopyBuffer(ioResult, curPG)

def flushCopyBuffer(bufferFile, cursorObj):
   bufferFile.seek(0)   # Little detail where lures the deamon...
   cursorObj.copy_from(bufferFile, 'myBigTable',
      columns=('idNumber', 'date_obs', 'value', 'user'))
   cursorObj.connection.commit()
   bufferFile.close()
   bufferFile = StringIO.StringIO()
   return bufferFile

That's it for the Python part. Now the Postgresql trigger to not have exception psycopg2.IntegrityError and then all the COPY command's records rejected:

CREATE OR REPLACE FUNCTION chk_exists()
  RETURNS trigger AS $BODY$
DECLARE
    curRec RECORD;
BEGIN
   -- Check if record's key already exists or is empty (file's last line is)
   IF NEW.idNumber IS NULL THEN
      RETURN NULL;
   END IF;
   SELECT INTO curRec * FROM myBigTable
      WHERE (idNumber, date_obs, user) = (NEW.idNumber, NEW.date_obs, NEW.user);
   IF NOT FOUND THEN -- OK keep it
      RETURN NEW;
   ELSE    
      RETURN NULL; -- Oups throw it or update the current record
   END IF;
END;
$BODY$ LANGUAGE plpgsql;

Now link this function to the trigger of your table:

CREATE TRIGGER chk_exists_before_insert
   BEFORE INSERT ON myBigTable FOR EACH ROW EXECUTE PROCEDURE chk_exists();

This seems like a lot of work but Postgresql is a very fast beast when it doesn't have to interpret SQL over and over. Have fun.

"When my script detects such a non-UNIQUE INSERT, it connection.rollback() (which makes ups to 1000 rows everytime, and kind of makes the executemany worthless) and then INSERTs all values one by one."

The question doesn't really make a lot of sense.

Does EVERY block of 1,000 rows fail due to non-unique rows?

Does 1 block of 1,000 rows fail (out 5,000 such blocks)? If so, then the execute many helps for 4,999 out of 5,000 and is far from "worthless".

Are you worried about this non-Unique insert? Or do you have actual statistics on the number of times this happens?

If you've switched from 1,000 row blocks to 100 row blocks, you can -- obviously -- determine if there's a performance advantage for 1,000 row blocks, 100 row blocks and 1 row blocks.

Please actually run the actual program with actual database and different size blocks and post the numbers.

using a MERGE statement instead of an INSERT one would solve your problem.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!