Question
I want to update a table in AWS on a daily basis. My plan is to first delete the data/rows in a public table in AWS using Python and psycopg2, and then insert the data from a Python dataframe into that table.
import psycopg2
import pandas as pd
con = psycopg2.connect(dbname=My_Credential.....)
cur = con.cursor()
sql = """
DELETE FROM tableA
"""
cur.execute(sql)
con.commit()
The above code handles the delete, but I don't know how to write the Python code to insert My_Dataframe into tableA. TableA holds around 1 to 5 million rows, please advise.
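For reference, a direct insert from the dataframe can be done with psycopg2's execute_values helper. A minimal sketch, reusing con/cur from above and assuming a placeholder column list (col1, col2) that you would replace with tableA's real columns (Answer 1 below explains why this does not scale well to millions of rows):
from psycopg2.extras import execute_values

# Turn the dataframe into a list of plain tuples
rows = list(My_Dataframe.itertuples(index=False, name=None))
# Batched multi-row INSERT; (col1, col2) is a placeholder for tableA's actual columns
execute_values(cur, "INSERT INTO tableA (col1, col2) VALUES %s", rows, page_size=10000)
con.commit()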
Answer 1:
I agree with what @mdem7 suggested in the comments: inserting 1-5 million rows via a dataframe is not a good idea at all, and you will run into performance issues.
It is better to use the S3-to-Redshift load approach. Here is code that runs both the TRUNCATE and the COPY command.
import psycopg2

def redshift():
    conn = psycopg2.connect(dbname='database_name', host='888888888888****.u.****.redshift.amazonaws.com', port='5439', user='username', password='********')
    cur = conn.cursor()
    # Empty the target table before reloading it
    cur.execute("truncate table example;")
    # Begin your transaction
    cur.execute("begin;")
    # Bulk-load the CSV from S3 into the table
    cur.execute("copy example from 's3://examble-bucket/example.csv' credentials 'aws_access_key_id=ID;aws_secret_access_key=KEY/KEY/pL/KEY' csv;")
    # Commit your transaction
    cur.execute("commit;")
    print("Copy executed fine!")

redshift()
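The COPY command above assumes the dataframe has already been written to S3 as example.csv. A minimal sketch of that upload step with boto3, reusing the bucket and key names from the example (your actual bucket, key, and credential setup will differ):
import io
import boto3

# Serialize the dataframe to CSV in memory (no header row, since the COPY above does not use IGNOREHEADER)
csv_buffer = io.StringIO()
My_Dataframe.to_csv(csv_buffer, index=False, header=False)

# Upload the CSV to the bucket/key referenced by the COPY command
s3 = boto3.client('s3')
s3.put_object(Bucket='examble-bucket', Key='example.csv', Body=csv_buffer.getvalue())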
There are even more ways to make COPY faster, such as the Manifest option, which lets Redshift load the data in parallel.
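As a rough illustration of the manifest option (the part file names here are assumptions): split the data into several CSV parts, upload a manifest file that lists them, and point COPY at the manifest with the manifest keyword so Redshift loads the parts in parallel:
import json
import boto3

# Manifest listing the CSV parts to load in parallel
manifest = {
    "entries": [
        {"url": "s3://examble-bucket/example_part_00.csv", "mandatory": True},
        {"url": "s3://examble-bucket/example_part_01.csv", "mandatory": True},
    ]
}
s3 = boto3.client('s3')
s3.put_object(Bucket='examble-bucket', Key='example.manifest', Body=json.dumps(manifest))

# Reusing the cursor from redshift() above; note the trailing "manifest" keyword
cur.execute("copy example from 's3://examble-bucket/example.manifest' credentials 'aws_access_key_id=ID;aws_secret_access_key=KEY/KEY/pL/KEY' manifest csv;")
cur.execute("commit;")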
Hope this gives you some idea of how to proceed.
Answer 2:
Any suggestions on how to pass a connection string in place of the individual connection details in the code below?
conn = psycopg2.connect(dbname='', host='', ...)
I am looking to pass it like this:
conn = psycopg2.connect('Connection_String')
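psycopg2 does accept a single libpq-style connection string (and, in recent versions, a postgresql:// URI) instead of keyword arguments. A minimal sketch with placeholder host and credentials:
import psycopg2

# Keyword/value DSN string (libpq format)
conn = psycopg2.connect("dbname=database_name host=example.redshift.amazonaws.com port=5439 user=username password=********")

# Or the equivalent URI form
conn = psycopg2.connect("postgresql://username:********@example.redshift.amazonaws.com:5439/database_name")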
Source: https://stackoverflow.com/questions/53891593/python-write-dateframe-to-aws-redshift-using-psycopg2