I have a dataframe in Python. Can I write this data to Redshift as a new table? I have successfully created a db connection to Redshift and am able to execute simple sql queries.
I used to rely on pandas' to_sql() function, but it is just too slow. I have recently switched to doing the following:
import pandas as pd
import s3fs  # great module which allows you to read/write to s3 easily
import sqlalchemy

df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11}, {'A': 'bar', 'B': 'blue', 'C': 20}])

# stage the dataframe as a CSV on S3; the header is omitted so COPY maps columns by position
s3 = s3fs.S3FileSystem(anon=False)
filename = 'my_s3_bucket_name/file.csv'
with s3.open(filename, 'w') as f:
    df.to_csv(f, index=False, header=False)

con = sqlalchemy.create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')
# make sure mytable already exists in the target schema
# DELETE mytable clears all existing rows but keeps the table itself
# (Redshift lets you omit FROM); remove that line if you only want to append
con.execute("""
    DELETE mytable;
    COPY mytable
    FROM 's3://%s'
    iam_role 'arn:aws:iam::xxxx:role/role_name'
    csv;""" % filename)
The IAM role has to allow Redshift access to S3; see the AWS documentation on COPY permissions for more details.
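If you can't attach a role, COPY also accepts key-based credentials. A minimal sketch, assuming you have an access key pair with read access to the bucket (keys in code are less secure than a role, so prefer iam_role where possible):

con.execute("""
    COPY mytable
    FROM 's3://%s'
    ACCESS_KEY_ID 'your_access_key_id'
    SECRET_ACCESS_KEY 'your_secret_access_key'
    csv;""" % filename)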
I found that for a ~300 KB file (a 12000x2 dataframe) this takes about 4 seconds, compared to the 8 minutes I was getting with pandas' to_sql() function.
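One housekeeping note: the staged CSV stays in the bucket after the COPY finishes, so if you don't need it you can remove it with s3fs, reusing the s3 and filename objects from above:

# delete the staged file once the COPY has succeeded
s3.rm(filename)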