I have a dataframe in Python. Can I write this data to Redshift as a new table? I have successfully created a db connection to Redshift and am able to execute simple sql queries.
I used to rely on pandas' to_sql() function, but it is just too slow. I have recently switched to doing the following:
import pandas as pd
import s3fs  # great module which allows you to read/write to s3 easily
import sqlalchemy

df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11}, {'A': 'bar', 'B': 'blue', 'C': 20}])

# stage the dataframe as a CSV on S3; the header is omitted so COPY maps columns by position
s3 = s3fs.S3FileSystem(anon=False)
filename = 'my_s3_bucket_name/file.csv'
with s3.open(filename, 'w') as f:
    df.to_csv(f, index=False, header=False)

con = sqlalchemy.create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')
# make sure mytable already exists in the target schema
# DELETE mytable clears all existing rows but keeps the table itself
# (Redshift lets you omit FROM); remove that line if you only want to append
con.execute("""
    DELETE mytable;
    COPY mytable
    FROM 's3://%s'
    iam_role 'arn:aws:iam::xxxx:role/role_name'
    csv;""" % filename)
The IAM role has to allow Redshift access to S3; see the AWS documentation on COPY permissions for more details.
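If you can't attach a role, COPY also accepts key-based credentials. A minimal sketch, assuming you have an access key pair with read access to the bucket (keys in code are less secure than a role, so prefer iam_role where possible):

con.execute("""
    COPY mytable
    FROM 's3://%s'
    ACCESS_KEY_ID 'your_access_key_id'
    SECRET_ACCESS_KEY 'your_secret_access_key'
    csv;""" % filename)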
I found that for a ~300 KB file (a 12000x2 dataframe) this takes about 4 seconds, compared to the 8 minutes I was getting with pandas' to_sql() function.
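One housekeeping note: the staged CSV stays in the bucket after the COPY finishes, so if you don't need it you can remove it with s3fs, reusing the s3 and filename objects from above:

# delete the staged file once the COPY has succeeded
s3.rm(filename)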