How to write data to Redshift that is the result of a dataframe created in Python?

谎友^ 2020-12-14 08:27

I have a dataframe in Python. Can I write this data to Redshift as a new table? I have successfully created a db connection to Redshift and am able to execute simple SQL queries.
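
A minimal sketch of the kind of connection that already works (hostname and credentials are placeholders):

import psycopg2  # assuming psycopg2 as the driver behind the connection

conn = psycopg2.connect(host='yoururl.com', port=5439, dbname='yourdatabase',
                        user='username', password='password')
cur = conn.cursor()
cur.execute('SELECT current_date;')  # the kind of simple query that succeeds
print(cur.fetchone())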

6 Answers
  •  半阙折子戏
    2020-12-14 08:58

    I used to rely on the pandas to_sql() function, but it is just too slow. I have recently switched to the following approach:

    import pandas as pd
    import s3fs # great module which allows you to read/write to s3 easily
    import sqlalchemy
    
    df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11},{'A':'bar', 'B':'blue', 'C': 20}])
    
    s3 = s3fs.S3FileSystem(anon=False)
    filename = 'my_s3_bucket_name/file.csv'
    with s3.open(filename, 'w') as f:
        df.to_csv(f, index=False, header=False)
    
    con = sqlalchemy.create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')
    # make sure mytable (and its schema) already exist in Redshift
    
    # DELETE FROM clears the existing rows but keeps the table definition;
    # if you only want to append, just drop the DELETE FROM statement
    
    con.execute("""
        DELETE FROM mytable;
        COPY mytable
        from 's3://%s'
        iam_role 'arn:aws:iam::xxxx:role/role_name'
        csv;""" % filename)
    
    

    The IAM role has to allow Redshift access to S3; see the AWS documentation on COPY from Amazon S3 for more details.
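
    If you still need to set that up, here is a minimal sketch of granting the role read access to the bucket via boto3 (the role name and bucket are the placeholders from above, and 'redshift-s3-read' is a hypothetical policy name; the role must also trust redshift.amazonaws.com):

    import json
    import boto3
    
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my_s3_bucket_name",
                         "arn:aws:s3:::my_s3_bucket_name/*"],
        }],
    }
    
    boto3.client('iam').put_role_policy(RoleName='role_name',
                                        PolicyName='redshift-s3-read',
                                        PolicyDocument=json.dumps(policy))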

    I found that for a 300 KB file (a 12000×2 dataframe) this takes 4 seconds, compared to the 8 minutes I was getting with the pandas to_sql() function.
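
    For reference, the to_sql() baseline that timing refers to is roughly this (same placeholder engine and table); it pushes rows through INSERT statements over the connection rather than a bulk COPY, which is where the time goes:

    import pandas as pd
    import sqlalchemy
    
    engine = sqlalchemy.create_engine('postgresql://username:password@yoururl.com:5439/yourdatabase')
    df = pd.DataFrame([{'A': 'foo', 'B': 'green', 'C': 11}, {'A': 'bar', 'B': 'blue', 'C': 20}])
    df.to_sql('mytable', engine, index=False, if_exists='append')  # slow for large frames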
