How to write data to Redshift that is a result of a dataframe created in Python?

谎友^ 2020-12-14 08:27

I have a dataframe in Python. Can I write this data to Redshift as a new table? I have successfully created a db connection to Redshift and am able to execute simple sql queries.

6 Answers
  •  星月不相逢
    2020-12-14 08:41

    I tried using pandas df.to_sql() but it was tremendously slow. It was taking me well over 10 minutes to insert 50 rows. See this issue (still open as of writing).
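    For reference, a minimal sketch of what that to_sql() call looks like (the connection string, table name and sample dataframe below are placeholders, not my actual setup):

    import pandas as pd
    from sqlalchemy import create_engine
    
    df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})  # stand-in for your dataframe
    
    # Redshift speaks the Postgres wire protocol, so the psycopg2 driver works;
    # the credentials here are placeholders (Redshift's default port is 5439).
    engine = create_engine('postgresql+psycopg2://user:password@host:5439/db')
    
    # if_exists='replace' creates the table; method='multi' batches several rows
    # per INSERT and chunksize controls the batch size.
    df.to_sql('my_table', engine, index=False, if_exists='replace',
              method='multi', chunksize=1000)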

    I tried using odo from the blaze ecosystem (as per the recommendations in the issue discussion), but faced a ProgrammingError which I didn't bother to investigate.

    Finally what worked:

    import psycopg2
    
    # Fill in the blanks for the conn object
    # (666 is just a placeholder; Redshift's default port is 5439)
    conn = psycopg2.connect(user='user',
                            password='password',
                            host='host',
                            dbname='db',
                            port=666)
    cursor = conn.cursor()
    
    # Adjust the placeholders and column list to match your number of columns.
    # np_data is the numpy array holding the rows; mogrify() returns bytes on
    # Python 3, hence the decode() before execute().
    args_str = b','.join(cursor.mogrify("(%s,%s,...)", x) for x in tuple(map(tuple, np_data)))
    cursor.execute("insert into table (a,b,...) VALUES " + args_str.decode("utf-8"))
    
    conn.commit()
    cursor.close()
    conn.close()
    

    Yep, plain old psycopg2. This is for a numpy array, but converting from a df to an ndarray shouldn't be too difficult (e.g. df.to_numpy() or df.values). This gave me around 3k rows/minute.
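    For completeness, a rough sketch of the same trick driven straight from a dataframe (the dataframe, table and column names below are made up for illustration):

    import pandas as pd
    import psycopg2
    
    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})  # stand-in for your dataframe
    
    conn = psycopg2.connect(user='user', password='password',
                            host='host', dbname='db', port=666)
    cursor = conn.cursor()
    
    # psycopg2 can't adapt numpy scalar types, so cast values to plain Python
    # objects before building the tuples (may or may not be needed for your dtypes).
    rows = [tuple(r) for r in df.astype(object).to_numpy()]
    args_str = b','.join(cursor.mogrify("(%s,%s)", r) for r in rows)
    cursor.execute("insert into my_table (a, b) VALUES " + args_str.decode("utf-8"))
    
    conn.commit()
    cursor.close()
    conn.close()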

    However, the fastest solution, as per recommendations from other team mates, is to use the COPY command after dumping the dataframe as a TSV/CSV into an S3 bucket and then copying it over. You should look into this if you're copying really huge datasets. (I will update here if and when I try it out)
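    In case it helps, here is a rough sketch of that S3 + COPY route (the bucket name, IAM role ARN and table are placeholders I made up, and I haven't benchmarked this myself):

    import boto3
    import pandas as pd
    import psycopg2
    
    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})  # stand-in for your dataframe
    
    # 1) Dump the dataframe to CSV and upload it to S3 (bucket/key are placeholders)
    df.to_csv('/tmp/my_table.csv', index=False, header=False)
    boto3.client('s3').upload_file('/tmp/my_table.csv', 'my-bucket', 'staging/my_table.csv')
    
    # 2) Have Redshift ingest the file with COPY. The IAM role ARN is a placeholder
    #    and the target table must already exist with matching columns.
    conn = psycopg2.connect(user='user', password='password',
                            host='host', dbname='db', port=666)
    cursor = conn.cursor()
    cursor.execute("""
        COPY my_table (a, b)
        FROM 's3://my-bucket/staging/my_table.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT AS CSV;
    """)
    conn.commit()
    cursor.close()
    conn.close()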
