How to write data to Redshift that is a result of a dataframe created in Python?

谎友^ 2020-12-14 08:27

I have a dataframe in Python. Can I write this data to Redshift as a new table? I have successfully created a db connection to Redshift and am able to execute simple sql queries.

6 Answers
  •  星月不相逢
    2020-12-14 08:41

    I tried using pandas df.to_sql() but it was tremendously slow. It was taking me well over 10 minutes to insert 50 rows. See this issue (still open as of writing).
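    For reference, a minimal sketch of what that to_sql() call looks like (the connection string, table name and sample dataframe below are placeholders, not my actual setup):

    import pandas as pd
    from sqlalchemy import create_engine
    
    df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})  # stand-in for your dataframe
    
    # Redshift speaks the Postgres wire protocol, so the psycopg2 driver works;
    # the credentials here are placeholders (Redshift's default port is 5439).
    engine = create_engine('postgresql+psycopg2://user:password@host:5439/db')
    
    # if_exists='replace' creates the table; method='multi' batches several rows
    # per INSERT and chunksize controls the batch size.
    df.to_sql('my_table', engine, index=False, if_exists='replace',
              method='multi', chunksize=1000)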

    I tried using odo from the blaze ecosystem (as per the recommendations in the issue discussion), but faced a ProgrammingError which I didn't bother to investigate.

    Finally what worked:

    import psycopg2
    
    # Fill in the blanks for the conn object
    # (666 is just a placeholder; Redshift's default port is 5439)
    conn = psycopg2.connect(user='user',
                            password='password',
                            host='host',
                            dbname='db',
                            port=666)
    cursor = conn.cursor()
    
    # Adjust the placeholders and column list to match your number of columns.
    # np_data is the numpy array holding the rows; mogrify() returns bytes on
    # Python 3, hence the decode() before execute().
    args_str = b','.join(cursor.mogrify("(%s,%s,...)", x) for x in tuple(map(tuple, np_data)))
    cursor.execute("insert into table (a,b,...) VALUES " + args_str.decode("utf-8"))
    
    conn.commit()
    cursor.close()
    conn.close()
    

    Yep, plain old psycopg2. This is for a numpy array, but converting from a df to an ndarray shouldn't be too difficult (e.g. df.to_numpy() or df.values). This gave me around 3k rows/minute.
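    For completeness, a rough sketch of the same trick driven straight from a dataframe (the dataframe, table and column names below are made up for illustration):

    import pandas as pd
    import psycopg2
    
    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})  # stand-in for your dataframe
    
    conn = psycopg2.connect(user='user', password='password',
                            host='host', dbname='db', port=666)
    cursor = conn.cursor()
    
    # psycopg2 can't adapt numpy scalar types, so cast values to plain Python
    # objects before building the tuples (may or may not be needed for your dtypes).
    rows = [tuple(r) for r in df.astype(object).to_numpy()]
    args_str = b','.join(cursor.mogrify("(%s,%s)", r) for r in rows)
    cursor.execute("insert into my_table (a, b) VALUES " + args_str.decode("utf-8"))
    
    conn.commit()
    cursor.close()
    conn.close()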

    However, the fastest solution, as per recommendations from other team mates, is to use the COPY command after dumping the dataframe as a TSV/CSV into an S3 bucket and then copying it over. You should look into this if you're copying really huge datasets. (I will update here if and when I try it out)
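    In case it helps, here is a rough sketch of that S3 + COPY route (the bucket name, IAM role ARN and table are placeholders I made up, and I haven't benchmarked this myself):

    import boto3
    import pandas as pd
    import psycopg2
    
    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})  # stand-in for your dataframe
    
    # 1) Dump the dataframe to CSV and upload it to S3 (bucket/key are placeholders)
    df.to_csv('/tmp/my_table.csv', index=False, header=False)
    boto3.client('s3').upload_file('/tmp/my_table.csv', 'my-bucket', 'staging/my_table.csv')
    
    # 2) Have Redshift ingest the file with COPY. The IAM role ARN is a placeholder
    #    and the target table must already exist with matching columns.
    conn = psycopg2.connect(user='user', password='password',
                            host='host', dbname='db', port=666)
    cursor = conn.cursor()
    cursor.execute("""
        COPY my_table (a, b)
        FROM 's3://my-bucket/staging/my_table.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
        FORMAT AS CSV;
    """)
    conn.commit()
    cursor.close()
    conn.close()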
