I have a dataframe in Python. Can I write this data to Redshift as a new table? I have successfully created a database connection to Redshift and am able to execute simple SQL queries.
I tried using pandas df.to_sql(), but it was tremendously slow: it took well over 10 minutes to insert 50 rows. See this open issue (as of writing).
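For reference, the slow call was essentially the following (a sketch only; the SQLAlchemy engine URL and table name here are placeholders, not my actual values):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table name -- fill in your own.
engine = create_engine("postgresql+psycopg2://user:password@host:5439/db")
df.to_sql("my_table", engine, index=False, if_exists="replace")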
I tried using odo from the blaze ecosystem (as per the recommendations in the issue discussion), but ran into a ProgrammingError which I didn't bother to investigate.
Finally, what worked:
import psycopg2

# Fill in the blanks for the conn object
conn = psycopg2.connect(user='user',
                        password='password',
                        host='host',
                        dbname='db',
                        port=666)
cursor = conn.cursor()

# np_data is a numpy ndarray holding the rows to insert.
# Adjust the number of %s placeholders to match the number of columns.
args_str = b','.join(cursor.mogrify("(%s,%s,...)", x) for x in tuple(map(tuple, np_data)))

# Adjust the table name and column list (a,b,...) to match your schema.
cursor.execute("insert into table (a,b,...) VALUES " + args_str.decode("utf-8"))

cursor.close()
conn.commit()
conn.close()
Yep, plain old psycopg2. This is for a numpy array, but converting from a df to an ndarray shouldn't be too difficult. This gave me around 3k rows/minute.
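A minimal sketch of that conversion (assuming a dataframe named df whose column order matches the target table):

import pandas as pd

# Hypothetical dataframe; replace with your own.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# .to_numpy() (or .values on older pandas versions) gives the ndarray
# that the mogrify loop above expects; each row becomes one tuple.
np_data = df.to_numpy()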
However, the fastest solution, as recommended by other team mates, is to use the COPY command after dumping the dataframe as a TSV/CSV into an S3 bucket and then copying it over into Redshift. You should look into this if you're copying really huge datasets. (I will update here if and when I try it out.)
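For completeness, a rough sketch of that approach (I haven't run this myself; the bucket name, file key, table name and IAM role ARN are placeholders you would have to fill in, and your cluster needs permission to read from the bucket):

import boto3

# Dump the dataframe to CSV locally, then upload it to S3.
df.to_csv("data.csv", index=False, header=False)
boto3.client("s3").upload_file("data.csv", "my-bucket", "data.csv")

# Then run COPY on Redshift via a psycopg2 connection like the one above.
cursor = conn.cursor()
cursor.execute("""
    COPY my_table
    FROM 's3://my-bucket/data.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS CSV;
""")
conn.commit()
cursor.close()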