AWS Glue Truncate Redshift Table

Submitted on 2019-12-11 03:38:34

Question


I have created a Glue job that copies data from S3 (csv file) to Redshift. It works and populates the desired table.

However, I need to purge the table during this process as I am left with duplicate records after the process completes.

I'm looking for a way to add this purge to the Glue process. Any advice would be appreciated.

Thanks.


Answer 1:


Did you have a look at Job Bookmarks in Glue? It's a feature for keeping track of the high-water mark of already-processed data, and it works with S3 sources only. I am not 100% sure, but it may require partitioning to be in place.
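Job bookmarks can be switched on per run without touching the job definition; a minimal sketch using the AWS CLI (the job name `my-etl-job` is a placeholder, not from the original post):

```shell
# Start a run of an existing Glue job with bookmarks enabled for that run.
# --job-bookmark-option accepts job-bookmark-enable, job-bookmark-disable,
# or job-bookmark-pause.
aws glue start-job-run \
  --job-name my-etl-job \
  --arguments '{"--job-bookmark-option": "job-bookmark-enable"}'
```

With bookmarks enabled, re-running the job skips S3 objects it has already processed, which avoids the duplicate rows in the first place.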




Answer 2:


You need to modify the auto-generated code provided by Glue: connect to Redshift over a Spark JDBC connection and execute the purge query there.

To spin up the Glue containers inside the Redshift VPC, specify the connection in the Glue job so it can reach the Redshift cluster.

Hope this helps.
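A minimal sketch of what that purge step could look like inside the Glue script. All names here (`jdbc_url`, `user`, `password`, the schema and table) are placeholders, and reaching the JVM's `java.sql.DriverManager` through the py4j gateway that a `SparkSession` exposes is one common way to run a raw statement without extra Python dependencies:

```python
def build_purge_sql(schema: str, table: str) -> str:
    """Build the TRUNCATE statement for the target Redshift table."""
    return f"TRUNCATE TABLE {schema}.{table}"


def run_purge(spark, jdbc_url: str, user: str, password: str,
              schema: str, table: str) -> None:
    """Execute the purge over JDBC before the Glue job writes new rows.

    Uses the JVM bridged via spark._jvm; the Redshift JDBC driver must be
    on the job's classpath (it is bundled with Glue's Redshift connection).
    """
    conn = spark._jvm.java.sql.DriverManager.getConnection(jdbc_url, user, password)
    try:
        stmt = conn.createStatement()
        stmt.executeUpdate(build_purge_sql(schema, table))
        stmt.close()
    finally:
        conn.close()
```

Calling `run_purge(...)` right before the Glue-generated `write_dynamic_frame` step empties the table, so the subsequent load no longer produces duplicates.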




Answer 3:


You can use the Spark/PySpark Databricks Redshift library to truncate the table and then append to it (this performs better than an overwrite):

# SQL to run against Redshift before the COPY; replace <schema.table>
# with your target table.
preactions = "TRUNCATE TABLE <schema.table>"
df.write\
  .format("com.databricks.spark.redshift")\
  .option("url", redshift_url)\
  .option("dbtable", redshift_table)\
  .option("user", user)\
  .option("password", redshift_password)\
  .option("aws_iam_role", redshift_copy_role)\
  .option("tempdir", args["TempDir"])\
  .option("preactions", preactions)\
  .mode("append")\
  .save()

You can take a look at the Databricks documentation here.



来源:https://stackoverflow.com/questions/48026111/aws-glue-truncate-redshift-table
