AWS Glue Truncate Redshift Table

Submitted on 2019-12-11 03:38:34

Question


I have created a Glue job that copies data from S3 (csv file) to Redshift. It works and populates the desired table.

However, I need to purge the table during this process as I am left with duplicate records after the process completes.

I'm looking for a way to add this purge to the Glue process. Any advice would be appreciated.

Thanks.


Answer 1:


Did you have a look at Job Bookmarks in Glue? It's a feature for keeping track of the high-water mark of already-processed data, and it works with S3 sources only. I am not 100% sure, but it may require partitioning to be in place.
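Job bookmarks can be switched on per run without touching the job definition; a minimal sketch using the AWS CLI (the job name `my-etl-job` is a placeholder, not from the original post):

```shell
# Start a run of an existing Glue job with bookmarks enabled for that run.
# --job-bookmark-option accepts job-bookmark-enable, job-bookmark-disable,
# or job-bookmark-pause.
aws glue start-job-run \
  --job-name my-etl-job \
  --arguments '{"--job-bookmark-option": "job-bookmark-enable"}'
```

With bookmarks enabled, re-running the job skips S3 objects it has already processed, which avoids the duplicate rows in the first place.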




Answer 2:


You need to modify the auto-generated code provided by Glue: connect to Redshift over a Spark JDBC connection and execute the purge query there.

To spin up the Glue containers inside the Redshift VPC, specify the connection in the Glue job so it can reach the Redshift cluster.

Hope this helps.
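A minimal sketch of what that purge step could look like inside the Glue script. All names here (`jdbc_url`, `user`, `password`, the schema and table) are placeholders, and reaching the JVM's `java.sql.DriverManager` through the py4j gateway that a `SparkSession` exposes is one common way to run a raw statement without extra Python dependencies:

```python
def build_purge_sql(schema: str, table: str) -> str:
    """Build the TRUNCATE statement for the target Redshift table."""
    return f"TRUNCATE TABLE {schema}.{table}"


def run_purge(spark, jdbc_url: str, user: str, password: str,
              schema: str, table: str) -> None:
    """Execute the purge over JDBC before the Glue job writes new rows.

    Uses the JVM bridged via spark._jvm; the Redshift JDBC driver must be
    on the job's classpath (it is bundled with Glue's Redshift connection).
    """
    conn = spark._jvm.java.sql.DriverManager.getConnection(jdbc_url, user, password)
    try:
        stmt = conn.createStatement()
        stmt.executeUpdate(build_purge_sql(schema, table))
        stmt.close()
    finally:
        conn.close()
```

Calling `run_purge(...)` right before the Glue-generated `write_dynamic_frame` step empties the table, so the subsequent load no longer produces duplicates.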




Answer 3:


You can use the Spark/PySpark Databricks Redshift library to truncate the table and then append to it (this performs better than an overwrite):

# SQL to run against Redshift before the COPY; replace <schema.table>
# with your target table.
preactions = "TRUNCATE TABLE <schema.table>"
df.write\
  .format("com.databricks.spark.redshift")\
  .option("url", redshift_url)\
  .option("dbtable", redshift_table)\
  .option("user", user)\
  .option("password", redshift_password)\
  .option("aws_iam_role", redshift_copy_role)\
  .option("tempdir", args["TempDir"])\
  .option("preactions", preactions)\
  .mode("append")\
  .save()

You can take a look at the Databricks documentation here.



来源:https://stackoverflow.com/questions/48026111/aws-glue-truncate-redshift-table
