AWS Glue to Redshift: Is it possible to replace, update or delete data?

执念已碎 2020-12-25 12:33

Here are some bullet points describing how I have things set up:

  • I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema.
6 Answers
  • 2020-12-25 13:10

    As per my testing of the same scenario, the bookmark functionality is not working: duplicate data gets inserted when the job is run multiple times. I resolved this by removing the files from the S3 location daily (through a Lambda function) and implementing staging and target tables; data is inserted/updated based on the matching key columns.
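
    A minimal sketch of that staging-to-target merge, assuming hypothetical names (public.staging_table, public.target_table, key column id) that are not from this answer; the SQL can be run as a Glue postaction or over a plain JDBC connection:

    # Hedged sketch of the staging-to-target merge; table and column names
    # (public.staging_table, public.target_table, id) are placeholders.
    merge_sql = """
    BEGIN;
    -- Replace target rows whose key also arrived in the staging table
    DELETE FROM public.target_table
    USING public.staging_table
    WHERE public.target_table.id = public.staging_table.id;
    INSERT INTO public.target_table
    SELECT * FROM public.staging_table;
    END;
    -- Clear the staging table for the next run
    TRUNCATE public.staging_table;
    """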

  • 2020-12-25 13:13

    Please check this answer. It has an explanation and a code sample showing how to upsert data into Redshift using a staging table. The same approach can be used to run any SQL queries before or after Glue writes data, using the preactions and postactions options:

    // Write the incoming DynamicFrame to a staging table in Redshift.
    // preactions/postactions are SQL statements run in Redshift before/after
    // the write; the upsert from staging into the target table goes there.
    glueContext.getJDBCSink(
      catalogConnection = "redshift-glue-connections-test",
      options = JsonOptions(Map(
        "database" -> "conndb",
        "dbtable" -> staging,                      // name of the staging table
        "overwrite" -> "true",
        "preactions" -> "<SQL to run before the write>",
        "postactions" -> "<SQL to run after the write, e.g. the upsert>"
      )),
      redshiftTmpDir = tempDir,
      transformationContext = "redshift-output"
    ).writeDynamicFrame(datasetDf)
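
    For PySpark jobs, a roughly equivalent sketch uses write_dynamic_frame.from_jdbc_conf; the connection, database and table names and the SQL placeholders below are assumptions, and glueContext, dataset_dyf and temp_dir are assumed to come from the usual Glue job boilerplate:

    # Hedged PySpark sketch; names and the SQL placeholders are assumptions.
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dataset_dyf,                           # DynamicFrame to write
        catalog_connection="redshift-glue-connections-test",
        connection_options={
            "database": "conndb",
            "dbtable": "public.staging_table",
            "preactions": "<SQL to run before the write>",
            "postactions": "<SQL to run after the write, e.g. the upsert>",
        },
        redshift_tmp_dir=temp_dir,
        transformation_ctx="redshift-output",
    )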
    
  • 2020-12-25 13:14

    The job bookmarking option in Glue should do the trick, as suggested above. I have been using it successfully when my source is S3. http://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
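
    As a rough sketch (the database and table names are placeholders), bookmarks only take effect when the job calls job.init()/job.commit() and the source read carries a transformation_ctx, in addition to enabling the bookmark option on the job itself:

    # Hedged sketch of a bookmarked catalog/S3 read; database and table names
    # are placeholders.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)    # bookmark state is tied to the job

    # transformation_ctx is the key the bookmark uses to remember progress
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="my_database",
        table_name="my_csv_table",
        transformation_ctx="datasource0",
    )

    # ... transform and write to Redshift ...

    job.commit()                        # persist the bookmark for the next run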

  • 2020-12-25 13:18

    I have tested this and found a workaround to update/delete rows in the target table using a JDBC connection.

    I used the following:

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    import pg8000

    # Resolve the Redshift connection details passed in as job parameters
    args = getResolvedOptions(sys.argv, [
        'JOB_NAME',
        'PW',
        'HOST',
        'USER',
        'DB'
    ])
    # ...
    # Create Spark & Glue context
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # ...
    # Open a direct connection to Redshift with pg8000
    config_port = ****
    conn = pg8000.connect(
        database=args['DB'],
        user=args['USER'],
        password=args['PW'],
        host=args['HOST'],
        port=config_port
    )

    # Run an UPDATE against the target table
    query = "UPDATE table .....;"
    cur = conn.cursor()
    cur.execute(query)
    conn.commit()
    cur.close()

    # Delete rows from AAA that have a matching id in BBB
    # (Redshift uses DELETE ... USING rather than a multi-table DELETE)
    query1 = "DELETE FROM AAA USING BBB WHERE AAA.id = BBB.id"
    cur1 = conn.cursor()
    cur1.execute(query1)
    conn.commit()
    cur1.close()
    conn.close()
    
  • 2020-12-25 13:19

    Job bookmarks are the key. Just edit the job and enable "Job bookmarks", and it won't process already-processed data. Note that the job has to rerun once before it detects that it does not have to reprocess the old data again.

    For more info see: http://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

    The name "bookmark" is a bit far-fetched in my opinion. I would never have looked at it if I had not coincidentally stumbled upon it during my search.

  • 2020-12-25 13:20

    This was the solution I got from AWS Glue Support:

    As you may know, although you can create primary keys, Redshift doesn't enforce uniqueness. Therefore, if you are rerunning Glue jobs then duplicate rows can get inserted. Some of the ways to maintain uniqueness are:

    1. Use a staging table to insert all rows and then perform an upsert/merge [1] into the main table; this has to be done outside of Glue.

    2. Add another column to your Redshift table [2], like an insert timestamp, to allow duplicates but know which row came first or last, and then delete the duplicates afterwards if you need to.

    3. Load the previously inserted data into a DataFrame and then compare it with the data to be inserted, to avoid inserting duplicates [3] (a minimal sketch follows the references below).

    [1] - http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html and http://www.silota.com/blog/amazon-redshift-upsert-support-staging-table-replace-rows/

    [2] - https://github.com/databricks/spark-redshift/issues/238

    [3] - https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
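
    A minimal PySpark sketch of option 3, assuming a placeholder key column id, placeholder connection details, and the usual glueContext / temp_dir job boilerplate: read the rows already in Redshift, anti-join the incoming rows on the key, and write only the remainder.

    # Hedged sketch of option 3; connection details, table names and the key
    # column "id" are placeholders, not values from the support answer.
    from awsglue.dynamicframe import DynamicFrame

    existing_dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="redshift",
        connection_options={
            "url": "jdbc:redshift://<cluster-endpoint>:5439/conndb",
            "user": "<user>",
            "password": "<password>",
            "dbtable": "public.target_table",
            "redshiftTmpDir": temp_dir,
        },
    )

    incoming_df = incoming_dyf.toDF()   # incoming_dyf: new rows read from S3
    existing_df = existing_dyf.toDF()

    # Keep only incoming rows whose key is not already present in the target
    new_rows_df = incoming_df.join(existing_df, on="id", how="left_anti")
    new_rows_dyf = DynamicFrame.fromDF(new_rows_df, glueContext, "new_rows")
    # ... write new_rows_dyf to Redshift as usual ...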
