I am trying to truncate a postgres destination table prior to insert, and in general, trying to fire external functions utilizing the connections already created in GLUE.
Has anyone been able to do so?
I've tried the DROP/ TRUNCATE
scenario, but have not been able to do it with connections already created in Glue, but with a pure Python PostgreSQL driver, pg8000.
- Download the tar of pg8000 from pypi
- Create an empty
__init__.py
in the root folder - Zip up the contents & upload to S3
- Reference the zip file in the
Python lib path
of the job - Set the DB connection details as job params (make sure to prepend all key names with
--
). Tick the "Server-side encryption" box.
Then you can simply create a connection and execute SQL.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import pg8000
args = getResolvedOptions(sys.argv, [
'JOB_NAME',
'PW',
'HOST',
'USER',
'DB'
])
# ...
# Create Spark & Glue context
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# ...
config_port = 5432
conn = pg8000.connect(
database=args['DB'],
user=args['USER'],
password=args['PW'],
host=args['HOST'],
port=config_port
)
query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()
After following step (4) of @thenaturalist's response,
sc.addPyFile("/home/glue/downloads/python/pg8000.zip")
import pg8000
worked for me in a development endpoint (zeppelin notebook)
More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
来源:https://stackoverflow.com/questions/47081088/aws-glue-truncate-destination-postgres-table-prior-to-insert