aws-glue

AWS Glue - Truncate destination postgres table prior to insert

浪尽此生 submitted on 2019-11-27 18:26:20
Question: I am trying to truncate a Postgres destination table prior to insert and, more generally, to fire external functions using the connections already created in Glue. Has anyone been able to do so?
Answer 1: I've tried the DROP/TRUNCATE scenario. I have not been able to do it with the connections already created in Glue, but it works with a pure-Python PostgreSQL driver, pg8000. Download the pg8000 tar from PyPI, create an empty __init__.py in the root folder, then zip up the contents and upload to S3.
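The approach in the answer above could be sketched roughly as follows. This is a minimal sketch, not the answerer's actual code: the helper names (`quote_ident`, `build_truncate_sql`, `truncate_table`, `connect_with_pg8000`) are hypothetical, and the connection details are assumed to be supplied by the caller rather than read from a Glue connection. Only `pg8000.connect` and the standard DB-API cursor methods are real library calls.

```python
def quote_ident(name):
    """Double-quote a PostgreSQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_truncate_sql(schema, table):
    """Build a TRUNCATE statement for schema.table (hypothetical helper)."""
    return "TRUNCATE TABLE {}.{}".format(quote_ident(schema), quote_ident(table))


def truncate_table(conn, schema, table):
    """Truncate the table using any DB-API 2.0 connection, then commit."""
    cursor = conn.cursor()
    try:
        cursor.execute(build_truncate_sql(schema, table))
        conn.commit()
    finally:
        cursor.close()


def connect_with_pg8000(host, port, database, user, password):
    """Open a connection with pg8000.

    Assumption: the zipped pg8000 package uploaded to S3 has been made
    available to the job as a Python library. The import is done lazily so
    the helpers above stay usable without pg8000 installed.
    """
    import pg8000
    return pg8000.connect(user=user, host=host, port=port,
                          database=database, password=password)
```

In a job script this would run before the write step, for example `truncate_table(connect_with_pg8000(...), "public", "my_table")`, where the schema and table names are placeholders.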

Partition Athena query by S3 created date

五迷三道 submitted on 2019-11-27 06:33:13
Question: I have an S3 bucket with ~70 million JSON files (~15 TB) and an Athena table to query by timestamp and some other keys defined in the JSON. The timestamp in the JSON is guaranteed to be more or less equal to the S3 created date of the object (or at least close enough for the purpose of my query). Can I somehow improve query performance (and cost) by adding the created date as something like a "partition", which I understand seems to be possible only for prefixes/folders? edit: I currently
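Since the question notes that partitions map to prefixes/folders, one possible direction is to place objects under a Hive-style date prefix and register each prefix as a partition. The sketch below only illustrates the naming side of that idea. The function names, the `dt` partition column, and the bucket and table names in the usage note are all assumptions, not something stated in the question.

```python
from datetime import datetime, timezone


def partition_prefix(ts):
    """Return a Hive-style 'dt=YYYY-MM-DD/' prefix for a timezone-aware timestamp."""
    return "dt={}/".format(ts.astimezone(timezone.utc).strftime("%Y-%m-%d"))


def partitioned_key(base_prefix, ts, filename):
    """Build the S3 key an object would get under date-based prefixes."""
    return "{}{}{}".format(base_prefix, partition_prefix(ts), filename)


def add_partition_ddl(table, bucket, base_prefix, ts):
    """Build an Athena statement that registers one day's prefix as a partition.

    Assumes the table was declared with a string partition column named 'dt'.
    """
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    location = "s3://{}/{}dt={}/".format(bucket, base_prefix, day)
    return ("ALTER TABLE {} ADD IF NOT EXISTS PARTITION (dt='{}') "
            "LOCATION '{}'").format(table, day, location)
```

For example, `partitioned_key("events/", datetime(2019, 11, 27, 6, 33, tzinfo=timezone.utc), "a.json")` gives `events/dt=2019-11-27/a.json`. A query that filters on `dt` would then only scan the matching prefixes, provided the existing objects are copied into this layout first.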

How to set up a local development environment for Scala Spark ETL to run in AWS Glue?

有些话、适合烂在心里 submitted on 2019-11-27 03:25:56
Question: I'd like to write Scala in my local IDE and then deploy it to AWS Glue as part of a build process, but I'm having trouble finding the libraries required to build the GlueApp skeleton generated by AWS. The aws-java-sdk-glue artifact doesn't contain the imported classes, and I can't find those libraries anywhere else. They must exist somewhere; perhaps they are just a Java/Scala port of this library: aws-glue-libs. The template Scala code from AWS: import com.amazonaws.services.glue