aws-glue

AWS Glue - Development endpoint price for idle time

Submitted by ╄→гoц情女王★ on 2019-12-11 16:56:20
Question: Are there any pricing charges for an AWS Glue development endpoint's idle time? Say I have a development endpoint configured and a job runs for 30 minutes every day. Will I be billed only for the 30-minute duration each day, or also for the development endpoint's idle time? Thanks, Yuva Answer 1: https://aws.amazon.com/glue/pricing/ Development endpoints are optional, and billing applies only if you choose to interactively develop your ETL code. Development endpoints are charged

How to speed up Amazon Athena query executions?

Submitted by 喜夏-厌秋 on 2019-12-11 15:51:50
Question: I'm using Athena query execution to retrieve data from a Glue table. A crawler updates this table every hour from an S3 bucket that is continuously updated by Kinesis Firehose. My Node.js server executes basic queries using Athena, but I realized that some of the requests take so long that my server throws a Server Request Timeout. I checked the Query History in Athena and saw that some of the latest requests are in the Queued state, which means they are waiting to be executed. They all have a small Run
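The entry is truncated before any answer appears, but a common mitigation for this kind of client-side timeout is to submit the query asynchronously and poll its state rather than blocking on a single request. A minimal boto3 sketch along those lines (the database name and output location are placeholders; `boto3` is imported inside the function so the state helper is usable on its own):

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def is_terminal(state):
    """True once Athena reports a final state; QUEUED and RUNNING are not final."""
    return state in TERMINAL_STATES

def run_query(sql, database, output_s3, poll_seconds=2, timeout_seconds=300):
    """Start an Athena query and poll until it reaches a terminal state."""
    import boto3  # deferred import so is_terminal works without AWS installed
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    waited = 0
    while waited < timeout_seconds:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if is_terminal(state):
            return qid, state
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"query {qid} still running after {timeout_seconds}s")
```

The server can then answer the HTTP request immediately with the query execution ID and fetch results later, instead of holding the connection open while the query sits in the queue.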

How to Trigger Glue ETL Pyspark job through S3 Events or AWS Lambda?

Submitted by 前提是你 on 2019-12-11 15:07:36
Question: I'm planning to write certain jobs in AWS Glue ETL using PySpark, which I want to be triggered whenever a new file is dropped in an AWS S3 location, just as we trigger AWS Lambda functions using S3 events. But I see only very limited options for triggering a Glue ETL script. Any help on this would be highly appreciated. Answer 1: The following should work to trigger a Glue job from AWS Lambda. Have the Lambda configured for the appropriate S3 bucket, and IAM roles / permissions
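The answer is cut off before its code appears; a minimal Lambda handler along the lines it describes might look like this (the job name and argument key are placeholders, and `boto3` is imported inside the handler so the event parsing is testable offline):

```python
GLUE_JOB_NAME = "my-etl-job"  # placeholder: the name of your Glue job

def extract_s3_objects(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        yield s3["bucket"]["name"], s3["object"]["key"]

def lambda_handler(event, context):
    import boto3  # deferred so extract_s3_objects can be tested without AWS
    glue = boto3.client("glue")
    run_ids = []
    for bucket, key in extract_s3_objects(event):
        resp = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            # "--s3_source_path" is a placeholder argument name
            Arguments={"--s3_source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(resp["JobRunId"])
    return {"started_runs": run_ids}
```

The Lambda's execution role needs `glue:StartJobRun`, and the bucket's event notification must target this function for the object-created events you care about.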

Query results difference between EMR-Presto and Athena

Submitted by 感情迁移 on 2019-12-11 13:05:04
Question: I have connected the Glue catalog to Athena and to an EMR instance (with Presto installed). I tried running the same query on both but am getting different results: EMR returns 0 rows while Athena returns 43 rows. The query is pretty simple, with a left join, a group by, and a count distinct. It looks like this:

```sql
select t1.customer_id as id,
       t2.purchase_date as purchase_date,
       count(distinct t1.purchase_id) as item_count
from table1 t1
left join table2 as t2 on t2.purchase_id = t1.purchase_id
```

AWS Glue — Access Workflow Parameters from Within Job

Submitted by 与世无争的帅哥 on 2019-12-11 12:55:20
Question: How can I retrieve Glue workflow parameters from within a Glue job? I have an AWS Glue job of type "python shell" that is triggered periodically from within a Glue workflow. The job's code is to be reused across a large number of different workflows, so I'm looking to retrieve workflow parameters to eliminate the need for redundant jobs. The AWS developer guide provides the following tutorial: https://docs.aws.amazon.com/glue/latest/dg/workflow-run-properties-code.html But I've been
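Following the linked tutorial, a workflow-triggered job receives `WORKFLOW_NAME` and `WORKFLOW_RUN_ID` as job arguments and can pass them to `get_workflow_run_properties`. A sketch of that flow (`get_arg` is a small stand-in for `awsglue.utils.getResolvedOptions` so the argument parsing is testable outside Glue):

```python
def get_arg(argv, name):
    """Minimal stand-in for getResolvedOptions: find --name <value> in argv."""
    flag = "--" + name
    for i, token in enumerate(argv):
        if token == flag and i + 1 < len(argv):
            return argv[i + 1]
        if token.startswith(flag + "="):
            return token.split("=", 1)[1]
    raise KeyError(f"missing job argument {flag}")

def load_workflow_properties(argv):
    """Fetch the run properties of the workflow that triggered this job."""
    import boto3  # deferred so get_arg works without AWS installed
    glue = boto3.client("glue")
    resp = glue.get_workflow_run_properties(
        Name=get_arg(argv, "WORKFLOW_NAME"),
        RunId=get_arg(argv, "WORKFLOW_RUN_ID"),
    )
    return resp["RunProperties"]
```

Inside a real job you would call `load_workflow_properties(sys.argv)`; because the properties live on the workflow run rather than the job, the same job definition can be shared by many workflows.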

retrieving s3 path from payload inside AWS glue pythonshell job

Submitted by 别来无恙 on 2019-12-11 11:02:14
Question: I have a Python shell job inside AWS Glue that needs to download a file from an S3 path. This S3 path location is variable, so it comes to the Glue job as a payload in a start_job_run call like below:

```python
import boto3

payload = {'s3_target_file': s3_TARGET_FILE_PATH,
           's3_test_file': s3_TEST_FILE_PATH}
job_def = dict(
    JobName=MY_GLUE_PYTHONSHELL_JOB,
    Arguments=payload,
    WorkerType='Standard',
    NumberOfWorkers=2,
)
response = glue.start_job_run(**job_def)
```

My question is, how do I retrieve those s3 paths
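Glue passes job arguments to the script as command-line flags, so the keys in `Arguments` should carry a `--` prefix; the job side can then read them with `awsglue.utils.getResolvedOptions`, which is available to Python shell jobs. A sketch of the caller-side fix (job name and paths are placeholders):

```python
def to_glue_arguments(payload):
    """Prefix payload keys with '--', the form Glue expects in Arguments."""
    return {
        (k if k.startswith("--") else "--" + k): str(v)
        for k, v in payload.items()
    }

# caller side:
# glue.start_job_run(
#     JobName="my-pythonshell-job",
#     Arguments=to_glue_arguments({"s3_target_file": "s3://bucket/target.csv",
#                                  "s3_test_file": "s3://bucket/test.csv"}),
# )
#
# job side, inside the Python shell script:
# import sys
# from awsglue.utils import getResolvedOptions
# args = getResolvedOptions(sys.argv, ["s3_target_file", "s3_test_file"])
# target_path = args["s3_target_file"]
```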

AWS Glue DynamicFrames and Push Down Predicate

Submitted by 久未见 on 2019-12-11 09:22:17
Question: I am writing an ETL script for AWS Glue, sourced from JSON files stored in S3, in which I create a DynamicFrame and attempt to use push-down-predicate logic to restrict the data coming in:

```python
# Define the data-restricting predicate
now = str(int(round(time.time() * 1000)))
now_minus_7_date = datetime.datetime.now() - datetime.timedelta(days=7)
now_minus_7 = str(int(time.mktime(now_minus_7_date.timetuple()) * 1000))
last_7_predicate = "\"timestamp BETWEEN '" + now_minus_7 + "' AND '" + now
```
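One thing worth noting about the approach above: `push_down_predicate` filters on partition columns at load time, so a predicate on a plain data field such as `timestamp` will not prune anything unless that field is a partition key. Assuming the table is partitioned by a date-string column (hypothetically named `ingest_date` here), the predicate could be built like this:

```python
from datetime import date, timedelta

def last_n_days_predicate(n, today=None, column="ingest_date"):
    """Build a partition predicate such as "ingest_date >= '2019-12-04'"."""
    today = today or date.today()
    cutoff = today - timedelta(days=n)
    return f"{column} >= '{cutoff:%Y-%m-%d}'"

# hypothetical usage inside the Glue script:
# dyf = glueContext.create_dynamic_frame.from_catalog(
#     database="my_db", table_name="my_table",
#     push_down_predicate=last_n_days_predicate(7))
```

Rows that need filtering on a non-partition column can still be restricted after loading, e.g. with a `Filter` transform or a DataFrame `where`, but that filtering happens after the read rather than pruning it.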

Break down a table to pivot in columns (SQL,PYSPARK)

Submitted by ε祈祈猫儿з on 2019-12-11 08:55:06
Question: I'm working in a PySpark environment with Python 3.6 in AWS Glue. I have this table:

```
+----+-----+-----+-----+
|year|month|total| loop|
+----+-----+-----+-----+
|2012|    1|   20|loop1|
|2012|    2|   30|loop1|
|2012|    1|   10|loop2|
|2012|    2|    5|loop2|
|2012|    1|   50|loop3|
|2012|    2|   60|loop3|
+----+-----+-----+-----+
```

And I need to get output like:

```
year month total_loop1 total_loop2 total_loop3
2012     1          20          10          50
2012     2          30           5          60
```

The closest I have gotten is with this SQL code: select a.year,a.month, a.total,b
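The entry is truncated, but the usual PySpark answer here is `df.groupBy("year", "month").pivot("loop").sum("total")`; the pivoted columns then come out named `loop1`..`loop3` and would need renaming to get the `total_loop1` form. Since spinning up a Spark session is heavy for a demo, the same reshaping is illustrated below in plain Python:

```python
def pivot_rows(rows, index_keys, pivot_key, value_key, prefix="total_"):
    """Reshape long rows into one row per index with a column per pivot value,
    mirroring what groupBy(...).pivot(...).sum(...) does in PySpark."""
    out = {}
    for row in rows:
        idx = tuple(row[k] for k in index_keys)
        group = out.setdefault(idx, dict(zip(index_keys, idx)))
        col = prefix + str(row[pivot_key])
        group[col] = group.get(col, 0) + row[value_key]
    return list(out.values())

rows = [
    {"year": 2012, "month": 1, "total": 20, "loop": "loop1"},
    {"year": 2012, "month": 2, "total": 30, "loop": "loop1"},
    {"year": 2012, "month": 1, "total": 10, "loop": "loop2"},
    {"year": 2012, "month": 2, "total": 5,  "loop": "loop2"},
    {"year": 2012, "month": 1, "total": 50, "loop": "loop3"},
    {"year": 2012, "month": 2, "total": 60, "loop": "loop3"},
]
pivoted = pivot_rows(rows, ["year", "month"], "loop", "total")
```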

AWS Glue Truncate Redshift Table

Submitted by ≯℡__Kan透↙ on 2019-12-11 03:38:34
Question: I have created a Glue job that copies data from S3 (a CSV file) to Redshift. It works and populates the desired table. However, I need to purge the table during this process, as I am left with duplicate records after the process completes. I'm looking for a way to add this purge to the Glue process. Any advice would be appreciated. Thanks. Answer 1: Have you had a look at Job Bookmarks in Glue? It's a feature for keeping the high-water mark and works with S3 only. I am not 100% sure, but it may
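The bookmark suggestion addresses re-reading the same S3 data; for actually purging the target table, Glue's Redshift writer accepts SQL to run before the load via the `preactions` connection option. A hedged sketch (table, database, and connection names are placeholders):

```python
def redshift_write_options(table, database, purge=True):
    """Build connection_options for write_dynamic_frame.from_jdbc_conf;
    the 'preactions' SQL runs on Redshift before the COPY, e.g. a TRUNCATE."""
    opts = {"dbtable": table, "database": database}
    if purge:
        opts["preactions"] = f"truncate table {table};"
    return opts

# hypothetical usage inside the Glue job:
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=dyf,
#     catalog_connection="my-redshift-connection",
#     connection_options=redshift_write_options("public.my_table", "mydb"),
#     redshift_tmp_dir=args["TempDir"])
```

TRUNCATE drops every existing row before the load, so this suits full-refresh jobs; if only some rows are duplicated, a targeted DELETE in `preactions` would be the gentler variant.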

How to specify join types in AWS Glue?

Submitted by 别等时光非礼了梦想. on 2019-12-11 02:29:01
Question: I am using AWS Glue to join two tables. By default, it performs an INNER JOIN. I want to do a LEFT OUTER JOIN. I referred to the AWS Glue documentation, but there is no way to pass the join type to the Join.apply() method. Is there a way to achieve this in AWS Glue?

```python
## @type: Join
## @args: [keys1 = id, keys2 = "user_id"]
## @return: cUser
## @inputs: [frame1 = cUser0, frame2 = cUserLogins]
#cUser = Join.apply(frame1 = cUser0, frame2 = cUserLogins, keys1 = "id", keys2 = "user_id", transformation_ctx = "
```
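Join.apply only performs an inner join; the usual workaround is to convert both DynamicFrames to Spark DataFrames, join with the desired type, and convert back. The Glue-side shape is sketched in the comments below (frame names taken from the question), and the left-outer semantics that workaround buys are illustrated in plain Python:

```python
# Inside the Glue script the fix looks roughly like:
#   u = cUser0.toDF()
#   l = cUserLogins.toDF()
#   joined = u.join(l, u["id"] == l["user_id"], "left_outer")
#   from awsglue.dynamicframe import DynamicFrame
#   cUser = DynamicFrame.fromDF(joined, glueContext, "cUser")

def left_outer_join(left, right, lkey, rkey):
    """Plain-Python LEFT OUTER JOIN: every left row survives; rows with no
    match simply gain no right-side columns. On a key collision the left
    value wins, which plain dict merging makes explicit here."""
    index = {}
    for row in right:
        index.setdefault(row[rkey], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[lkey]) or [{}]:
            joined.append({**match, **row})
    return joined
```

Converting back to a DynamicFrame at the end keeps the rest of the generated script (mappings, sinks) working unchanged.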