aws-glue

AWS Glue - Development endpoint price for idle time

Submitted by ╄→гoц情女王★ on 2019-12-11 16:56:20
Question: Are there any pricing charges for an AWS Glue development endpoint's idle time? Say I have a development endpoint configured and a job runs for 30 minutes every day. Will I be billed only for the 30-minute duration each day, or also for the development endpoint's idle time? Thanks, Yuva Answer 1: https://aws.amazon.com/glue/pricing/ Development endpoints are optional, and billing applies only if you choose to interactively develop your ETL code. Development endpoints are charged

How to speed up Amazon Athena query executions?

Submitted by 喜夏-厌秋 on 2019-12-11 15:51:50
Question: I'm using Athena query execution to retrieve data from a Glue table. A crawler updates this table every hour from an S3 bucket that is continuously updated by Kinesis Firehose. My Node.js server executes basic queries using Athena, but I realized that some of the requests take so long that my server throws a Server Request Timeout. I checked the Query History in Athena and saw that some of the latest requests are in the Queued state, which means they are waiting to be executed. They all have a small Run
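The entry is truncated before any answer appears, but a common mitigation for this kind of client-side timeout is to submit the query asynchronously and poll its state rather than blocking on a single request. A minimal boto3 sketch along those lines (the database name and output location are placeholders; `boto3` is imported inside the function so the state helper is usable on its own):

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def is_terminal(state):
    """True once Athena reports a final state; QUEUED and RUNNING are not final."""
    return state in TERMINAL_STATES

def run_query(sql, database, output_s3, poll_seconds=2, timeout_seconds=300):
    """Start an Athena query and poll until it reaches a terminal state."""
    import boto3  # deferred import so is_terminal works without AWS installed
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    waited = 0
    while waited < timeout_seconds:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if is_terminal(state):
            return qid, state
        time.sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"query {qid} still running after {timeout_seconds}s")
```

The server can then answer the HTTP request immediately with the query execution ID and fetch results later, instead of holding the connection open while the query sits in the queue.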

How to Trigger Glue ETL Pyspark job through S3 Events or AWS Lambda?

Submitted by 前提是你 on 2019-12-11 15:07:36
Question: I'm planning to write certain jobs in AWS Glue ETL using PySpark, which I want to be triggered whenever a new file is dropped in an AWS S3 location, just as we trigger AWS Lambda functions using S3 events. But I see only very limited options for triggering a Glue ETL script. Any help on this would be highly appreciated. Answer 1: The following should work to trigger a Glue job from AWS Lambda. Have the Lambda configured for the appropriate S3 bucket, and IAM roles / permissions
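The answer is cut off before its code appears; a minimal Lambda handler along the lines it describes might look like this (the job name and argument key are placeholders, and `boto3` is imported inside the handler so the event parsing is testable offline):

```python
GLUE_JOB_NAME = "my-etl-job"  # placeholder: the name of your Glue job

def extract_s3_objects(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        yield s3["bucket"]["name"], s3["object"]["key"]

def lambda_handler(event, context):
    import boto3  # deferred so extract_s3_objects can be tested without AWS
    glue = boto3.client("glue")
    run_ids = []
    for bucket, key in extract_s3_objects(event):
        resp = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            # "--s3_source_path" is a placeholder argument name
            Arguments={"--s3_source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(resp["JobRunId"])
    return {"started_runs": run_ids}
```

The Lambda's execution role needs `glue:StartJobRun`, and the bucket's event notification must target this function for the object-created events you care about.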

Query results difference between EMR-Presto and Athena

Submitted by 感情迁移 on 2019-12-11 13:05:04
Question: I have connected the Glue catalog to Athena and to an EMR instance (with Presto installed). I tried running the same query on both but am getting different results: EMR returns 0 rows while Athena returns 43 rows. The query is pretty simple, with a left join, a group by, and a count distinct. It looks like this:

```sql
select t1.customer_id as id,
       t2.purchase_date as purchase_date,
       count(distinct t1.purchase_id) as item_count
from table1 t1
left join table2 as t2 on t2.purchase_id = t1.purchase_id
```

AWS Glue — Access Workflow Parameters from Within Job

Submitted by 与世无争的帅哥 on 2019-12-11 12:55:20
Question: How can I retrieve Glue workflow parameters from within a Glue job? I have an AWS Glue job of type "python shell" that is triggered periodically from within a Glue workflow. The job's code is to be reused across a large number of different workflows, so I'm looking to retrieve workflow parameters to eliminate the need for redundant jobs. The AWS developer guide provides the following tutorial: https://docs.aws.amazon.com/glue/latest/dg/workflow-run-properties-code.html But I've been
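Following the linked tutorial, a workflow-triggered job receives `WORKFLOW_NAME` and `WORKFLOW_RUN_ID` as job arguments and can pass them to `get_workflow_run_properties`. A sketch of that flow (`get_arg` is a small stand-in for `awsglue.utils.getResolvedOptions` so the argument parsing is testable outside Glue):

```python
def get_arg(argv, name):
    """Minimal stand-in for getResolvedOptions: find --name <value> in argv."""
    flag = "--" + name
    for i, token in enumerate(argv):
        if token == flag and i + 1 < len(argv):
            return argv[i + 1]
        if token.startswith(flag + "="):
            return token.split("=", 1)[1]
    raise KeyError(f"missing job argument {flag}")

def load_workflow_properties(argv):
    """Fetch the run properties of the workflow that triggered this job."""
    import boto3  # deferred so get_arg works without AWS installed
    glue = boto3.client("glue")
    resp = glue.get_workflow_run_properties(
        Name=get_arg(argv, "WORKFLOW_NAME"),
        RunId=get_arg(argv, "WORKFLOW_RUN_ID"),
    )
    return resp["RunProperties"]
```

Inside a real job you would call `load_workflow_properties(sys.argv)`; because the properties live on the workflow run rather than the job, the same job definition can be shared by many workflows.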

retrieving s3 path from payload inside AWS glue pythonshell job

Submitted by 别来无恙 on 2019-12-11 11:02:14
Question: I have a Python shell job inside AWS Glue that needs to download a file from an S3 path. This S3 path location is variable, so it comes to the Glue job as a payload in a start_job_run call like below:

```python
import boto3

payload = {'s3_target_file': s3_TARGET_FILE_PATH,
           's3_test_file': s3_TEST_FILE_PATH}
job_def = dict(
    JobName=MY_GLUE_PYTHONSHELL_JOB,
    Arguments=payload,
    WorkerType='Standard',
    NumberOfWorkers=2,
)
response = glue.start_job_run(**job_def)
```

My question is, how do I retrieve those s3 paths
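Glue passes job arguments to the script as command-line flags, so the keys in `Arguments` should carry a `--` prefix; the job side can then read them with `awsglue.utils.getResolvedOptions`, which is available to Python shell jobs. A sketch of the caller-side fix (job name and paths are placeholders):

```python
def to_glue_arguments(payload):
    """Prefix payload keys with '--', the form Glue expects in Arguments."""
    return {
        (k if k.startswith("--") else "--" + k): str(v)
        for k, v in payload.items()
    }

# caller side:
# glue.start_job_run(
#     JobName="my-pythonshell-job",
#     Arguments=to_glue_arguments({"s3_target_file": "s3://bucket/target.csv",
#                                  "s3_test_file": "s3://bucket/test.csv"}),
# )
#
# job side, inside the Python shell script:
# import sys
# from awsglue.utils import getResolvedOptions
# args = getResolvedOptions(sys.argv, ["s3_target_file", "s3_test_file"])
# target_path = args["s3_target_file"]
```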

AWS Glue DynamicFrames and Push Down Predicate

Submitted by 久未见 on 2019-12-11 09:22:17
Question: I am writing an ETL script for AWS Glue, sourced from JSON files stored in S3, in which I create a DynamicFrame and attempt to use push-down-predicate logic to restrict the data coming in:

```python
# Define the data-restricting predicate
now = str(int(round(time.time() * 1000)))
now_minus_7_date = datetime.datetime.now() - datetime.timedelta(days=7)
now_minus_7 = str(int(time.mktime(now_minus_7_date.timetuple()) * 1000))
last_7_predicate = "\"timestamp BETWEEN '" + now_minus_7 + "' AND '" + now
```
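One thing worth noting about the approach above: `push_down_predicate` filters on partition columns at load time, so a predicate on a plain data field such as `timestamp` will not prune anything unless that field is a partition key. Assuming the table is partitioned by a date-string column (hypothetically named `ingest_date` here), the predicate could be built like this:

```python
from datetime import date, timedelta

def last_n_days_predicate(n, today=None, column="ingest_date"):
    """Build a partition predicate such as "ingest_date >= '2019-12-04'"."""
    today = today or date.today()
    cutoff = today - timedelta(days=n)
    return f"{column} >= '{cutoff:%Y-%m-%d}'"

# hypothetical usage inside the Glue script:
# dyf = glueContext.create_dynamic_frame.from_catalog(
#     database="my_db", table_name="my_table",
#     push_down_predicate=last_n_days_predicate(7))
```

Rows that need filtering on a non-partition column can still be restricted after loading, e.g. with a `Filter` transform or a DataFrame `where`, but that filtering happens after the read rather than pruning it.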

Break down a table to pivot in columns (SQL,PYSPARK)

Submitted by ε祈祈猫儿з on 2019-12-11 08:55:06
Question: I'm working in a PySpark environment with Python 3.6 in AWS Glue. I have this table:

```
+----+-----+-----+-----+
|year|month|total| loop|
+----+-----+-----+-----+
|2012|    1|   20|loop1|
|2012|    2|   30|loop1|
|2012|    1|   10|loop2|
|2012|    2|    5|loop2|
|2012|    1|   50|loop3|
|2012|    2|   60|loop3|
+----+-----+-----+-----+
```

And I need to get output like:

```
year month total_loop1 total_loop2 total_loop3
2012     1          20          10          50
2012     2          30           5          60
```

The closest I have gotten is with this SQL code: select a.year,a.month, a.total,b
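The entry is truncated, but the usual PySpark answer here is `df.groupBy("year", "month").pivot("loop").sum("total")`; the pivoted columns then come out named `loop1`..`loop3` and would need renaming to get the `total_loop1` form. Since spinning up a Spark session is heavy for a demo, the same reshaping is illustrated below in plain Python:

```python
def pivot_rows(rows, index_keys, pivot_key, value_key, prefix="total_"):
    """Reshape long rows into one row per index with a column per pivot value,
    mirroring what groupBy(...).pivot(...).sum(...) does in PySpark."""
    out = {}
    for row in rows:
        idx = tuple(row[k] for k in index_keys)
        group = out.setdefault(idx, dict(zip(index_keys, idx)))
        col = prefix + str(row[pivot_key])
        group[col] = group.get(col, 0) + row[value_key]
    return list(out.values())

rows = [
    {"year": 2012, "month": 1, "total": 20, "loop": "loop1"},
    {"year": 2012, "month": 2, "total": 30, "loop": "loop1"},
    {"year": 2012, "month": 1, "total": 10, "loop": "loop2"},
    {"year": 2012, "month": 2, "total": 5,  "loop": "loop2"},
    {"year": 2012, "month": 1, "total": 50, "loop": "loop3"},
    {"year": 2012, "month": 2, "total": 60, "loop": "loop3"},
]
pivoted = pivot_rows(rows, ["year", "month"], "loop", "total")
```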

AWS Glue Truncate Redshift Table

Submitted by ≯℡__Kan透↙ on 2019-12-11 03:38:34
Question: I have created a Glue job that copies data from S3 (a CSV file) to Redshift. It works and populates the desired table. However, I need to purge the table during this process, as I am left with duplicate records after the process completes. I'm looking for a way to add this purge to the Glue process. Any advice would be appreciated. Thanks. Answer 1: Have you had a look at Job Bookmarks in Glue? It's a feature for keeping the high-water mark and works with S3 only. I am not 100% sure, but it may
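The bookmark suggestion addresses re-reading the same S3 data; for actually purging the target table, Glue's Redshift writer accepts SQL to run before the load via the `preactions` connection option. A hedged sketch (table, database, and connection names are placeholders):

```python
def redshift_write_options(table, database, purge=True):
    """Build connection_options for write_dynamic_frame.from_jdbc_conf;
    the 'preactions' SQL runs on Redshift before the COPY, e.g. a TRUNCATE."""
    opts = {"dbtable": table, "database": database}
    if purge:
        opts["preactions"] = f"truncate table {table};"
    return opts

# hypothetical usage inside the Glue job:
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=dyf,
#     catalog_connection="my-redshift-connection",
#     connection_options=redshift_write_options("public.my_table", "mydb"),
#     redshift_tmp_dir=args["TempDir"])
```

TRUNCATE drops every existing row before the load, so this suits full-refresh jobs; if only some rows are duplicated, a targeted DELETE in `preactions` would be the gentler variant.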

How to specify join types in AWS Glue?

Submitted by 别等时光非礼了梦想. on 2019-12-11 02:29:01
Question: I am using AWS Glue to join two tables. By default, it performs an INNER JOIN. I want to do a LEFT OUTER JOIN. I referred to the AWS Glue documentation, but there is no way to pass the join type to the Join.apply() method. Is there a way to achieve this in AWS Glue?

```python
## @type: Join
## @args: [keys1 = id, keys2 = "user_id"]
## @return: cUser
## @inputs: [frame1 = cUser0, frame2 = cUserLogins]
#cUser = Join.apply(frame1 = cUser0, frame2 = cUserLogins, keys1 = "id", keys2 = "user_id", transformation_ctx = "
```
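Join.apply only performs an inner join; the usual workaround is to convert both DynamicFrames to Spark DataFrames, join with the desired type, and convert back. The Glue-side shape is sketched in the comments below (frame names taken from the question), and the left-outer semantics that workaround buys are illustrated in plain Python:

```python
# Inside the Glue script the fix looks roughly like:
#   u = cUser0.toDF()
#   l = cUserLogins.toDF()
#   joined = u.join(l, u["id"] == l["user_id"], "left_outer")
#   from awsglue.dynamicframe import DynamicFrame
#   cUser = DynamicFrame.fromDF(joined, glueContext, "cUser")

def left_outer_join(left, right, lkey, rkey):
    """Plain-Python LEFT OUTER JOIN: every left row survives; rows with no
    match simply gain no right-side columns. On a key collision the left
    value wins, which plain dict merging makes explicit here."""
    index = {}
    for row in right:
        index.setdefault(row[rkey], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[lkey]) or [{}]:
            joined.append({**match, **row})
    return joined
```

Converting back to a DynamicFrame at the end keeps the rest of the generated script (mappings, sinks) working unchanged.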