aws-glue

When using Relationalize in Glue there is no id in the root table

Submitted by 谁说胖子不能爱 on 2020-01-05 06:08:00
Question: I have a DynamicFrame in Glue and I am using the Relationalize method, which creates three new dynamic frames: root_table, root_table_1 and root_table_2. When I print the schema of the tables, or after I insert the tables into a database, I notice that the id is missing from root_table, so I cannot join root_table to the other tables. I have tried all the possible combinations. Is there something I'm missing?

datasource1 = Relationalize.apply(frame = renameId, name = "root_ds",
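A sketch of how the pieces usually join up after Relationalize (the staging path, the nested array field "items", and the derived frame names below are assumptions, not details from the question): each child frame gets generated "id" and "index" columns, while the root frame keeps a matching numeric key in the column that used to hold the nested array, so that column, rather than a literal "id" field on the root, is the join key.

from awsglue.transforms import Relationalize

collection = Relationalize.apply(
    frame=renameId,                                      # DynamicFrame from the question
    staging_path="s3://my-temp-bucket/relationalize/",   # assumed staging location
    name="root_ds",
    transformation_ctx="relationalize1",
)
print(collection.keys())          # e.g. ['root_ds', 'root_ds_items', ...]

root_df = collection.select("root_ds").toDF()
child_df = collection.select("root_ds_items").toDF()     # assumed child frame name

# 'items' is the (assumed) array column on the root that now holds the
# generated foreign key; the child frame exposes the matching value as 'id'.
joined = root_df.join(child_df, root_df["items"] == child_df["id"], "left")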

How can I use a Lambda function to start a Glue (ETL) job when a text file is uploaded to an S3 bucket?

Submitted by 牧云@^-^@ on 2020-01-04 05:55:27
Question: I am trying to set up a Lambda function that triggers a Glue (ETL) job when a .txt file is uploaded to an S3 bucket. I am using Python 3.7. So far I have this:

from __future__ import print_function
import json
import boto3
import urllib

print('Loading function')

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # handler
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.quote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        # what to
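A minimal sketch of the missing piece, assuming the Glue job is named my-glue-etl-job and that the bucket and key should be passed along as job arguments (both assumptions): once the handler has pulled the object details out of the S3 event, it can call start_job_run on a boto3 Glue client.

import json
import urllib.parse
import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # S3 event keys arrive URL-encoded, so decode rather than re-encode them.
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Only react to .txt objects; an S3 trigger suffix filter can also do this.
    if key.endswith('.txt'):
        response = glue.start_job_run(
            JobName='my-glue-etl-job',          # assumed Glue job name
            Arguments={                         # assumed argument names
                '--source_bucket': source_bucket,
                '--source_key': key,
            },
        )
        print('Started Glue job run:', response['JobRunId'])

    return {'statusCode': 200, 'body': json.dumps('ok')}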

Filtering nested JSON in AWS Glue

Submitted by 依然范特西╮ on 2020-01-03 06:03:08
Question: We would like to use an AWS Glue job to filter JSON messages within an S3 bucket. Here is some example JSON:

{ "property": {"subproperty1": "A", "subproperty2": "B" }}
{ "property": {"subproperty1": "C", "subproperty2": "D" }}

We want to filter on subproperty1 in ["A", "B"]. This is what we try:

applyFilter1 = Filter.apply(
    frame = datasource0,
    f = lambda x: x["property.subproperty1"] in ["A", "B"]
)

The output is then written to a new S3 bucket as follows:

datasink2 = glueContext.write_dynamic
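One thing worth trying (a sketch, not a confirmed fix): inside Filter.apply the record behaves like nested dictionaries, so the dotted string "property.subproperty1" is treated as a literal key name; indexing one level at a time may be what is needed.

from awsglue.transforms import Filter

# Step into the nested structure one level at a time instead of using a
# dotted path.
applyFilter1 = Filter.apply(
    frame=datasource0,
    f=lambda x: x["property"]["subproperty1"] in ["A", "B"],
)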

Is it possible to use Jupyter Notebook for AWS Glue instead of Zeppelin

Submitted by 可紊 on 2020-01-02 07:40:14
Question: I got started using AWS Glue for my data ETL. I've pulled my data sources into my AWS Data Catalog and am about to create a job for the data from one particular Postgres database I have for testing. I have read online that when authoring your own job, you can use a Zeppelin notebook. I haven't used Zeppelin at all, but I have used Jupyter notebooks heavily as a Python developer, for data analytics and machine learning self-study. I haven't been able to find it

Scheduling data extraction from AWS Redshift to S3

Submitted by 匆匆过客 on 2019-12-31 01:49:34
Question: I am trying to build a job that extracts data from Redshift and writes the same data to S3 buckets. So far I have explored AWS Glue, but Glue is not capable of running custom SQL on Redshift. I know we can run UNLOAD commands and store the output in S3 directly. I am looking for a solution that can be parameterised and scheduled in AWS.

Answer 1: Consider using AWS Data Pipeline for this. AWS Data Pipeline is an AWS service that allows you to define and schedule regular jobs. These jobs are
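Besides Data Pipeline, another pattern that fits the "parameterised and scheduled" requirement is a small Python job (for example a Glue Python shell job or a Lambda on a cron schedule) that issues the UNLOAD itself. The sketch below is hedged: the cluster endpoint, credentials, IAM role ARN, and the availability of psycopg2 in the job environment are all assumptions.

import psycopg2

UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM my_schema.my_table WHERE load_date = ''{load_date}''')
    TO 's3://my-export-bucket/exports/{load_date}/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
    DELIMITER '|' GZIP ALLOWOVERWRITE;
"""

def run_unload(load_date):
    # Connection details are placeholders; in practice they would come from
    # job parameters or Secrets Manager.
    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="mydb", user="etl_user", password="***",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(UNLOAD_SQL.format(load_date=load_date))
        conn.commit()
    finally:
        conn.close()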

Tables not found in Spark SQL after migrating from EMR to AWS Glue

Submitted by 耗尽温柔 on 2019-12-25 01:18:40
Question: I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata. I create Hive external tables, they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL like spark.sql("select * from hive_table ..."). Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It looks like Glue jobs do not use the Glue catalog for Spark SQL the same way that Spark SQL running on EMR does. I can work around
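A sketch of one workaround (the database and table names are assumptions): load the table through the Glue catalog API and register it as a temporary view so that spark.sql() can resolve it, instead of relying on the job's Spark session being wired to the Glue Data Catalog as its Hive metastore. Depending on the Glue version, passing the --enable-glue-datacatalog job parameter may also expose the catalog to Spark SQL directly.

# spark and glueContext are the session/context the Glue job already creates.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_database",    # assumed catalog database name
    table_name="hive_table",        # table referenced in the question
)
dyf.toDF().createOrReplaceTempView("hive_table")

result = spark.sql("select * from hive_table")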

Python list of USA holidays between a date range

Submitted by 青春壹個敷衍的年華 on 2019-12-24 18:43:47
Question: I need to fetch the list of holidays in a given range, i.e., if the start date is 20/12/2016 and the end date is 10/1/2017, then I should get 25/12/2016 and 1/1/2017. I can do this using Pandas, but in my case I have the limitation that I need to use the AWS Glue service, and Pandas is not supported in AWS Glue. I am trying to use the native Python library holidays, but I couldn't find API documentation for fetching holidays between a from and a to date. Here is what I have tried:

import holidays
import datetime
from datetime import
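A sketch of one way to do it with the holidays package (assuming the package can be shipped with the Glue job, e.g. via --extra-py-files or --additional-python-modules, depending on the Glue version): build the calendar for every year the range touches, then keep only the dates inside the range.

import datetime
import holidays

start = datetime.date(2016, 12, 20)
end = datetime.date(2017, 1, 10)

# The holidays object is dict-like (date -> holiday name), so iterating it
# yields dates that can be compared against the range boundaries.
us_holidays = holidays.US(years=range(start.year, end.year + 1))
in_range = sorted(d for d in us_holidays if start <= d <= end)

print(in_range)
# Christmas Day and New Year's Day fall inside this window; depending on the
# library version, "observed" dates may be included as well.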

Can a SSE:KMS Key ID be specified when writing to S3 in an AWS Glue Job?

Submitted by 瘦欲@ on 2019-12-24 10:38:13
Question: If you follow the AWS Glue Add Job wizard to create a script that writes Parquet files to S3, you end up with generated code something like this:

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={"path": "s3://my-s3-bucket/datafile.parquet"},
    format="parquet",
    transformation_ctx="datasink4",
)

Is it possible to specify a KMS key so that the data is encrypted in the bucket?

Answer 1: Glue Scala job:

val spark: SparkContext = new
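A PySpark sketch of the same idea as the Scala answer, assuming the EMRFS-style Hadoop properties below are honoured by the job's S3 writer (the key ARN is a placeholder); attaching a Glue security configuration with SSE-KMS to the job is the managed alternative.

# Push the SSE-KMS settings into the Hadoop configuration before writing.
hadoop_conf = glueContext.spark_session.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.enableServerSideEncryption", "true")
hadoop_conf.set("fs.s3.serverSideEncryption.kms.keyId",
                "arn:aws:kms:us-east-1:123456789012:key/your-key-id")

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={"path": "s3://my-s3-bucket/datafile.parquet"},
    format="parquet",
    transformation_ctx="datasink4",
)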

AWS Glue Crawlers and large tables stored in S3

Submitted by 北战南征 on 2019-12-24 10:18:40
Question: I have a general question about AWS Glue and its crawlers. I have some data streaming into S3 buckets, and I use AWS Athena to access it as external tables in Redshift. The tables are partitioned by hour, and Glue crawlers update the partitions and the table structure every hour. The problem is that the crawlers take longer and longer, and some day they will not finish in less than an hour. Is there some setting to speed up this process, or some proper alternative to the crawlers in
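One alternative to re-crawling everything (a sketch; the database, table, partition keys, and S3 layout are assumptions): since each run only adds a new hourly partition, the partition can be registered directly, for example with an Athena ALTER TABLE statement fired from a small scheduled function.

import datetime
import boto3

athena = boto3.client("athena")

def add_hourly_partition(ts: datetime.datetime):
    # Register just the new hour instead of re-crawling the whole table.
    ddl = (
        "ALTER TABLE my_db.my_events ADD IF NOT EXISTS "
        f"PARTITION (year='{ts:%Y}', month='{ts:%m}', day='{ts:%d}', hour='{ts:%H}') "
        f"LOCATION 's3://my-data-bucket/events/{ts:%Y/%m/%d/%H}/'"
    )
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )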

Terraform does not detect changes to Lambda source files

Submitted by 南楼画角 on 2019-12-24 09:23:50
Question: In my main.tf I have the following:

data "template_file" "lambda_script_temp_file" {
  template = "${file("../../../fn/lambda_script.py")}"
}

data "template_file" "library_temp_file" {
  template = "${file("../../../library.py")}"
}

data "template_file" "init_temp_file" {
  template = "${file("../../../__init__.py")}"
}

data "archive_file" "lambda_resources_zip" {
  type        = "zip"
  output_path = "${path.module}/lambda_resources.zip"
  source {
    content = "${data.template_file.lambda_script_temp_file
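The usual cause here is that the aws_lambda_function resource is not tied to the hash of the regenerated archive, so Terraform sees nothing to update when only the source files change. A sketch of the kind of wiring that addresses it (the function name, handler, runtime, and IAM role reference are assumptions):

resource "aws_lambda_function" "lambda_fn" {
  function_name    = "my-function"
  filename         = "${data.archive_file.lambda_resources_zip.output_path}"
  source_code_hash = "${data.archive_file.lambda_resources_zip.output_base64sha256}"
  handler          = "lambda_script.lambda_handler"
  runtime          = "python3.7"
  role             = "${aws_iam_role.lambda_role.arn}"
}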