aws-glue

When using Relationalize in Glue there is no id in the root table

Submitted by 谁说胖子不能爱 on 2020-01-05 06:08:00
Question: I have a DynamicFrame in Glue and I am using the Relationalize method, which creates three new dynamic frames: root_table, root_table_1 and root_table_2. When I print the schema of the tables, or after I insert the tables into a database, I notice that the id is missing from root_table, so I cannot join root_table to the other tables. I have tried all the possible combinations. Is there something I'm missing?

datasource1 = Relationalize.apply(frame = renameId, name = "root_ds",
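A sketch of how the pieces usually join up after Relationalize (the staging path, the nested array field "items", and the derived frame names below are assumptions, not details from the question): each child frame gets generated "id" and "index" columns, while the root frame keeps a matching numeric key in the column that used to hold the nested array, so that column, rather than a literal "id" field on the root, is the join key.

from awsglue.transforms import Relationalize

collection = Relationalize.apply(
    frame=renameId,                                      # DynamicFrame from the question
    staging_path="s3://my-temp-bucket/relationalize/",   # assumed staging location
    name="root_ds",
    transformation_ctx="relationalize1",
)
print(collection.keys())          # e.g. ['root_ds', 'root_ds_items', ...]

root_df = collection.select("root_ds").toDF()
child_df = collection.select("root_ds_items").toDF()     # assumed child frame name

# 'items' is the (assumed) array column on the root that now holds the
# generated foreign key; the child frame exposes the matching value as 'id'.
joined = root_df.join(child_df, root_df["items"] == child_df["id"], "left")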

How can I use a Lambda function to start a Glue (ETL) job when a text file is uploaded to an S3 bucket?

Submitted by 牧云@^-^@ on 2020-01-04 05:55:27
Question: I am trying to set up a Lambda function that triggers a Glue (ETL) job when a .txt file is uploaded to an S3 bucket. I am using Python 3.7. So far I have this:

from __future__ import print_function
import json
import boto3
import urllib

print('Loading function')

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # handler
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.quote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        # what to
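A minimal sketch of the missing piece, assuming the Glue job is named my-glue-etl-job and that the bucket and key should be passed along as job arguments (both assumptions): once the handler has pulled the object details out of the S3 event, it can call start_job_run on a boto3 Glue client.

import json
import urllib.parse
import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # S3 event keys arrive URL-encoded, so decode rather than re-encode them.
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Only react to .txt objects; an S3 trigger suffix filter can also do this.
    if key.endswith('.txt'):
        response = glue.start_job_run(
            JobName='my-glue-etl-job',          # assumed Glue job name
            Arguments={                         # assumed argument names
                '--source_bucket': source_bucket,
                '--source_key': key,
            },
        )
        print('Started Glue job run:', response['JobRunId'])

    return {'statusCode': 200, 'body': json.dumps('ok')}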

Filtering nested JSON in AWS Glue

Submitted by 依然范特西╮ on 2020-01-03 06:03:08
Question: We would like to use an AWS Glue job to filter JSON messages within an S3 bucket. Here is some example JSON:

{ "property": {"subproperty1": "A", "subproperty2": "B" }}
{ "property": {"subproperty1": "C", "subproperty2": "D" }}

We want to filter on subproperty1 in ["A", "B"]. This is what we try:

applyFilter1 = Filter.apply(
    frame = datasource0,
    f = lambda x: x["property.subproperty1"] in ["A", "B"]
)

The output is then written to a new S3 bucket as follows:

datasink2 = glueContext.write_dynamic
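One thing worth trying (a sketch, not a confirmed fix): inside Filter.apply the record behaves like nested dictionaries, so the dotted string "property.subproperty1" is treated as a literal key name; indexing one level at a time may be what is needed.

from awsglue.transforms import Filter

# Step into the nested structure one level at a time instead of using a
# dotted path.
applyFilter1 = Filter.apply(
    frame=datasource0,
    f=lambda x: x["property"]["subproperty1"] in ["A", "B"],
)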

Is it possible to use Jupyter Notebook for AWS Glue instead of Zeppelin

Submitted by 可紊 on 2020-01-02 07:40:14
Question: I got started using AWS Glue for my data ETL. I've pulled my data sources into my AWS Data Catalog and am about to create a job for the data from one particular Postgres database I have for testing. I have read online that when authoring your own job, you can use a Zeppelin notebook. I haven't used Zeppelin at all, but I have used Jupyter notebooks heavily as a Python developer, for data analytics and machine learning self-study. I haven't been able to find it

Scheduling data extraction from AWS Redshift to S3

Submitted by 匆匆过客 on 2019-12-31 01:49:34
Question: I am trying to build a job that extracts data from Redshift and writes the same data to S3 buckets. So far I have explored AWS Glue, but Glue is not capable of running custom SQL on Redshift. I know we can run UNLOAD commands and store the output in S3 directly. I am looking for a solution that can be parameterised and scheduled in AWS.

Answer 1: Consider using AWS Data Pipeline for this. AWS Data Pipeline is an AWS service that allows you to define and schedule regular jobs. These jobs are
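Besides Data Pipeline, another pattern that fits the "parameterised and scheduled" requirement is a small Python job (for example a Glue Python shell job or a Lambda on a cron schedule) that issues the UNLOAD itself. The sketch below is hedged: the cluster endpoint, credentials, IAM role ARN, and the availability of psycopg2 in the job environment are all assumptions.

import psycopg2

UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM my_schema.my_table WHERE load_date = ''{load_date}''')
    TO 's3://my-export-bucket/exports/{load_date}/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
    DELIMITER '|' GZIP ALLOWOVERWRITE;
"""

def run_unload(load_date):
    # Connection details are placeholders; in practice they would come from
    # job parameters or Secrets Manager.
    conn = psycopg2.connect(
        host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
        port=5439, dbname="mydb", user="etl_user", password="***",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(UNLOAD_SQL.format(load_date=load_date))
        conn.commit()
    finally:
        conn.close()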

Tables not found in Spark SQL after migrating from EMR to AWS Glue

Submitted by 耗尽温柔 on 2019-12-25 01:18:40
Question: I have Spark jobs on EMR, and EMR is configured to use the Glue catalog for Hive and Spark metadata. I create Hive external tables, they appear in the Glue catalog, and my Spark jobs can reference them in Spark SQL like spark.sql("select * from hive_table ..."). Now, when I try to run the same code in a Glue job, it fails with a "table not found" error. It looks like Glue jobs do not use the Glue catalog for Spark SQL the same way that Spark SQL running on EMR does. I can work around
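A sketch of one workaround (the database and table names are assumptions): load the table through the Glue catalog API and register it as a temporary view so that spark.sql() can resolve it, instead of relying on the job's Spark session being wired to the Glue Data Catalog as its Hive metastore. Depending on the Glue version, passing the --enable-glue-datacatalog job parameter may also expose the catalog to Spark SQL directly.

# spark and glueContext are the session/context the Glue job already creates.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_database",    # assumed catalog database name
    table_name="hive_table",        # table referenced in the question
)
dyf.toDF().createOrReplaceTempView("hive_table")

result = spark.sql("select * from hive_table")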

Python list of USA holidays between a date range

Submitted by 青春壹個敷衍的年華 on 2019-12-24 18:43:47
Question: I need to fetch the list of holidays in a given range, i.e., if the start date is 20/12/2016 and the end date is 10/1/2017, then I should get 25/12/2016 and 1/1/2017. I can do this using Pandas, but in my case I have the limitation that I need to use the AWS Glue service, and Pandas is not supported in AWS Glue. I am trying to use the native Python library holidays, but I couldn't find API documentation for fetching holidays between a from and a to date. Here is what I have tried:

import holidays
import datetime
from datetime import
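A sketch of one way to do it with the holidays package (assuming the package can be shipped with the Glue job, e.g. via --extra-py-files or --additional-python-modules, depending on the Glue version): build the calendar for every year the range touches, then keep only the dates inside the range.

import datetime
import holidays

start = datetime.date(2016, 12, 20)
end = datetime.date(2017, 1, 10)

# The holidays object is dict-like (date -> holiday name), so iterating it
# yields dates that can be compared against the range boundaries.
us_holidays = holidays.US(years=range(start.year, end.year + 1))
in_range = sorted(d for d in us_holidays if start <= d <= end)

print(in_range)
# Christmas Day and New Year's Day fall inside this window; depending on the
# library version, "observed" dates may be included as well.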

Can a SSE:KMS Key ID be specified when writing to S3 in an AWS Glue Job?

Submitted by 瘦欲@ on 2019-12-24 10:38:13
Question: If you follow the AWS Glue Add Job wizard to create a script that writes Parquet files to S3, you end up with generated code something like this:

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={"path": "s3://my-s3-bucket/datafile.parquet"},
    format="parquet",
    transformation_ctx="datasink4",
)

Is it possible to specify a KMS key so that the data is encrypted in the bucket?

Answer 1: Glue Scala job:

val spark: SparkContext = new
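A PySpark sketch of the same idea as the Scala answer, assuming the EMRFS-style Hadoop properties below are honoured by the job's S3 writer (the key ARN is a placeholder); attaching a Glue security configuration with SSE-KMS to the job is the managed alternative.

# Push the SSE-KMS settings into the Hadoop configuration before writing.
hadoop_conf = glueContext.spark_session.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.enableServerSideEncryption", "true")
hadoop_conf.set("fs.s3.serverSideEncryption.kms.keyId",
                "arn:aws:kms:us-east-1:123456789012:key/your-key-id")

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={"path": "s3://my-s3-bucket/datafile.parquet"},
    format="parquet",
    transformation_ctx="datasink4",
)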

AWS Glue Crawlers and large tables stored in S3

Submitted by 北战南征 on 2019-12-24 10:18:40
Question: I have a general question about AWS Glue and its crawlers. I have some data streaming into S3 buckets, and I use AWS Athena to access it as external tables in Redshift. The tables are partitioned by hour, and Glue crawlers update the partitions and the table structure every hour. The problem is that the crawlers take longer and longer, and some day they will not finish in less than an hour. Is there some setting to speed up this process, or some proper alternative to the crawlers in
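One alternative to re-crawling everything (a sketch; the database, table, partition keys, and S3 layout are assumptions): since each run only adds a new hourly partition, the partition can be registered directly, for example with an Athena ALTER TABLE statement fired from a small scheduled function.

import datetime
import boto3

athena = boto3.client("athena")

def add_hourly_partition(ts: datetime.datetime):
    # Register just the new hour instead of re-crawling the whole table.
    ddl = (
        "ALTER TABLE my_db.my_events ADD IF NOT EXISTS "
        f"PARTITION (year='{ts:%Y}', month='{ts:%m}', day='{ts:%d}', hour='{ts:%H}') "
        f"LOCATION 's3://my-data-bucket/events/{ts:%Y/%m/%d/%H}/'"
    )
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "my_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )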

Terraform does not detect changes to Lambda source files

Submitted by 南楼画角 on 2019-12-24 09:23:50
Question: In my main.tf I have the following:

data "template_file" "lambda_script_temp_file" {
  template = "${file("../../../fn/lambda_script.py")}"
}

data "template_file" "library_temp_file" {
  template = "${file("../../../library.py")}"
}

data "template_file" "init_temp_file" {
  template = "${file("../../../__init__.py")}"
}

data "archive_file" "lambda_resources_zip" {
  type        = "zip"
  output_path = "${path.module}/lambda_resources.zip"
  source {
    content = "${data.template_file.lambda_script_temp_file
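The usual cause here is that the aws_lambda_function resource is not tied to the hash of the regenerated archive, so Terraform sees nothing to update when only the source files change. A sketch of the kind of wiring that addresses it (the function name, handler, runtime, and IAM role reference are assumptions):

resource "aws_lambda_function" "lambda_fn" {
  function_name    = "my-function"
  filename         = "${data.archive_file.lambda_resources_zip.output_path}"
  source_code_hash = "${data.archive_file.lambda_resources_zip.output_base64sha256}"
  handler          = "lambda_script.lambda_handler"
  runtime          = "python3.7"
  role             = "${aws_iam_role.lambda_role.arn}"
}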