aws-glue

Event-based trigger of AWS Glue Crawler after a file is uploaded into an S3 bucket?

Submitted by 浪子不回头ぞ on 2019-12-05 00:41:42
Question: Is it possible to trigger an AWS Glue crawler on new files that get uploaded into an S3 bucket, given that the crawler is "pointed" at that bucket? In other words: a file upload generates an event that causes the AWS Glue crawler to analyse it. I know that there is schedule-based crawling, but I have never found an event-based option.

Answer 1: No, there is currently no direct way to invoke an AWS Glue crawler in response to an upload to an S3 bucket. S3 event notifications can only be sent to SNS, SQS, or Lambda.
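Building on that answer, a common workaround is a small Lambda function subscribed to the bucket's ObjectCreated events that starts the crawler via boto3. A minimal sketch, assuming a crawler named "my-crawler" (a placeholder) and a Lambda role with glue:StartCrawler permission:

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Triggered by an S3 ObjectCreated notification; kick off the crawler.
    try:
        glue.start_crawler(Name="my-crawler")  # placeholder crawler name
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already running; it will pick up the new object anyway.
        pass
    return {"status": "ok"}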

How to move data from Glue to DynamoDB

Submitted by 戏子无情 on 2019-12-05 00:28:53
Question: We are designing a big data solution for one of our dashboard applications and are seriously considering Glue for our initial ETL. Currently Glue supports JDBC and S3 as targets, but our downstream services and components will work better with DynamoDB. We are wondering what the best approach is to eventually move the records from Glue to DynamoDB. Should we write to S3 first and then run Lambdas to insert the data into DynamoDB? Is that the best practice? Or should we use a third-party JDBC …
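For what it's worth, Glue Python jobs can also call boto3 directly, so records can be pushed to DynamoDB from the job itself. A rough sketch only, with the table name and column-to-attribute mapping as assumptions:

import boto3

def write_partition(rows):
    # Runs on each executor; create the resource locally, not on the driver.
    table = boto3.resource("dynamodb").Table("my-target-table")  # placeholder
    with table.batch_writer() as batch:
        for row in rows:
            # The item mapping below is purely illustrative.
            batch.put_item(Item={"pk": row["col1"], "value": row["col2"]})

# df is a Spark DataFrame obtained from the Glue DynamicFrame, e.g. df = dyf.toDF()
df.foreachPartition(write_partition)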

Could we use AWS Glue just to copy a file from one S3 folder to another S3 folder?

Submitted by 别说谁变了你拦得住时间么 on 2019-12-04 20:26:09
I need to copy a zipped file from one AWS S3 folder to another and would like to make that a scheduled AWS Glue job. I cannot find an example for such a simple task. Please help if you know the answer. Maybe the answer is in AWS Lambda or another AWS tool. Thank you very much!

You can do this, and there may be a reason to use AWS Glue: if you have chained Glue jobs and glue_job_#2 is triggered on the successful completion of glue_job_#1. The simple Python script below moves a file from one S3 folder (source) to another folder (target) using the boto3 library, and optionally deletes the …
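The answer's script is truncated above; a minimal sketch of the copy-then-delete pattern it describes, with the bucket and key names as placeholders:

import boto3

s3 = boto3.resource("s3")
bucket = "my-bucket"                 # placeholder
source_key = "source/archive.zip"    # placeholder
target_key = "target/archive.zip"    # placeholder

# Copy the object to the target prefix, then delete the original to emulate a move.
s3.Object(bucket, target_key).copy_from(CopySource={"Bucket": bucket, "Key": source_key})
s3.Object(bucket, source_key).delete()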

How to configure Glue bookmarks to work with Scala code?

Submitted by 微笑、不失礼 on 2019-12-04 20:00:05
Consider the following Scala code:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters.mapAsJavaMapConverter

object MyGlueJob {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = SparkContext.getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    val input = glueContext
      .getCatalogSource(database = "my_data_base", …
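The excerpt is cut off, but the bookmark mechanics it needs are the same in both the Scala and Python APIs: give the source a transformation context, commit the job at the end, and enable the job bookmark option on the job. A rough PySpark sketch of that pattern (database and table names are placeholders):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_data_base",        # placeholder
    table_name="my_table",          # placeholder
    transformation_ctx="dyf",       # needed so the bookmark can track this source
)
# ... transformations and writes go here ...
job.commit()                        # persists the bookmark state for the next run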

AWS Athena concurrency limits: number of submitted queries vs. number of running queries

Submitted by 。_饼干妹妹 on 2019-12-04 19:28:22
According to the AWS Athena limits, you can submit up to 20 queries of the same type at a time, but this is a soft limit and can be increased on request. I use boto3 to interact with Athena, and my script submits 16 CTAS queries, each of which takes about 2 minutes to finish. In my AWS account, I am the only one using the Athena service. However, when I look at the state of the queries through the console, I see that only a few of them (5 on average) are actually being executed, despite all of them being in the Running state. Here is what I would normally see in the Athena history tab: I understand that, after I …
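For reference, a sketch (not from the original post) of how such queries are typically submitted and their state polled with boto3; the query, database, and output location are placeholders:

import time
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT 1",  # stand-in for a CTAS statement
    QueryExecutionContext={"Database": "my_database"},                        # placeholder
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # placeholder
)
query_id = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
print(query_id, state)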

Adding a timestamp column when importing data into Redshift using an AWS Glue job

Submitted by 感情迁移 on 2019-12-04 18:39:18
I would like to know if it is possible to add a timestamp column to a table when it is loaded by an AWS Glue job.

First scenario:

Column A | Column B | TimeStamp
A | 2 | 2018-06-03 23:59:00.0

When a crawler updates the table in the Data Catalog and the job runs again, the new data is added to the table with a new timestamp:

Column A | Column B | TimeStamp
A | 4 | 2018-06-04 05:01:31.0
B | 8 | 2018-06-04 06:02:31.0

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue …
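The job script above is also truncated; one hedged way to add such a column before the Redshift write, assuming a DynamicFrame named datasource0 and a GlueContext named glueContext, is:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import current_timestamp

# Add a load timestamp to every row, then convert back for the Glue writer.
df = datasource0.toDF().withColumn("TimeStamp", current_timestamp())
dyf_with_ts = DynamicFrame.fromDF(df, glueContext, "dyf_with_ts")
# dyf_with_ts can then be written to Redshift with glueContext.write_dynamic_frame.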

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-04 11:16:23
I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use the AWS Glue Data Catalog as its metastore. I have followed the steps in the official AWS documentation (link below), but I am seeing some discrepancies when accessing the Glue Catalog databases/tables. Both the EMR cluster and AWS Glue are in the same account, and the appropriate IAM permissions have been granted.

AWS documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html

Observations:
- Using spark-shell (from the EMR master node): works; able to access the Glue DB …
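As background, the linked documentation wires Spark to the Glue catalog through a Hive metastore client factory class. As a rough illustration only (not the poster's exact setup), a PySpark session configured that way might look like:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("glue-catalog-check")
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()  # should list the Glue Data Catalog databases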

How to ignore Amazon Athena struct order

Submitted by  ̄綄美尐妖づ on 2019-12-04 06:29:00
Question: I'm getting a HIVE_PARTITION_SCHEMA_MISMATCH error that I'm not quite sure what to do about. When I look at the two different schemas, the only thing that differs is the order of the keys in one of my structs (created by a Glue crawler). I really don't care about the order of the data, and I'm receiving the data as a JSON blob, so I cannot guarantee the order of the keys.

struct<device_id:string,user_id:string,payload:array<struct<channel:string,sensor_id:string,type:string,unit:string …
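One mitigation often suggested for this kind of mismatch (not taken from this thread) is configuring the crawler so partitions inherit their schema metadata from the table. A boto3 sketch, with the crawler name as a placeholder:

import json
import boto3

glue = boto3.client("glue")
glue.update_crawler(
    Name="my-crawler",  # placeholder
    Configuration=json.dumps({
        "Version": 1.0,
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }),
)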

How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column?

Submitted by ℡╲_俬逩灬. on 2019-12-04 05:08:13
Question: How do I add a current timestamp (extra column) in the Glue job so that the output data has an extra column? In this case, the source table schema is Col1, Col2; after the Glue job, the destination schema should be Col1, Col2, Update_Date (current timestamp).

Answer 1: I'm not sure if there's a Glue-native way to do this with the DynamicFrame, but you can easily convert to a Spark DataFrame and then use the withColumn method. You will need to use the lit function to put literal values into a new column, as below. from …
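The answer's code is cut off above; a minimal sketch of that idea, assuming a DynamicFrame named datasource0 and a GlueContext named glueContext:

from datetime import datetime
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import lit

# Convert to a DataFrame, add the literal timestamp, and convert back.
df = datasource0.toDF().withColumn("Update_Date", lit(datetime.now()))
output = DynamicFrame.fromDF(df, glueContext, "output")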

Overwrite Parquet files from a DynamicFrame in AWS Glue

Submitted by 人盡茶涼 on 2019-12-04 04:41:27
I use dynamic frames to write a Parquet file to S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:

glueContext.write_dynamic_frame.from_options(
    frame = table,
    connection_type = "s3",
    connection_options = {"path": output_dir, "partitionKeys": ["var1", "var2"]},
    format = "parquet")

Is there anything like "mode": "overwrite" that replaces my Parquet files?

Currently AWS Glue doesn't support an 'overwrite' mode, but they are working on this feature. As a workaround, you can convert the DynamicFrame object to a Spark DataFrame and write it using …
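A sketch of that workaround, reusing the table, output_dir, and partition column names from the question:

df = table.toDF()
(
    df.write
    .mode("overwrite")                # replaces existing output instead of appending
    .partitionBy("var1", "var2")
    .parquet(output_dir)
)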