aws-glue

AWS Glue Custom Classifiers Json Path

北城以北 submitted on 2019-12-04 03:33:15
Question: I have a set of JSON data files that look like this: [ {"client":"toys", "filename":"toy1.csv", "file_row_number":1, "secondary_db_index":"4050", "processed_timestamp":1535004075, "processed_datetime":"2018-08-23T06:01:15+0000", "entity_id":"4050", "entity_name":"4050", "is_emailable":false, "is_txtable":false, "is_loadable":false} ] I have created a Glue crawler with the following custom classifier JSON path: $[*]. Glue returns the correct schema, with the columns correctly identified. However, …
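For reference, this kind of classifier can also be created programmatically. A minimal sketch using boto3 (the classifier name is a placeholder, not from the original question):

import boto3

glue = boto3.client("glue")

# Custom JSON classifier that treats each element of the top-level
# array as its own record, matching the $[*] JSON path above.
glue.create_classifier(
    JsonClassifier={
        "Name": "json-array-classifier",  # hypothetical name
        "JsonPath": "$[*]",
    }
)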

AWS Glue Crawler Not Creating Table

谁都会走 submitted on 2019-12-03 22:54:17
I have a crawler I created in AWS Glue that does not create a table in the Data Catalog after it successfully completes. The crawler takes roughly 20 seconds to run and the logs show it completed successfully. The CloudWatch log shows: Benchmark: Running Start Crawl for Crawler; Benchmark: Classification Complete, writing results to DB; Benchmark: Finished writing to Catalog; Benchmark: Crawler has finished running and is in ready state. I am at a loss as to why the tables in the Data Catalog are not being created, and the AWS docs are not of much help for debugging. Answer: Check the IAM role associated with the crawler …
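Since the advice is to check the crawler's IAM role, here is a quick way to inspect it with boto3 (the crawler name is a placeholder); the role generally needs the AWSGlueServiceRole managed policy plus read access to the crawled S3 path:

import boto3

glue = boto3.client("glue")
iam = boto3.client("iam")

# Look up which role the crawler runs as, then list its attached policies.
role = glue.get_crawler(Name="my-crawler")["Crawler"]["Role"]
role_name = role.split("/")[-1]  # handle either a bare name or an ARN

for policy in iam.list_attached_role_policies(RoleName=role_name)["AttachedPolicies"]:
    print(policy["PolicyArn"])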

AWS Glue: How to handle nested JSON with varying schemas

跟風遠走 submitted on 2019-12-03 18:44:04
Question: Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum. Background: The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an …
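One common approach to deeply nested JSON in Glue (offered here as a sketch, not necessarily what this asker settled on) is the Relationalize transform, which flattens nested structures into a collection of flat tables. Database, table, and bucket names below are placeholders:

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the crawled table (hypothetical catalog names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="streams_db", table_name="dynamodb_stream_events"
)

# Flatten nested fields; each nested array becomes its own frame.
flattened = Relationalize.apply(
    frame=dyf, staging_path="s3://my-bucket/tmp/", name="root"
)
print(flattened.keys())  # the names of the resulting flat tables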

How to run glue script from Glue Dev Endpoint

青春壹個敷衍的年華 submitted on 2019-12-03 18:05:35
Question: I have a Glue script (test.py) written, say, in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint (or I can store it in an S3 bucket). Since a Glue endpoint is basically an EMR cluster, how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it? I know we can run it from the Glue console, but I'm more interested in whether I can run it from the Glue endpoint terminal. Answer 1: You don't need a notebook; you can ssh to the dev endpoint and run it with the …
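For context, a script run this way needs to bootstrap its own GlueContext rather than rely on the console's job wrapper. A minimal, hedged test.py skeleton (the S3 path is a placeholder):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# Minimal Glue script that can be submitted from a dev endpoint once the
# awsglue libraries are on the path (e.g. via spark-submit).
glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="json",
)
print(dyf.count())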

AWS Glue: crawler misinterprets timestamps as strings. GLUE ETL meant to convert strings to timestamps makes them NULL

一曲冷凌霜 submitted on 2019-12-03 16:55:18
I have been playing around with AWS Glue for some quick analytics by following the tutorial here. While I have been able to successfully create crawlers and discover data in Athena, I've had issues with the data types created by the crawler: the date and timestamp data types get read as string data types. I followed this up by creating an ETL job in Glue using the data source created by the crawler as the input and a target table in Amazon S3. As part of the mapping transformation, I converted the data types of the date and timestamp from string to timestamp, but unfortunately the ETL converted …
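A workaround often suggested for this symptom (an assumption here, not confirmed as this asker's fix) is to do the conversion in Spark with an explicit format string instead of relying on ApplyMapping's string-to-timestamp cast, since a cast that cannot parse the value silently yields NULL. Catalog and column names are placeholders:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import to_timestamp

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="events"  # hypothetical names
)

df = dyf.toDF()
# Tell Spark the exact pattern; a mismatched implicit cast is what
# produces the NULL timestamps.
df = df.withColumn("event_ts", to_timestamp("event_ts", "yyyy-MM-dd HH:mm:ss"))

converted = DynamicFrame.fromDF(df, glue_context, "converted")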

Can I test AWS Glue code locally?

隐身守侯 submitted on 2019-12-03 08:50:19
Question: After reading the Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code consists of multiple files and packages, everything except the main script needs to be zipped. All this gives me the feeling that Glue is not suitable for any complex ETL task, as development and testing are cumbersome. I could test my Spark code locally without having to upload the code to S3 every time, and …
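One way to ease this (a suggestion under stated assumptions, not from the original thread) is to keep the transformation logic in plain PySpark functions so they can be unit-tested with a local SparkSession, leaving only a thin Glue-specific entry point for deployment:

from pyspark.sql import DataFrame, SparkSession

def add_revenue(df: DataFrame) -> DataFrame:
    # Pure Spark transformation: testable without any Glue machinery.
    return df.withColumn("revenue", df["price"] * df["quantity"])

if __name__ == "__main__":
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    sample = spark.createDataFrame([(2.0, 3)], ["price", "quantity"])
    assert add_revenue(sample).first()["revenue"] == 6.0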

AWS Glue to Redshift: Is it possible to replace, update or delete data?

佐手、 submitted on 2019-12-03 05:45:49
Question: Here are some bullet points in terms of how I have things set up: I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema. I have a Glue job set up that writes the data from the Glue table to our Amazon Redshift database using a JDBC connection; the job is also in charge of mapping the columns and creating the Redshift table. By re-running the job, I am getting duplicate rows in Redshift (as expected). However, is there a way to replace or delete rows before …
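One documented option for this pattern is Redshift's preactions/postactions in the JDBC connection options, which run SQL before/after the COPY that Glue issues. A hedged sketch (connection, database, table, and S3 names are placeholders):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="orders"  # hypothetical names
)

# "preactions" runs before the COPY into Redshift, so the rows being
# reloaded can be deleted first instead of piling up as duplicates.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="redshift-connection",
    connection_options={
        "database": "analytics",
        "dbtable": "public.orders",
        "preactions": "DELETE FROM public.orders WHERE load_date = '2019-12-01';",
    },
    redshift_tmp_dir="s3://my-bucket/redshift-tmp/",
)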

Can we use java for ETL in AWS Glue?

依然范特西╮ submitted on 2019-12-02 12:26:56
Can we use Java for ETL in AWS Glue? It seems like there are only two options for Glue ETL programming, i.e. Python and Scala. Answer: No. From the AWS Glue FAQ: "Q: What programming language can I use to write my ETL code for AWS Glue? You can use either Scala or Python." Source: https://stackoverflow.com/questions/52990462/can-we-use-java-for-etl-in-aws-glue

How to ignore amazon athena struct order

荒凉一梦 submitted on 2019-12-02 10:14:18
I'm getting a HIVE_PARTITION_SCHEMA_MISMATCH error that I'm not quite sure what to do about. When I look at the two different schemas, the only thing that differs is the order of the keys in one of my structs (created by a Glue crawler). I really don't care about the order of the data, and I'm receiving the data as a JSON blob, so I cannot guarantee the order of the keys. struct<device_id:string,user_id:string,payload:array<struct<channel:string,sensor_id:string,type:string,unit:string,value:double,name:string>>,topic:string,channel:string,client_id:string,hardware_id:string,timestamp …
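A commonly cited remedy (stated as an assumption about this case) is to configure the crawler so that new and existing partitions inherit the table's schema, which sidesteps per-partition differences such as struct key order. The crawler name below is a placeholder:

import boto3

glue = boto3.client("glue")

# Make partitions inherit their schema from the table instead of
# keeping their own, possibly reordered, struct definitions.
glue.update_crawler(
    Name="my-crawler",
    Configuration=(
        '{"Version":1.0,'
        '"CrawlerOutput":{"Partitions":{"AddOrUpdateBehavior":"InheritFromTable"}}}'
    ),
)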

Upsert from AWS Glue to Amazon Redshift

Deadly submitted on 2019-12-02 07:17:28
I understand that there is no direct UPSERT query one can perform from Glue to Redshift. Is it possible to implement the staging-table concept within the Glue script itself? My expectation is: create the staging table, merge it with the destination table, and finally delete it. Can this be achieved within the Glue script? Answer: Yes, it is totally achievable. All you need is to import the pg8000 module into your Glue job. pg8000 is a Python library used to make connections to Amazon Redshift and execute SQL queries through a cursor. Python Module Reference: https:/ …
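A hedged sketch of that staging-table merge with pg8000 (all connection details, table names, and the id key are placeholders):

import pg8000

conn = pg8000.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, database="analytics", user="etl_user", password="***",
)
cur = conn.cursor()

# Stage the incoming rows, delete matching rows from the target,
# insert the staged rows, then drop the staging table, all in one transaction.
cur.execute("CREATE TEMP TABLE stage (LIKE public.orders);")
cur.execute("INSERT INTO stage SELECT * FROM public.orders_incoming;")
cur.execute("DELETE FROM public.orders USING stage WHERE public.orders.id = stage.id;")
cur.execute("INSERT INTO public.orders SELECT * FROM stage;")
cur.execute("DROP TABLE stage;")
conn.commit()
conn.close()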